Jean-Daniel Cryans has posted comments on this change.

Change subject: Spark connectors for Kudu
......................................................................


Patch Set 1:

(27 comments)

http://gerrit.cloudera.org:8080/#/c/1788/1/java/kudu-spark/pom.xml
File java/kudu-spark/pom.xml:

Line 108:                     <groupId>org.jruby</groupId>
> I find it surprising you have to have these exclusions, because AFAIK hadoo
Very likely, I'll remove. Better to start clean.


Line 214:                                 -Xmx1536m -XX:MaxPermSize=512m 
-XX:ReservedCodeCacheSize=512m
> curious where this stuff got cargo-culted from :)
Works without them, so I'll remove.


http://gerrit.cloudera.org:8080/#/c/1788/1/java/kudu-spark/src/main/scala/org/kududb/spark/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/kududb/spark/DefaultSource.scala:

Line 40:    * Is given input from SparkSQL to construct a BaseRelation.
> style: not really proper javadoc
Done


Line 50:       new Throwable("Invalid value for " + TABLE_KEY +" '" + tableName 
+ "'")
> hrm... I dont know Scala really, but this seems to be a no-op.. you constru
Done


Line 66:                     val kuduMaster: String) (
> inconsistent style: spaces or not after the ':'?
Sorry, should have run a reformat of everything before posting.


Line 70:   // Create or get latest KuduContext.
> what's "latest"?
That comment is wrong; it's about the kuduClient, not the context. I'll just 
remove it, since we're not doing anything complicated here.


Line 82:       kuduSchemaColumnMap = buildKuduSchemaColumnMap(kuduSchema)
> not really following this block - isn't this duplicate with the various cla
Took me a while to understand. Since those are transient vars, they won't be 
re-set after deserialization. Using lazy vals fixes the issue and I can get rid 
of the extra code.
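
To illustrate the point with a self-contained sketch (the class and field names here are hypothetical, not the ones in the patch): a `@transient var` comes back null after a serialization round trip, while a `@transient lazy val` is simply recomputed on first access after deserialization.

```scala
import java.io._

// Minimal demo: the @transient var is lost on deserialization,
// the @transient lazy val is rebuilt lazily from serialized state.
class Holder(val base: Int) extends Serializable {
  @transient var cached: Integer = base * 2        // null after round trip
  @transient lazy val lazyCached: Int = base * 2   // recomputed on access
}

def roundTrip[T <: Serializable](obj: T): T = {
  val out = new ByteArrayOutputStream()
  new ObjectOutputStream(out).writeObject(obj)
  new ObjectInputStream(new ByteArrayInputStream(out.toByteArray))
    .readObject().asInstanceOf[T]
}

val copy = roundTrip(new Holder(21))
println(copy.cached)      // null: the var did not survive serialization
println(copy.lazyCached)  // 42: recomputed from base after deserialization
```

This is the same idiom Spark itself relies on for fields that can't or shouldn't be shipped to executors.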


Line 103:    * @return schema generated from the SCHEMA_COLUMNS_MAPPING_KEY 
value
> this is hbase cruft
Done


Line 111:     val columnIt = kuduSchema.getColumns.iterator()
> why explicitly use an iterator instead of a more normal foreach loop? or ma
I don't know much about Scala, but zipWithIndex looks fun. The way you wrote it 
zips a list of lists; this works well instead:
kuduSchema.getColumns.asScala.zipWithIndex foreach { e =>
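
As a self-contained sketch of the same pattern (the `columns` list here is a hypothetical stand-in for `kuduSchema.getColumns`, which returns a Java list):

```scala
import scala.collection.JavaConverters._

// Stand-in for kuduSchema.getColumns: a Java list of column names.
val columns: java.util.List[String] =
  java.util.Arrays.asList("key", "name", "age")

// zipWithIndex pairs each element with its ordinal,
// replacing the manual indexCounter var.
val columnIndexMap: Map[String, Int] =
  columns.asScala.zipWithIndex.map { case (name, idx) => name -> idx }.toMap

println(columnIndexMap("age"))  // 2
```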


Line 112:     var indexCounter = 0
> 'indexCounter' is unnecessarily verbose. 'colIdx' perhaps
Done


Line 116:       val columnSparkSqlType = if (c.getType.equals(Type.BOOL)) 
BooleanType
> seems like we should have a static final hashmap which maps from kudu types
Done
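
A sketch of what that static mapping could look like, assuming the `org.kududb.Type` enum and Spark SQL's `DataType` hierarchy as they exist in this patch set (the object name is made up, and the exact set of supported types should match whatever the connector handles):

```scala
import org.kududb.Type
import org.apache.spark.sql.types._

// Hypothetical sketch: one static lookup table instead of an if/else chain.
object KuduTypeMapping {
  val kuduToSparkSql: Map[Type, DataType] = Map(
    Type.BOOL   -> BooleanType,
    Type.INT8   -> ByteType,
    Type.INT16  -> ShortType,
    Type.INT32  -> IntegerType,
    Type.INT64  -> LongType,
    Type.FLOAT  -> FloatType,
    Type.DOUBLE -> DoubleType,
    Type.STRING -> StringType,
    Type.BINARY -> BinaryType
  )

  // Fails loudly for anything not in the table, per the review comment.
  def sparkSqlType(t: Type): DataType =
    kuduToSparkSql.getOrElse(t,
      throw new IllegalArgumentException("Unsupported column type: " + t))
}
```

This also gives a single place to consult for the nullable/getter logic discussed below.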


Line 125:       else throw new Throwable("Unsupported column type :" + 
c.getType)
> nit: should be "Unsupported column type: " (misplaced space and colon)
Done


Line 129:         new StructField(c.getName, columnSparkSqlType, nullable = 
true, metadata)
> shouldn't we pass down nullability from the kudu column schema?
Indeed


Line 136:     result
> is this common scala style vs just ending with 'new StructType(structFieldA
I actually don't know; it doesn't seem like a big deal though. Will do the 
shorter version.


Line 141:    * Here is where we will do the following:
> weird style. "here we are" isn't normal javadoc
Done


Line 148:    * @return                RDD will all the results from HBase 
needed for SparkSQL to
> Kudu, not HBase
Done


Line 155:     if (resultRDD == null) {
> what's the point of this if? you just set it to null
An artefact from SparkOnKudu; I'll remove it.


Line 163:       })
> I think the above 5 lines could be something like:
woah that's cool


Line 168:         Row.fromSeq(requiredColumns.map(c => getKuduValue(c, 
rowResults)))
> perf-wise, it would be better to do a mapping above from column names to co
Done
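
A self-contained sketch of the idea (column and row values are hypothetical): resolve the required column names to indices once, outside the per-row loop, rather than doing a name lookup for every row.

```scala
// Hypothetical schema and projection.
val schemaColumns = Vector("key", "name", "age")
val requiredColumns = Array("age", "key")

// One-time resolution of names to ordinals.
val colIdx: Map[String, Int] = schemaColumns.zipWithIndex.toMap
val requiredIdx: Array[Int] = requiredColumns.map(colIdx)

// Per row, values are then fetched by index, not by name.
val row = Vector("1", "alice", "34")
val projected = requiredIdx.map(row)

println(projected.toList)  // List(34, 1)
```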


Line 179:     val columnType = columnSchema.getType
> this is fishy because above you use getOrElse(, null) and here you assume i
Done


Line 190:     else if (columnType == Type.STRING) row.getString(columnName)
> worht an else and throw an error. also again a static final mapping would b
Done


http://gerrit.cloudera.org:8080/#/c/1788/1/java/kudu-spark/src/main/scala/org/kududb/spark/KuduContext.scala
File java/kudu-spark/src/main/scala/org/kududb/spark/KuduContext.scala:

Line 129:   def kuduRDD(tableName: String, columnProjection: String = null):
> is this a public API? needs some doc. taking the string version of columnPr
Done


Line 133:     conf.set("kudu.mapreduce.master.address", kuduMaster)
> would be good to use public APIs here for setters instead of hard-coding th
They require a job instead of a config, oops. Regardless, we want to ditch the 
Hadoop RDD as soon as yesterday.


Line 142:     rdd
> seems like you might want the kuduRDD to not be key/value pairs, but rather
omg yes


Line 147:    *  Underlining wrapper all foreach functions in KuduContext.
> underlying?
Done


http://gerrit.cloudera.org:8080/#/c/1788/1/java/kudu-spark/src/main/scala/org/kududb/spark/KuduDStreamFunctions.scala
File java/kudu-spark/src/main/scala/org/kududb/spark/KuduDStreamFunctions.scala:

Line 28: object KuduDStreamFunctions {
> maybe we should delay the Streaming stuff to a second commit?
It's not like it's adding tons of code. What's your concern?


http://gerrit.cloudera.org:8080/#/c/1788/1/java/kudu-spark/src/test/scala/org/kududb/spark/KuduContextSuite.scala
File java/kudu-spark/src/test/scala/org/kududb/spark/KuduContextSuite.scala:

Line 80:     val scanList = scanRdd.map(r => r._2.getInt(0)).collect()
> yea, here is where it becomes obvious that it's goofy that we return (null,
Yup, fixed.


-- 
To view, visit http://gerrit.cloudera.org:8080/1788
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic187513ef9724d50024f7401d7ecd19d53554245
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
