Jean-Daniel Cryans has posted comments on this change.

Change subject: Spark connectors for Kudu
......................................................................
Patch Set 1:

(27 comments)

http://gerrit.cloudera.org:8080/#/c/1788/1/java/kudu-spark/pom.xml
File java/kudu-spark/pom.xml:

Line 108: <groupId>org.jruby</groupId>
> I find it surprising you have to have these exclusions, because AFAIK hadoo
Very likely, I'll remove. Better to start clean.

Line 214: -Xmx1536m -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m
> curious where this stuff got cargo-culted from :)
Works without them, so I'll remove.

http://gerrit.cloudera.org:8080/#/c/1788/1/java/kudu-spark/src/main/scala/org/kududb/spark/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/kududb/spark/DefaultSource.scala:

Line 40: * Is given input from SparkSQL to construct a BaseRelation.
> style: not really proper javadoc
Done

Line 50: new Throwable("Invalid value for " + TABLE_KEY +" '" + tableName + "'")
> hrm... I dont know Scala really, but this seems to be a no-op.. you constru
Done

Line 66: val kuduMaster: String) (
> inconsistent style: spaces or not after the ':'?
Sorry, I should have run a reformat of everything before posting.

Line 70: // Create or get latest KuduContext.
> what's "latest"?
That comment is wrong; it's about the kuduClient, not the context. I'll just remove it, we're not doing anything complicated here.

Line 82: kuduSchemaColumnMap = buildKuduSchemaColumnMap(kuduSchema)
> not really following this block - isn't this duplicate with the various cla
Took me a while to understand. Since those are transient vars, they won't be re-set after deserialization. Using lazy vals fixes the issue and I can get rid of the extra code.

Line 103: * @return schema generated from the SCHEMA_COLUMNS_MAPPING_KEY value
> this is hbase cruft
Done

Line 111: val columnIt = kuduSchema.getColumns.iterator()
> why explicitly use an iterator instead of a more normal foreach loop? or ma
I don't know much about Scala but zipWithIndex looks fun.
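As an aside on the Line 82 exchange above: a minimal, self-contained sketch of the pattern being described there, where a @transient var silently comes back null after deserialization while a @transient lazy val is simply recomputed on first access. The Relation/SerDemo names and string "schema" are illustrative stand-ins, not code from the patch.

```scala
import java.io._

// Illustrative class: when Spark ships an object to an executor, any
// @transient field is not serialized. A plain var is then null on the
// remote side; a lazy val is rebuilt on first use instead.
class Relation(val tableName: String) extends Serializable {
  // Initialized in the constructor, but null after deserialization.
  @transient var schemaVar: String = "schema-for-" + tableName

  // Also not serialized, but recomputed on first access after deserialization.
  @transient lazy val schemaVal: String = "schema-for-" + tableName
}

object SerDemo {
  // Round-trip an object through Java serialization, as Spark would.
  def roundTrip[T <: Serializable](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    in.readObject().asInstanceOf[T]
  }
}
```

After `SerDemo.roundTrip(new Relation("t"))`, `schemaVar` is null while `schemaVal` still yields its value, which is why switching the transient fields to lazy vals removes the need for the re-initialization block.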
The way you wrote it makes it zip a list of lists; instead this works well:

  kuduSchema.getColumns.asScala.zipWithIndex foreach { e =>

Line 112: var indexCounter = 0
> 'indexCounter' is unnecessarily verbose. 'colIdx' perhaps
Done

Line 116: val columnSparkSqlType = if (c.getType.equals(Type.BOOL)) BooleanType
> seems like we should have a static final hashmap which maps from kudu types
Done

Line 125: else throw new Throwable("Unsupported column type :" + c.getType)
> nit: should be "Unsupported column type: " (misplaced space and colon)
Done

Line 129: new StructField(c.getName, columnSparkSqlType, nullable = true, metadata)
> shouldn't we pass down nullability from the kudu column schema?
Indeed

Line 136: result
> is this common scala style vs just ending with 'new StructType(structFieldA
I actually don't know, doesn't seem like a big deal though. Will do the shorter version.

Line 141: * Here is where we will do the following:
> weird style. "here we are" isn't normal javadoc
Done

Line 148: * @return RDD will all the results from HBase needed for SparkSQL to
> Kudu, not HBase
Done

Line 155: if (resultRDD == null) {
> what's the point of this if? you just set it to null
Artefact from SparkOnKudu.

Line 163: })
> I think the above 5 lines could be something like:
woah that's cool

Line 168: Row.fromSeq(requiredColumns.map(c => getKuduValue(c, rowResults)))
> perf-wise, it would be better to do a mapping above from column names to co
Done

Line 179: val columnType = columnSchema.getType
> this is fishy because above you use getOrElse(, null) and here you assume i
Done

Line 190: else if (columnType == Type.STRING) row.getString(columnName)
> worth an else and throw an error.
> also again a static final mapping would b
Done

http://gerrit.cloudera.org:8080/#/c/1788/1/java/kudu-spark/src/main/scala/org/kududb/spark/KuduContext.scala
File java/kudu-spark/src/main/scala/org/kududb/spark/KuduContext.scala:

Line 129: def kuduRDD(tableName: String, columnProjection: String = null):
> is this a public API? needs some doc. taking the string version of columnPr
Done

Line 133: conf.set("kudu.mapreduce.master.address", kuduMaster)
> would be good to use public APIs here for setters instead of hard-coding th
They require a job instead of a config, oops. Regardless, we want to ditch the Hadoop RDD as soon as yesterday.

Line 142: rdd
> seems like you might want the kuduRDD to not be key/value pairs, but rather
omg yes

Line 147: * Underlining wrapper all foreach functions in KuduContext.
> underlying?
Done

http://gerrit.cloudera.org:8080/#/c/1788/1/java/kudu-spark/src/main/scala/org/kududb/spark/KuduDStreamFunctions.scala
File java/kudu-spark/src/main/scala/org/kududb/spark/KuduDStreamFunctions.scala:

Line 28: object KuduDStreamFunctions {
> maybe we should delay the Streaming stuff to a second commit?
It's not like it's adding tons of code. What's your concern?

http://gerrit.cloudera.org:8080/#/c/1788/1/java/kudu-spark/src/test/scala/org/kududb/spark/KuduContextSuite.scala
File java/kudu-spark/src/test/scala/org/kududb/spark/KuduContextSuite.scala:

Line 80: val scanList = scanRdd.map(r => r._2.getInt(0)).collect()
> yea, here is where it becomes obvious that it's goofy that we return (null,
Yup, fixed.
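Pulling together a few of the DefaultSource.scala threads above (the static type map from Line 116, zipWithIndex from Line 111, and nullability pass-through from Line 129), here is a rough, self-contained sketch of the shape the schema-building code could take. The KuduType, Column, and describe names below are stand-ins so the sketch compiles on its own; the real code would use org.kududb.Type, org.kududb.ColumnSchema, and Spark SQL's DataType/StructField hierarchy.

```scala
object SchemaSketch {
  // Stand-in for org.kududb.Type.
  sealed trait KuduType
  case object BOOL extends KuduType
  case object INT32 extends KuduType
  case object STRING extends KuduType

  // Stand-in for org.kududb.ColumnSchema (getName, getType, isNullable).
  case class Column(name: String, kuduType: KuduType, nullable: Boolean)

  // The "static final hashmap" the review asks for: one lookup replaces
  // the chain of if/else branches. Values stand in for Spark DataTypes.
  val typeMap: Map[KuduType, String] = Map(
    BOOL   -> "BooleanType",
    INT32  -> "IntegerType",
    STRING -> "StringType")

  // zipWithIndex yields (column, index) pairs, replacing the manual
  // indexCounter var; unknown types fail fast instead of falling through.
  def describe(columns: Seq[Column]): Seq[String] =
    columns.zipWithIndex.map { case (c, idx) =>
      val sparkType = typeMap.getOrElse(c.kuduType,
        throw new IllegalArgumentException("Unsupported column type: " + c.kuduType))
      s"$idx: ${c.name} $sparkType nullable=${c.nullable}"
    }
}
```

In the real relation, the map body would produce StructField(c.getName, typeMap(c.getType), c.isNullable) entries fed into a StructType, which also addresses the nullable = true hard-coding flagged at Line 129.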
--
To view, visit http://gerrit.cloudera.org:8080/1788
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic187513ef9724d50024f7401d7ecd19d53554245
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
