Anonymous Coward #125 has posted comments on this change. Change subject: Spark connectors for Kudu ......................................................................
Patch Set 6: (10 comments) Mostly I just found nits -- my one bigger question is whether we should provide provide implicit extensions directly on RDDs (or maybe even on Dataframes) as was done here https://github.com/tmalaska/SparkOnKudu/blob/master/src/main/scala/org/kududb/spark/KuduRDDFunctions.scala (obviously we should remove the hbase references). Even with implicits, I think we might be able to clean up the api even further, but I need to play with it a little bit. Similarly, I think you could use implicits to add more support for kudu & dataframes (again I need to take a closer look). Also seems like predicate pushdown is missing? I dunno how important this stuff is for v1, or if you'd rather release for now and add that stuff in the next release. It might be worth at least playing with what those apis would look like, to make sure that what is here isn't incompatible. imran http://gerrit.cloudera.org:8080/#/c/1788/6/java/kudu-spark/src/main/scala/org/kududb/spark/DefaultSource.scala File java/kudu-spark/src/main/scala/org/kududb/spark/DefaultSource.scala: Line 20: import org.apache.spark.Logging the Logging class isn't meant to a public part of spark (I think it was leaked accidentally, and now just has a comment to that effect). In any case, looks like you aren't using it here at all. Line 41: val KUDU_MASTER: String = "kudu.master" super nit, but its unusual to put in a type on simple declarations like this (certainly doesn't hurt, though) Line 93: @transient lazy val kuduSchema = kuduTable.getSchema looks like you only need the columns, so maybe better to use ``` kuduColumns = kuduTable.getScema.getColumns.asScala ``` Line 95: kuduSchema.getColumns.asScala.map(c => (c.getName, c)).toMap here the types aren't as obvious for kuduTable, kuduSchema, and kuduSchemaColumnMap, so might be nice to add the type annotations. (Maybe the types are obvious to anybody that knows kudu, in which case ignore this.) http://gerrit.cloudera.org:8080/#/c/1788/6/java/kudu-spark/src/main/scala/org/kududb/spark/KuduContext.scala File java/kudu-spark/src/main/scala/org/kududb/spark/KuduContext.scala: Line 24: import org.apache.spark.{Logging, SparkContext} same here on Logging Line 38: class KuduContext(sc: SparkContext, kuduMaster: String) extends Serializable with Logging { We probably want to expose all of these functions via implicits, so its more like the existing RDD apis. eg., maybe we want to enable something like: ``` implicit val kuducontext = ... rdd.kuduForEachPartition(f) ``` there is nothing wrong with the methods you have here (in fact, we'll probably want to keep them even if we add the others). But adding methods like that will probably be nicer for users. It could always be added in v2 if this needs to go out the door. I need to think a bit more about exactly how the apis should look and play around with some alternatives. Line 139: rdd.values it would be nice if this rdd had its name set to something meaningful, eg. kudu:<tableName>:<columnProjection> for when users see it in the UI. http://gerrit.cloudera.org:8080/#/c/1788/6/java/kudu-spark/src/test/scala/org/kududb/spark/DefaultSourceSuite.scala File java/kudu-spark/src/test/scala/org/kududb/spark/DefaultSourceSuite.scala: Line 19: import org.apache.spark.Logging no need for Logging http://gerrit.cloudera.org:8080/#/c/1788/6/java/kudu-spark/src/test/scala/org/kududb/spark/TestContext.scala File java/kudu-spark/src/test/scala/org/kududb/spark/TestContext.scala: Line 53: sc = new SparkContext("local", "test", null, Nil, envMap) not a big deal, but I'd suggest at least using "local[2]" in your test cases, so spark is at least running two threads. Probably more important for tests of spark itself, but you might as well do that. Line 66: sc.stop() nit: should all of these be inside a "if (sc != null)" etc. check, in case something goes wrong before the spark context is even created? -- To view, visit http://gerrit.cloudera.org:8080/1788 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ic187513ef9724d50024f7401d7ecd19d53554245 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Jean-Daniel Cryans Gerrit-Reviewer: Anonymous Coward #125 Gerrit-Reviewer: Dan Burkert <[email protected]> Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
