[kudu-CR] Spark connectors for Kudu

Anonymous Coward (Code Review) Fri, 29 Jan 2016 08:36:26 -0800

Anonymous Coward #125 has posted comments on this change.

Change subject: Spark connectors for Kudu
......................................................................



Patch Set 6:

(10 comments)

Mostly I just found nits -- my one bigger question is whether we should provide 
provide implicit extensions directly on RDDs (or maybe even on Dataframes) as 
was done here 
https://github.com/tmalaska/SparkOnKudu/blob/master/src/main/scala/org/kududb/spark/KuduRDDFunctions.scala
 (obviously we should remove the hbase references).  Even with implicits, I 
think we might be able to clean up the api even further, but I need to play 
with it a little bit.

Similarly, I think you could use implicits to add more support for kudu & 
dataframes (again I need to take a closer look).  Also seems like predicate 
pushdown is missing?

I dunno how important this stuff is for v1, or if you'd rather release for now 
and add that stuff in the next release.  It might be worth at least playing 
with what those apis would look like, to make sure that what is here isn't 
incompatible.

imran

http://gerrit.cloudera.org:8080/#/c/1788/6/java/kudu-spark/src/main/scala/org/kududb/spark/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/kududb/spark/DefaultSource.scala:

Line 20: import org.apache.spark.Logging
the Logging class isn't meant to a public part of spark (I think it was leaked 
accidentally, and now just has a comment to that effect).  In any case, looks 
like you aren't using it here at all.


Line 41:   val KUDU_MASTER: String = "kudu.master"
super nit, but its unusual to put in a type on simple declarations like this 
(certainly doesn't hurt, though)


Line 93:   @transient lazy val kuduSchema = kuduTable.getSchema
looks like you only need the columns, so maybe better to use

```
kuduColumns = kuduTable.getScema.getColumns.asScala
```


Line 95:     kuduSchema.getColumns.asScala.map(c => (c.getName, c)).toMap
here the types aren't as obvious for kuduTable, kuduSchema, and 
kuduSchemaColumnMap, so might be nice to add the type annotations.  (Maybe the 
types are obvious to anybody that knows kudu, in which case ignore this.)


http://gerrit.cloudera.org:8080/#/c/1788/6/java/kudu-spark/src/main/scala/org/kududb/spark/KuduContext.scala
File java/kudu-spark/src/main/scala/org/kududb/spark/KuduContext.scala:

Line 24: import org.apache.spark.{Logging, SparkContext}
same here on Logging


Line 38: class KuduContext(sc: SparkContext, kuduMaster: String) extends 
Serializable with Logging {
We probably want to expose all of these functions via implicits, so its more 
like the existing RDD apis.  eg., maybe we want to enable something like:

```
implicit val kuducontext = ...
rdd.kuduForEachPartition(f)
```
there is nothing wrong with the methods you have here (in fact, we'll probably 
want to keep them even if we add the others).  But adding methods like that 
will probably be nicer for users.  It could always be added in v2 if this needs 
to go out the door.

I need to think a bit more about exactly how the apis should look and play 
around with some alternatives.


Line 139:     rdd.values
it would be nice if this rdd had its name set to something meaningful, eg. 
kudu:<tableName>:<columnProjection> for when users see it in the UI.


http://gerrit.cloudera.org:8080/#/c/1788/6/java/kudu-spark/src/test/scala/org/kududb/spark/DefaultSourceSuite.scala
File java/kudu-spark/src/test/scala/org/kududb/spark/DefaultSourceSuite.scala:

Line 19: import org.apache.spark.Logging
no need for Logging


http://gerrit.cloudera.org:8080/#/c/1788/6/java/kudu-spark/src/test/scala/org/kududb/spark/TestContext.scala
File java/kudu-spark/src/test/scala/org/kududb/spark/TestContext.scala:

Line 53:     sc = new SparkContext("local", "test", null, Nil, envMap)
not a big deal, but I'd suggest at least using "local[2]" in your test cases, 
so spark is at least running two threads.  Probably more important for tests of 
spark itself, but you might as well do that.


Line 66:     sc.stop()
nit: should all of these be inside a "if (sc != null)" etc. check, in case 
something goes wrong before the spark context is even created?


-- 
To view, visit http://gerrit.cloudera.org:8080/1788
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic187513ef9724d50024f7401d7ecd19d53554245
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans
Gerrit-Reviewer: Anonymous Coward #125
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

[kudu-CR] Spark connectors for Kudu

Reply via email to