Todd Lipcon has posted comments on this change.

Change subject: Spark connectors for Kudu
......................................................................


Patch Set 4:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/1788/4/java/kudu-spark/pom.xml
File java/kudu-spark/pom.xml:

Line 92:             <artifactId>hadoop-client</artifactId>
hadoop-client pulls in stuff like the HDFS client, which we shouldn't need. I 
think we can probably get by with hadoop-common (maybe plus some set of 
exclusions?). Best I can tell we only use Configuration and NullWritable, which 
are both in the common jar and shouldn't have any real deps... everything else 
I think we should pick up transitively from kudu-mapreduce above


http://gerrit.cloudera.org:8080/#/c/1788/4/java/kudu-spark/src/main/scala/org/kududb/spark/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/kududb/spark/DefaultSource.scala:

Line 52:       throw new IllegalArgumentException(s"Invalid value for 
$TABLE_KEY '$tableName'")
woah. mind blown.


Line 90:     kuduSchema.getColumns.asScala.map({ c => c.getName -> c }).toMap
this is my scala inexperience talking, but I find it hard to tell whether => 
or -> has higher precedence here. could you add some parens to disambiguate? if 
I'm reading it correctly it's c => (c.getName -> c)
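For what it's worth, a self-contained check (using a stand-in Col class rather than the real ColumnSchema) suggests the unparenthesized form does already parse as c => (c.getName -> c), since the lambda body extends to the right and -> applies inside it — though explicit parens would still read better:

```scala
// Stand-in for ColumnSchema, just to check how the expression parses.
case class Col(getName: String)

val cols = Seq(Col("key"), Col("value"))

// Unparenthesized: parses as c => (c.getName -> c).
val m1 = cols.map(c => c.getName -> c).toMap

// With explicit parens, as suggested above — same result.
val m2 = cols.map(c => (c.getName -> c)).toMap
```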


Line 118:    * @param requiredColumns clumns that are being requested by the 
requesting query
typo


Line 129:   def getKuduValue(row: RowResult, columnName: String): Any = {
can this be private?


http://gerrit.cloudera.org:8080/#/c/1788/4/java/kudu-spark/src/main/scala/org/kududb/spark/KuduContext.scala
File java/kudu-spark/src/main/scala/org/kududb/spark/KuduContext.scala:

Line 144:     f(it, syncClient, asyncClient)
I think this causes both of the clients to be evaluated, which constructs 
parallel clients (and likely twice as many nettys, etc)... based on 
http://matt.might.net/articles/implementing-laziness/, you'd need to change the 
parameters to be "by-name" parameters, whatever that means.

Given that the async client is also considered a bit unstable, maybe we should 
have separate functions, one that yields async, and one that yields sync? or 
just do the sync one, since you could probably have an API that grabs the async 
out of the sync wrapper?
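Haven't tried this against the patch itself, but a minimal sketch of the by-name mechanism (with String and a counter standing in for the real client types and constructors — all names here are made up) shows the difference: marking a parameter `=> T` defers evaluation until the body actually touches it, so the unused client is never constructed.

```scala
var clientsBuilt = 0

// Hypothetical stand-ins for the real client constructors.
def newSyncClient(): String  = { clientsBuilt += 1; "sync" }
def newAsyncClient(): String = { clientsBuilt += 1; "async" }

// Eager: both argument expressions run at the call site.
def eagerF(sync: String, async: String): String = sync
eagerF(newSyncClient(), newAsyncClient())
val builtEagerly = clientsBuilt          // both clients constructed

// By-name (`=> String`): an argument is only evaluated when the
// body uses it, so the untouched async client is never built.
clientsBuilt = 0
def byNameF(sync: => String, async: => String): String = sync
byNameF(newSyncClient(), newAsyncClient())
val builtLazily = clientsBuilt           // only the sync client
```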


http://gerrit.cloudera.org:8080/#/c/1788/4/java/kudu-spark/src/test/scala/org/kududb/spark/DefaultSourceSuite.scala
File java/kudu-spark/src/test/scala/org/kududb/spark/DefaultSourceSuite.scala:

Line 36:     assert(sqlContext.sql("SELECT * FROM " + 
tableName).collectAsList().size() == rowCount)
I think we should test some basic value-dependent queries like SUM and 
GROUP_CONCAT(strings) -- there's a lot of subtlety in the fact that our MR 
input format reuses the same RowResult object, and we should make sure that 
Spark handles this correctly
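The hazard is easy to reproduce without Kudu or Spark at all — here's a toy iterator that reuses one mutable object the way the input format reuses its RowResult; anything that holds references instead of copying values out sees only the last row:

```scala
// Toy stand-in for a reused RowResult: one mutable object per scan.
class ReusedRow { var value: Int = 0 }

def scan(): Iterator[ReusedRow] = {
  val row = new ReusedRow
  Iterator(1, 2, 3).map { v => row.value = v; row }
}

// Holding references to the reused object: every slot ends up
// pointing at the same row, which last held 3.
val broken = scan().toList.map(_.value)

// Copying the value out per element gives the expected answer.
val correct = scan().map(_.value).toList
```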


http://gerrit.cloudera.org:8080/#/c/1788/4/java/kudu-spark/src/test/scala/org/kududb/spark/TestContext.scala
File java/kudu-spark/src/test/scala/org/kududb/spark/TestContext.scala:

Line 42:       new ColumnSchemaBuilder("c2_s", Type.STRING).build())
should test some nulls


-- 
To view, visit http://gerrit.cloudera.org:8080/1788
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic187513ef9724d50024f7401d7ecd19d53554245
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
