[ 
https://issues.apache.org/jira/browse/DATAFU-148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16583146#comment-16583146
 ] 

Matthew Hayes commented on DATAFU-148:
--------------------------------------

Thanks [~eyal] and [~uzadude] for submitting this and setting up the initial 
Spark subproject.  This looks like a great start.  I look forward to seeing 
more of the Spark code you have.  I reviewed the code and have the following 
comments:

In SparkDFUtils.scala:
- dedup2 could use some additional description to differentiate it from dedup.
- flatten is missing documentation
- for broadcastJoinSkewed, the description of the numberCustsToBroadcast field 
isn't clear to me
- joinWithRange could use some more documentation. For example, the fields are 
not all documented. It's not immediately obvious to me what DECREASE_FACTOR 
does and why it should have a default value of 2^8.
- Also, joinWithRange seems characteristically different from the others in 
this file, as it's a bit more use-case specific. Maybe later it would make 
sense to move it to a separate file.
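To illustrate the level of documentation I have in mind for these methods, 
something along these lines would help (a sketch only; the wording and 
parameter names below are invented for the example, not copied from 
SparkDFUtils):
{code:scala}
/**
 * One-line summary of what the method does, including how it differs
 * from similarly named methods (e.g. dedup vs. dedup2).
 *
 * @param df       the input DataFrame (name assumed for the example)
 * @param groupCol the column to group by (name assumed for the example)
 * @return         a DataFrame with ...
 */
{code}
In particular, every parameter should get an @param line, and non-obvious 
defaults like DECREASE_FACTOR's 2^8 should say why that value was chosen.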

In build.gradle:
- The download plugin isn't needed.
- Is autojarring necessary? Looking at the contents of the datafu-spark jar, we 
only have datafu.spark and org.apache.spark classes. It seems like 
org.apache.spark classes shouldn't need to be included. Also the build.gradle 
autojars commonsmath and guava, which aren't used. It seems all this jarjar and 
autojar stuff could be stripped out of this file.
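For what it's worth, if the jarjar/autojar plumbing were stripped out, I'd 
expect the packaging to reduce to the plain jar task, with Spark as a 
compile-time-only dependency (a sketch; I haven't verified this against the 
actual build.gradle, and the version property name is assumed):
{code:groovy}
// Sketch only: package just the datafu.spark classes. Spark itself is
// provided by the cluster at runtime, so it should not be autojarred in.
dependencies {
    compileOnly "org.apache.spark:spark-sql_2.11:$sparkVersion" // property name assumed
}

jar {
    from sourceSets.main.output
}
{code}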

flatten and changeSchema should have tests, I think.
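For example, a flatten test could look roughly like this (a sketch only; the 
signature of flatten and the shape of its output are assumptions on my part):
{code:scala}
// Sketch only -- flatten's signature and output are assumed here.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("flatten-test")
  .getOrCreate()
import spark.implicits._

val df = Seq(("a", (1, 2))).toDF("key", "nested")
val flat = SparkDFUtils.flatten(df, "nested") // signature assumed
flat.printSchema() // verify the struct fields were promoted to top-level columns
{code}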

A question regarding documentation: people would generally be using these via 
{{DataFrameOps}}, so it would probably be helpful to have doc links in those 
methods to the underlying implementation.  Is the reason {{SparkDFUtils}} is 
split out into a separate file so that it can be used in the future by other 
methods?  By the way, I found out you can generate the docs with the command 
below.  Before including this in a release it would be good to review the 
generated docs and see where they can be improved.  For example, the packages 
and objects don't have docs.
{code:java}
./gradlew :datafu-spark:scaladoc
{code}
Also, if we were to merge this in somewhere, it should probably go into a new 
pending release branch like 2.0.0 so we can continue to work on getting it 
ready independently of short-term releases.  I think this should trigger a major 
version bump since it is a new sub-project and gives us the chance to clean up 
anything we've deprecated. Thoughts?

> Setup Spark sub-project
> -----------------------
>
>                 Key: DATAFU-148
>                 URL: https://issues.apache.org/jira/browse/DATAFU-148
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>            Priority: Major
>
> Create a skeleton Spark sub project for Spark code to be contributed to DataFu



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
