[ https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15394485#comment-15394485 ]
Alok Singh commented on SPARK-16611:
------------------------------------

Hi [~shivaram],

Sorry for the late reply; I was on vacation. Here are the detailed answers:

a) lapply or map, lapplyPartition or mapPartition
We have existing pre-processing code (recoding, NA imputation, outlier handling, etc.) that differs from what Spark currently provides (SparkR doesn't have it yet), so we use apply on the DataFrame. lapplyPartition was added to the list because, for efficiency, we don't want to re-shuffle the data, and in certain cases we would prefer lapplyPartition over lapply.

b) flatMap
Not a high priority; we can live without it :)

c) RDD: toRDD, getJRDD (i.e. the RDD API)
SystemML works internally in the binary-block matrix format, and its API is based on the JavaRDD APIs. Our R extension takes a SparkR DataFrame, extracts the RDD, and passes it to the SystemML Java API. Note that Spark discourages direct use of RDDs, since working with DataFrames enables all of the Catalyst optimizer features. In our case, however, we are sure we will only extract the RDD and convert it to the binary-block matrix format so that SystemML can consume it and do the heavy lifting.

d) cleanup.jobj
SystemML uses the MLContext and MatrixCharacteristics classes, which are instantiated in the JVM and whose object references are kept alive on the SparkR side; once SystemML has finished its computation, we clean those objects up. We achieve this with reference classes in R, using their finalize method to register cleanup.jobj once we have created the jobj via newJObject("sysml.class").

In general, I think the goal of a DataFrame-only API is great, but removing the RDD APIs entirely would cause many issues for our extension package built on top of SparkR. Can we keep them private for now (if the decision can't be reversed)? One concern we have with using dapply is that dapplyInternal always broadcasts the variables set via useBroadcast.
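As a point of reference, the partition-wise pre-processing described above can be expressed with the public dapply API (SparkR 2.0). This is only an illustrative sketch, not our actual code: the data frame, the NA-impute logic, and the schema are all made up for the example.

```r
# Hedged sketch: dapply applies an R function to each partition of a
# SparkR DataFrame, passed in as a local data.frame. Assumes a running
# SparkSession; the input data and impute rule are illustrative only.
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(x = c(1, NA, 3)))

# The output schema must be declared up front.
schema <- structType(structField("x", "double"))

imputed <- dapply(df, function(part) {
  part$x[is.na(part$x)] <- 0  # toy NA impute, standing in for our pre-processing
  part
}, schema)

head(imputed)
```

Note that this path still goes through dapplyInternal, so the broadcast behavior mentioned above applies even when no broadcast variables are needed.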
However, in many cases lapply is what one needs, since the user knows for sure that no broadcast variables will be used. lapply also falls naturally into R syntax.

Thanks,
Alok

> Expose several hidden DataFrame/RDD functions
> ---------------------------------------------
>
> Key: SPARK-16611
> URL: https://issues.apache.org/jira/browse/SPARK-16611
> Project: Spark
> Issue Type: Improvement
> Components: SparkR
> Reporter: Oscar D. Lara Yejas
>
> Expose the following functions:
> - lapply or map
> - lapplyPartition or mapPartition
> - flatMap
> - RDD
> - toRDD
> - getJRDD
> - cleanup.jobj
>
> cc:
> [~javierluraschi] [~j...@rstudio.com] [~shivaram]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)