[ https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15394485#comment-15394485 ]
Alok Singh commented on SPARK-16611:
------------------------------------

Hi [~shivaram],

Sorry for the late reply; I was on vacation. Here are the detailed answers:

a) lapply or map, lapplyPartition or mapPartition
We have existing pre-processing code (recoding, NA imputation, outlier handling, etc.) that differs from what Spark currently provides (SparkR doesn't have it yet), so we use apply on the DataFrame. lapplyPartition was added to the list because, for efficiency, we don't want to re-shuffle the data, and in certain cases we would prefer lapplyPartition over lapply.

b) flatMap
Not a high priority; we can live without it :)

c) RDD: toRDD, getJRDD (i.e. the RDD API)
SystemML works internally in the binary-block matrix format, and its API is based on the JavaRDD APIs. Our R extension takes a SparkR DataFrame, extracts the RDD, and passes it to the SystemML Java API. Note that Spark discourages direct use of RDDs, since working with DataFrames enables all of the Catalyst optimizer features. In our case, however, we are sure we will only extract the RDD and convert it to the binary-block matrix format so that SystemML can consume it and do the heavy lifting.

d) cleanup.jobj
SystemML uses the MLContext and MatrixCharacteristics classes, which are instantiated in the JVM and whose object references are kept alive on the SparkR side; once SystemML has finished its computation, we clean those objects up. We achieve this with reference classes in R, using their finalize method to register cleanup.jobj once we have created the jobj via newJObject("sysml.class").

In general, I think the goal of a DataFrame-only API is great, but removing the RDD APIs entirely would cause many issues for our extension package built on top of SparkR. Can we keep them private for now (if the decision can't be reversed)? One concern we have with using dapply is that dapplyInternal always broadcasts the variables set via useBroadcast.
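As a point of reference, the partition-wise pre-processing described above can be expressed with the public dapply API (SparkR 2.0). This is only an illustrative sketch, not our actual code: the data frame, the NA-impute logic, and the schema are all made up for the example.

```r
# Hedged sketch: dapply applies an R function to each partition of a
# SparkR DataFrame, passed in as a local data.frame. Assumes a running
# SparkSession; the input data and impute rule are illustrative only.
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(x = c(1, NA, 3)))

# The output schema must be declared up front.
schema <- structType(structField("x", "double"))

imputed <- dapply(df, function(part) {
  part$x[is.na(part$x)] <- 0  # toy NA impute, standing in for our pre-processing
  part
}, schema)

head(imputed)
```

Note that this path still goes through dapplyInternal, so the broadcast behavior mentioned above applies even when no broadcast variables are needed.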
However, in many cases lapply is what one needs, since the user knows for sure that no broadcast variables will be used. lapply also falls naturally into R syntax.

Thanks,
Alok

> Expose several hidden DataFrame/RDD functions
> ---------------------------------------------
>
> Key: SPARK-16611
> URL: https://issues.apache.org/jira/browse/SPARK-16611
> Project: Spark
> Issue Type: Improvement
> Components: SparkR
> Reporter: Oscar D. Lara Yejas
>
> Expose the following functions:
> - lapply or map
> - lapplyPartition or mapPartition
> - flatMap
> - RDD
> - toRDD
> - getJRDD
> - cleanup.jobj
>
> cc:
> [~javierluraschi] [~j...@rstudio.com] [~shivaram]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)