[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions
[ https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664628#comment-16664628 ]

Felix Cheung commented on SPARK-16611:
--------------------------------------

ping - we are going to consider removing the RDD methods in Spark 3.0.0

> Expose several hidden DataFrame/RDD functions
> ---------------------------------------------
>
> Key: SPARK-16611
> URL: https://issues.apache.org/jira/browse/SPARK-16611
> Project: Spark
> Issue Type: Improvement
> Components: SparkR
> Reporter: Oscar D. Lara Yejas
> Priority: Major
>
> Expose the following functions:
> - lapply or map
> - lapplyPartition or mapPartition
> - flatMap
> - RDD
> - toRDD
> - getJRDD
> - cleanup.jobj
>
> cc: [~javierluraschi] [~j...@rstudio.com] [~shivaram]

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions
[ https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664630#comment-16664630 ]

Felix Cheung commented on SPARK-16611:
--------------------------------------

see SPARK-12172
[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions
[ https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196161#comment-16196161 ]

Hyukjin Kwon commented on SPARK-16611:
--------------------------------------

(Let me leave two JIRAs that I suspect are related - SPARK-12172 and SPARK-7230)
[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions
[ https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15408513#comment-15408513 ]

Alok Singh commented on SPARK-16611:
------------------------------------

Hi [~shivaram],

Thanks for the reply.

1) To illustrate what I meant by the broadcast-variable issue, consider the following example:

    randomMatBr <- broadcast(sc, randomMat)
    worker <- function(r) { list(r[[1]] + 1) }
    o1 <- dapply(df, worker, out_sch)      # case 1
    o2 <- lapply(df, worker)               # case 2
    useBroadcast <- function(x) { sum(value(randomMatBr) * x) }
    o3 <- lapply(df.toRdd, useBroadcast)   # case 3

Notes: the user intends to use case 3, so he created the broadcast array, but he also wants to compute either o1 or o2 (for other use cases). In case 1 and case 2, he knows he will never use the broadcast elements; yet in case 1 the framework will still ship the elements in ls(broadcastArr) to each node, whereas in case 2 it won't.

2) If there is one way of getting the RDD from a DataFrame, i.e. toRDD as you suggested, that would be great :) But will it also work with a pipelined RDD/DataFrame? Here is one example to illustrate the point:

    # custom read.csv
    parseFields <- function(record) {
      Sys.setlocale("LC_ALL", "C")  # necessary for strsplit() to work correctly
      nrecord <- as.character(record)
      parts <- strsplit(nrecord, ",")[[1]]
      list(id = parts[1], title = parts[2], modified = parts[3],
           text = parts[4], username = parts[5])
    }
    pr <- SparkR:::lapply(f, parseFields)
    cache(pr)
    pr
    sch <- structType(structField("id", "string"),
                      structField("title", "string"),
                      structField("modified", "string"),
                      structField("text", "string"),
                      structField("username", "string"))
    air_df <- createDataFrame(sqlContext, pr, sch)
    # now we pass air_df's RDD to SystemML

The current air_df is a pipelined DataFrame, and getJRDD returns the proper RDD, but when we used toRDD my last experiment didn't work properly. (Please note that in 2.0 we will have read.csv, but the point is that the user can have any pipelined RDD and DataFrame.) Will toRDD also work with a pipelined RDD/DataFrame?

Thanks for the confirmation that we are not removing the RDD API yet and that only renaming is the goal :)

Alok
[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions
[ https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398346#comment-15398346 ]

Clark Fitzgerald commented on SPARK-16611:
------------------------------------------

+1 for more direct access to the RDDs. This would be very helpful for me as I try to implement general R objects using Spark as a backend for ddR: https://github.com/vertica/ddR

Longer term it might make sense to organize SparkR into separate packages offering various levels of abstraction:
# dataframes - for most end users
# RDDs - for package authors or special applications
# Java objects - for directly invoking methods in Spark. This is what sparkapi does.

For my application it would be much better to work at this middle layer.
[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions
[ https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397827#comment-15397827 ]

Shivaram Venkataraman commented on SPARK-16611:
-----------------------------------------------

1. lapply: From an API perspective we can add an lapply that is implemented using dapply, i.e. dapply runs on each partition with its input as a data.frame, while lapply could operate on each row. Also, dapply should not shuffle any data and should work the same as lapplyPartition in terms of execution. It would be more interesting to know if dapply is somehow worse in terms of functionality. Could you explain what the issue with useBroadcast is? From what I can see the code path is the same in RDD.R and DataFrame.R.

2. getJRDD: So it looks like the function required here is a way to extract the Java RDD object from the DataFrame? In that case there is a `toRDD` function in Dataset.scala that we should be able to expose (I think it should just involve using `callJMethod(df@sdf, "toRDD")`). This can be made to return a jobj instead of an RDD object in R.

3. Thanks for explaining this -- I think we can expose cleanup.jobj as a public function to be used to register finalizers.

This JIRA isn't about removing anything yet -- but short term we will need to remove / rename parts of the R API of RDDs to satisfy CRAN checks (see SPARK-16519). Longer term it would be great to have one code path that is well maintained, so knowing what doesn't work with the dapply family of functions will be very useful.
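Shivaram's point 2 above can be sketched from the SparkR side roughly as follows. This is illustrative only: `df@sdf` and `SparkR:::callJMethod` are SparkR internals, and whether the result should be surfaced as a raw jobj is exactly the design question under discussion here.

```r
# Hedged sketch: extracting the underlying Java RDD from a SparkDataFrame.
# Assumes SparkR internals behave as in current SparkR: df@sdf holds the
# JVM-side Dataset reference, and callJMethod invokes a method on it.
library(SparkR)
sparkR.session()

df <- createDataFrame(faithful)

# "toRDD" is a method on org.apache.spark.sql.Dataset; the return value
# here is an opaque jobj (a JVM reference), not an R-side RDD object.
jrdd <- SparkR:::callJMethod(df@sdf, "toRDD")
```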
[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions
[ https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394485#comment-15394485 ]

Alok Singh commented on SPARK-16611:
------------------------------------

Hi [~shivaram],

Sorry for the late reply due to vacation. Here are the detailed answers.

a) lapply or map, lapplyPartition or mapPartition:
We have existing pre-processing code (i.e. recode, NA impute, outlier handling, etc.) implemented differently from what Spark currently provides (and SparkR currently doesn't have it), and hence we use the apply functions on the DataFrame. lapplyPartition was added to the list since we don't want to re-shuffle data, for efficiency, and in certain cases would prefer lapplyPartition over lapply.

b) flatMap: not a high priority; we can live without it :)

c) RDD: toRDD, getJRDD (i.e. the RDD API):
SystemML works internally in a binary-block matrix format, and its API is based on the JavaRDD APIs; our R extension takes a SparkR DataFrame, extracts the RDD, and passes it to the SystemML Java API. Note that the use of RDDs is discouraged by Spark, since using DataFrames enables all the Catalyst optimizer features. However, in our case we are sure we will just get the RDD and convert it to the binary-block matrix so that SystemML can consume it and do the heavy lifting.

d) cleanup.jobj:
SystemML uses the MLContext and MatrixCharacteristics classes, which are instantiated in the JVM and whose object references are kept alive on the SparkR side; later, once SystemML has finished its computation, we clean up the objects. We achieve this using reference classes in R, registering cleanup.jobj in their finalize method once we have created the jobj via newJObject("sysml.class").

In general, I think the goal of a DataFrame-only API is great, but removing the RDD APIs 100% would cause many issues for our extension package on top of SparkR. Can we continue to keep them private for now (if we can't converge on the decision)?

For using dapply, one concern we have is that dapplyInternal always broadcasts the variables set via useBroadcast. However, in many cases lapply is what one needs, as the user knows for sure that he will not be using the broadcast variables. Also, lapply falls naturally into R syntax.

Thanks,
Alok
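The cleanup pattern described in (d) can be outlined in plain R with reg.finalizer. This is only a hedged sketch of the registration mechanics, not SystemML's actual code: makeHandle is a hypothetical helper, and cleanup.jobj is the internal SparkR function under discussion.

```r
# Hedged sketch: tie cleanup of a JVM-side object to the lifetime of an
# R-side handle, so the JVM reference is released when R garbage-collects it.
makeHandle <- function(jobj) {
  env <- new.env()
  env$jobj <- jobj
  # The finalizer runs when `env` is collected (or at session exit).
  reg.finalizer(env, function(e) SparkR:::cleanup.jobj(e$jobj), onexit = TRUE)
  env
}
```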
[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions
[ https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392493#comment-15392493 ]

Shivaram Venkataraman commented on SPARK-16611:
-----------------------------------------------

[~adrian555] dapply on DataFrames should have the same functionality as lapply on RDDs
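A hedged sketch of that equivalence: dapply hands each partition to the UDF as a local data.frame, so per-row (lapply-style) logic can simply run inside the UDF. The column names and output schema below are illustrative, not from this thread.

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(x = 1:6))
schema <- structType(structField("x", "integer"),
                     structField("x2", "double"))

# The UDF receives each partition as a data.frame; row-wise work goes inside.
out <- dapply(df, function(pdf) {
  pdf$x2 <- pdf$x^2
  pdf
}, schema)
head(collect(out))
```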
[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions
[ https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392360#comment-15392360 ]

Weiqiang Zhuang commented on SPARK-16611:
-----------------------------------------

[~felixcheung] Alok is out of town and will be returning this week. We are currently on Spark 1.6.1. If you are removing the RDD functions from 2.0+, I think we are fine so far. But does the spark.lapply function you suggested have the same functionality as the current lapply function? Thanks.
[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions
[ https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390856#comment-15390856 ]

Felix Cheung commented on SPARK-16611:
--------------------------------------

where are we on this? I'm working on removing the RDD functions in SPARK-16519
[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions
[ https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385033#comment-15385033 ]

Felix Cheung commented on SPARK-16611:
--------------------------------------

there's also spark.lapply
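For reference, spark.lapply runs a function over a local R list on the cluster and returns a local list -- the closest public analogue to the RDD-level lapply discussed in this thread. A minimal usage sketch, assuming a running Spark session:

```r
library(SparkR)
sparkR.session()

# Distribute the elements of a local list, apply the function on executors,
# and collect the results back as an R list.
squares <- spark.lapply(1:5, function(x) x^2)
# squares should be list(1, 4, 9, 16, 25)
```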
[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions
[ https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15383206#comment-15383206 ]

Weiqiang Zhuang commented on SPARK-16611:
-----------------------------------------

To answer [~shivaram]'s question: we are calling the lapply function to transform the dataset so that SystemML can run algorithms on it. lapply accepts an RDD, hence the requirement to expose these APIs and data types. We will investigate whether dapply and gapply will work for the same purpose.
[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions
[ https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15383050#comment-15383050 ]

Shivaram Venkataraman commented on SPARK-16611:
-----------------------------------------------

I think it's a bit different, as SPARK-16581 is only talking about the functions we have to call into the JVM from R.

[~olarayej] Could you clarify what are some of the use cases you have in mind for these functions? As a project we made the high-level decision to only expose DataFrames, and not RDDs, to external users. To offset the lack of the RDD API we now have UDFs on DataFrames in the form of dapply and gapply. It would be great to know if your use cases can be met with dapply / gapply; if not, we can try to see what is missing from them. That would be easier than opening up the entire RDD API.
[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions
[ https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15383047#comment-15383047 ]

Felix Cheung commented on SPARK-16611:
--------------------------------------

Is this SPARK-16581?