[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions

2018-10-25 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664628#comment-16664628
 ] 

Felix Cheung commented on SPARK-16611:
--

Ping: we are going to consider removing the RDD methods in Spark 3.0.0.

> Expose several hidden DataFrame/RDD functions
> -
>
> Key: SPARK-16611
> URL: https://issues.apache.org/jira/browse/SPARK-16611
> Project: Spark
> Issue Type: Improvement
> Components: SparkR
> Reporter: Oscar D. Lara Yejas
> Priority: Major
>
> Expose the following functions:
> - lapply or map
> - lapplyPartition or mapPartition
> - flatMap
> - RDD
> - toRDD
> - getJRDD
> - cleanup.jobj
> cc:
> [~javierluraschi] [~j...@rstudio.com] [~shivaram]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions

2018-10-25 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664630#comment-16664630
 ] 

Felix Cheung commented on SPARK-16611:
--

see SPARK-12172




[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions

2017-10-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196161#comment-16196161
 ] 

Hyukjin Kwon commented on SPARK-16611:
--

(Let me leave two JIRAs that I suspect are related - SPARK-12172 and SPARK-7230)




[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions

2016-08-04 Thread Alok Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15408513#comment-15408513
 ] 

Alok Singh commented on SPARK-16611:


Hi [~shivaram]

Thanks for the reply.

1) To illustrate what I meant by the broadcast variable issue, please refer to
the following example:

    randomMatBr <- broadcast(sc, randomMat)

    worker <- function(r) { list(r[[1]] + 1) }     # R indexes from 1, not 0
    o1 <- dapply(df, worker, out_sch)              # case 1

    o2 <- lapply(df, worker)                       # case 2

    useBroadcast <- function(x) { sum(value(randomMatBr) * x) }
    o3 <- lapply(SparkR:::toRDD(df), useBroadcast) # case 3

Notes:
  - The user intends to run case 3, so they created the broadcast array, but
    they also want to compute either o1 or o2 (for other use cases). In cases
    1 and 2 they know that they will never use the broadcast elements. In
    case 1, however, the framework ships every element in ls(broadcastArr) to
    each node anyway; in case 2 it does not.
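
The concern in point 1 can be made concrete with a small base-R sketch that needs no Spark. It assumes a hypothetical registry of broadcast variables and a hypothetical helper, `used_broadcasts`; neither is part of SparkR. The helper inspects a worker function statically and reports which registered broadcast names it mentions, which is the information a framework would need in order to ship only those.

```r
# Hypothetical registry of broadcast variables, keyed by name.
# In SparkR these would be Broadcast objects; plain values are used here.
broadcasts <- list(randomMatBr = matrix(1, 2, 2), otherBr = 1:10)

# Names of the registered broadcast variables that `fn` refers to.
# all.vars() lists the variable names used in the function body
# (function names such as `sum` or `value` are excluded).
used_broadcasts <- function(fn, registry) {
  intersect(all.vars(body(fn)), names(registry))
}

# The two workers from the example above; they are inspected, never called.
worker <- function(r) list(r[[1]] + 1)
useBroadcast <- function(x) sum(value(randomMatBr) * x)

used_broadcasts(worker, broadcasts)        # no broadcast names referenced
used_broadcasts(useBroadcast, broadcasts)  # refers to randomMatBr only
```

This is only a static check: a worker that reaches a broadcast variable indirectly (for example through `get()` or through another function it calls) would be missed, so it illustrates the idea rather than a safe shipping policy.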

2) If there is one way of getting the RDD from a DataFrame, i.e. toRDD as you
suggested, that would be great :)
But is it going to work with a pipelined RDD or DataFrame too?

Here is one example to illustrate the point:

    # custom read.csv
    parseFields <- function(record) {
      Sys.setlocale("LC_ALL", "C") # necessary for strsplit() to work correctly
      nrecord <- as.character(record)
      parts <- strsplit(nrecord, ",")[[1]]
      list(id = parts[1], title = parts[2], modified = parts[3],
           text = parts[4], username = parts[5])
    }

    pr <- SparkR:::lapply(f, parseFields)
    cache(pr)
    pr
    sch <- structType(structField("id", "string"),
                      structField("title", "string"),
                      structField("modified", "string"),
                      structField("text", "string"),
                      structField("username", "string"))
    air_df <- createDataFrame(sqlContext, pr, sch)

    # now we pass air_df's RDD to SystemML

The current air_df is a pipelined DataFrame. getJRDD returns the proper RDD,
but when I last experimented with toRDD it did not work properly.
Please note that 2.0 will have read.csv, but the point is that a user can have
any pipelined RDD or DataFrame. Will toRDD also work with those?


Thanks for the confirmation that we are not removing the RDD API yet and that
renaming is the only goal :)

Alok




[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions

2016-07-28 Thread Clark Fitzgerald (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398346#comment-15398346
 ] 

Clark Fitzgerald commented on SPARK-16611:
--

+1 for more direct access to the RDDs. This would be very helpful for me as I 
try to implement general R objects using Spark as a backend for ddR:
https://github.com/vertica/ddR

Longer term it might make sense to organize SparkR into separate packages 
offering various levels of abstraction:
1. DataFrames - for most end users
2. RDDs - for package authors or special applications
3. Java objects - for directly invoking methods in Spark. This is what 
   sparkapi does.

For my application it would be much better to work at the middle layer (RDDs).




[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions

2016-07-28 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397827#comment-15397827
 ] 

Shivaram Venkataraman commented on SPARK-16611:
---

1. lapply: From an API perspective we can add an lapply that is implemented 
using dapply, i.e. dapply runs on each partition with a data.frame as input, 
while lapply could operate on each row. Also, dapply should not shuffle any 
data and should work the same as lapplyPartition in terms of execution.

It would be more interesting to know whether dapply is somehow worse in terms 
of functionality. Could you explain what the issue with useBroadcast is? From 
what I can see the code path is the same in RDD.R and DataFrame.R.

2. getJRDD: So it looks like the function required here is a way to extract the 
Java RDD object from the DataFrame? In that case there is a `toRDD` function 
in Dataset.scala that we should be able to expose (I think it should just 
involve using `callJMethod(df@sdf, "toRDD")`). This can be made to return a 
jobj instead of an RDD object in R.

3. Thanks for explaining this -- I think we can expose cleanup.jobj as a public 
function to be used to register finalizers.

This JIRA isn't about removing anything yet -- but short term we will need to 
remove / rename parts of the R API of RDDs to satisfy CRAN checks (see 
SPARK-16519). Longer term it would be great to have one code path that is well 
maintained, so knowing what doesn't work with the dapply family of functions 
will be very useful.
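
The idea in point 1, a row-wise lapply layered on a per-partition primitive, can be sketched in base R without Spark. Here a partition is modelled as a plain data.frame, and `apply_partitions` and `lapply_rows` are hypothetical names chosen for illustration; they are not SparkR functions and say nothing about how SparkR would actually implement this.

```r
# Hypothetical per-partition primitive, standing in for dapply:
# `partition_fun` receives one partition (a data.frame) at a time.
apply_partitions <- function(partitions, partition_fun) {
  lapply(partitions, partition_fun)
}

# Row-wise lapply expressed through the per-partition primitive.
lapply_rows <- function(partitions, row_fun) {
  per_partition <- function(part) {
    # Visit the rows of this partition; each row is a one-row data.frame.
    lapply(seq_len(nrow(part)), function(i) row_fun(part[i, , drop = FALSE]))
  }
  # Flatten one level: list of partitions -> flat list of row results.
  unlist(apply_partitions(partitions, per_partition),
         recursive = FALSE, use.names = FALSE)
}

df <- data.frame(x = 1:4, y = c(10, 20, 30, 40))
partitions <- split(df, rep(1:2, each = 2))  # two partitions of two rows each

result <- lapply_rows(partitions, function(row) row$x + row$y)
print(unlist(result))
```

No data moves between partitions in this sketch: each row is processed inside the partition it already belongs to, which mirrors the "should not shuffle" expectation stated above.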




[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions

2016-07-26 Thread Alok Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394485#comment-15394485
 ] 

Alok Singh commented on SPARK-16611:


Hi [~shivaram]

Sorry for the late reply; I was on vacation. Here are the detailed answers.


a) lapply or map
   lapplyPartition or mapPartition

We have pre-processing code (recoding, NA imputation, outlier handling, etc.) 
that differs from what Spark currently provides and that SparkR does not have, 
so we use apply on the DataFrame. lapplyPartition was added to the list 
because, for efficiency, we don't want to re-shuffle the data and in certain 
cases would prefer lapplyPartition to lapply.

b) flatMap:
Not a high priority; we can live without it :)

c) RDD: toRDD, getJRDD (i.e. the RDD API)

SystemML works internally in the binary block matrix format and its API is 
based on the JavaRDD APIs, so our R extension takes a SparkR DataFrame, 
extracts the RDD and passes it to the SystemML Java API. Note that Spark 
discourages the use of RDDs, because using DataFrames enables all the Catalyst 
optimizer features. In our case, however, we are sure that we will just get 
the RDD and convert it to the binary block matrix so that SystemML can consume 
it and do the heavy lifting.

d) cleanup.jobj:
SystemML uses the MLContext and matrixCharacteristic classes, which are 
instantiated in the JVM and whose object references are kept alive in SparkR. 
Later, when SystemML has finished its computation, we clean up the objects. We 
achieve this with Reference Classes in R: once we have created the jobj via 
newJObject("sysml.class"), we use the class's finalize method to register 
cleanup.jobj.
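
The pattern described in (d) can be sketched in plain R, without Spark or SystemML. `JvmHandle` and `release_log` are hypothetical stand-ins introduced here: a real implementation would hold a jobj and call cleanup.jobj inside the finalizer, whereas this sketch only records that the finalizer ran so that the behaviour can be observed.

```r
library(methods)

# Hypothetical log, used only to observe which handles were released.
release_log <- new.env()

# Reference class whose finalize() method runs when the object is
# garbage collected. In the workflow described above, this is where
# cleanup.jobj would be called on the wrapped jobj (assumption).
JvmHandle <- setRefClass(
  "JvmHandle",
  fields = list(id = "character", sink = "environment"),
  methods = list(
    finalize = function() {
      assign(id, TRUE, envir = sink)
    }
  )
)

h <- JvmHandle$new(id = "obj-1", sink = release_log)
rm(h)            # drop the only reference to the handle
invisible(gc())  # force a collection so the finalizer gets a chance to run

released <- exists("obj-1", envir = release_log, inherits = FALSE)
print(released)
```

Finalizers run at the garbage collector's discretion, so cleanup that must happen at a specific moment is usually better done through an explicit call, with the finalizer kept as a safety net.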



In general, I think the goal of a DataFrame-only API is great, but removing 
the RDD APIs entirely would cause many issues for our extension package on top 
of SparkR. Can we continue to keep them private for now (if we can't converge 
on a decision)?


One concern we have with dapply is that dapplyInternal always broadcasts the 
variables set via useBroadcast. In many cases, however, lapply is what one 
needs, because the user knows for sure that they will not be using the 
broadcast variables. lapply also fits naturally into R syntax.

Thanks
Alok





[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions

2016-07-25 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392493#comment-15392493
 ] 

Shivaram Venkataraman commented on SPARK-16611:
---

[~adrian555] dapply on DataFrames should have the same functionality as lapply 
on RDD




[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions

2016-07-25 Thread Weiqiang Zhuang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392360#comment-15392360
 ] 

Weiqiang Zhuang commented on SPARK-16611:
-

[~felixcheung] Alok is out of town and will be returning this week. We are 
currently on Spark 1.6.1. If you are removing the RDD functions from 2.0+, I 
think we are fine so far. But regarding the spark.lapply function you 
suggested: does it have the same functionality as the current lapply function? 
Thanks.




[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions

2016-07-23 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390856#comment-15390856
 ] 

Felix Cheung commented on SPARK-16611:
--

Where are we on this? I'm working on removing the RDD functions (SPARK-16519).




[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions

2016-07-19 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385033#comment-15385033
 ] 

Felix Cheung commented on SPARK-16611:
--

there's also spark.lapply




[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions

2016-07-18 Thread Weiqiang Zhuang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15383206#comment-15383206
 ] 

Weiqiang Zhuang commented on SPARK-16611:
-

To answer @shivaram's question: we call the lapply function to transform the 
dataset so that SystemML can run algorithms on it. lapply accepts an RDD, 
hence the requirement to expose these APIs and data types. We will 
investigate whether dapply and gapply will work for the same purpose.




[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions

2016-07-18 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15383050#comment-15383050
 ] 

Shivaram Venkataraman commented on SPARK-16611:
---

I think it's a bit different, as SPARK-16581 is only talking about the 
functions we have for calling into the JVM from R.

[~olarayej] Could you clarify some of the use cases you have in mind for these 
functions? As a project we made the high-level decision to expose only 
DataFrames, not RDDs, to external users. To offset the lack of the RDD API we 
now have UDFs on DataFrames in the form of dapply and gapply. It would be 
great to know whether your use cases can be met with dapply / gapply, and if 
not we can try to see what is missing from them. That would be easier than 
opening up the entire RDD API.




[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions

2016-07-18 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15383047#comment-15383047
 ] 

Felix Cheung commented on SPARK-16611:
--

Is this SPARK-16581?
