[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-23 Thread NarineK
Github user NarineK commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-221157067
  
sure!


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-23 Thread NarineK
Github user NarineK closed the pull request at:

https://github.com/apache/spark/pull/12966


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-23 Thread sun-rui
Github user sun-rui commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-221144808
  
@NarineK, I think we can close this PR?


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-09 Thread sun-rui
Github user sun-rui commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-218022565
  
I suggest we follow the original design. I will take a detailed look at the 
previous PR soon.


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-09 Thread NarineK
Github user NarineK commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217971884
  
This is similar to the following test case for repartition:

https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala#L1238


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-09 Thread NarineK
Github user NarineK commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217958625
  
Well, since the user provides the R function, I think they should provide 
the aggregate too. 
Instead of providing:
```
function(x) {
  data.frame(x$Species[1], mean(x$Sepal_Width), stringsAsFactors = FALSE)
}
```

they would provide:
```
function(x) {
  data.frame(aggregate(x$Sepal_Width, by = list(x$Species), FUN = mean),
             stringsAsFactors = FALSE)
}
```

This is my understanding; there may also be other ways.


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-09 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217956433
  
I am not sure why it should affect SparkR users. Users will still pass the 
same function to `gapply` as before, but we will implement this on the 
SparkR side using the `aggregate` code snippet above?


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-09 Thread NarineK
Github user NarineK commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217955181
  
What do you think, @shivaram , @sun-rui , @felixcheung  ?


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-09 Thread NarineK
Github user NarineK commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217945476
  
I see, that sounds good too, but I'm not sure how user-friendly it will be.
I guess on the R side we need something like this:
df <- createDataFrame(sqlContext, iris)
schema <- structType(structField("Species", "string"), structField("avg", "double"))
df <- repartition(df, col = df$"Species")
df1 <- dapply(
  df,
  function(x) {
    data.frame(aggregate(x$Sepal_Width, by = list(x$Species), FUN = mean),
               stringsAsFactors = FALSE)
  },
  schema)
collect(df1)



[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-09 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217939409
  
@NarineK After repartition(), all the rows with the same grouping key are in 
the same partition, so we could do another group-by in R within each single 
partition and then do the aggregate.
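
A minimal sketch of that idea (an editorial illustration, not code from this PR), assuming the iris SparkDataFrame used elsewhere in this thread: the function passed to `dapply` splits each partition-local data.frame by the grouping column and aggregates each group separately.
```
df <- createDataFrame(sqlContext, iris)
df <- repartition(df, col = df$"Species")   # co-locate rows of each species
schema <- structType(structField("Species", "string"),
                     structField("avg", "double"))
df1 <- dapply(
  df,
  function(x) {
    # A partition may hold several groups, so group again locally:
    pieces <- split(x, x$Species)
    do.call(rbind, lapply(pieces, function(g) {
      data.frame(g$Species[1], mean(g$Sepal_Width), stringsAsFactors = FALSE)
    }))
  },
  schema)
collect(df1)
```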


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-09 Thread NarineK
Github user NarineK commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217936323
  
Thanks, @davies !
This means that we cannot implement group-apply using repartitioning.
What would you suggest in this case?
My previous pull request works fine for one key.

We can try another implementation with groupBy -> agg. In this case, as I 
understand it, we would need to implement an imperative or declarative 
aggregate which would most probably collect the rows with the same grouping 
column into a buffer and pass that buffer of rows to the R worker.
Is this what you'd prefer?


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-09 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217932852
  
@NarineK There is no such partitioner right now (it can't be cheap).


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-08 Thread NarineK
Github user NarineK commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217739297
  
Thanks, @sun-rui !

Yes, that seems to be the case, @sun-rui.
I recently hit a case where the number of partitions was less than the 
number of actual groups.

I tried the same scenario with my previous implementation (groupByKey -> 
flatMap) and it works fine with any repartitioning.

Maybe @davies has some suggestions about this.
If there is a repartitioner that guarantees each partition holds exactly one 
group then we can use it; otherwise this approach won't give the expected 
result.



[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-08 Thread sun-rui
Github user sun-rui commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217699516
  
@NarineK, it is guaranteed that all items in the same group end up in the 
same partition, but it is not guaranteed that there is only a single group 
in a partition; there can be multiple groups in one partition.
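
A small illustration of this point (an editorial sketch, assuming the iris example from this thread and that `repartition` accepts both a partition count and a grouping column, as its signature suggests): with fewer partitions than distinct `Species` values, at least one partition must contain more than one group.
```
df <- createDataFrame(sqlContext, iris)
# Force 2 partitions while iris has 3 species.
df <- repartition(df, numPartitions = 2L, col = df$"Species")
schema <- structType(structField("groupsInPartition", "integer"))
counts <- dapply(
  df,
  function(x) {
    # Report how many distinct groups this partition actually holds.
    data.frame(length(unique(x$Species)))
  },
  schema)
collect(counts)  # at least one row will be greater than 1
```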


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-07 Thread NarineK
Github user NarineK commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217662459
  
Hi @sun-rui,
I think it depends on how we do the repartitioning. It shouldn't be the case 
when we repartition by column, as in:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2191
@davies, can it be the case?


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-07 Thread sun-rui
Github user sun-rui commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217632210
  
@NarineK, @shivaram, sorry for missing the previous discussion. The problem 
is that repartition() can leave multiple groups in a partition, which is not 
what gapply() is meant to handle. Am I missing something?


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-07 Thread NarineK
Github user NarineK commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217612497
  
Thanks, @shivaram !
I have one question regarding the factor data type. It seems that SparkR 
doesn't support it, and we have to set `stringsAsFactors = FALSE` in order to 
avoid "UnsupportedType" exceptions in data.frame.

Is it possible to map a factor to, say, a list or a string?

R's data.frame converts strings to factors by default, but a factor in 
general doesn't have to be a string.

Here is a quote from R's documentation: 

`Character variables passed to data.frame are converted to factor columns 
unless protected by I or argument stringsAsFactors is false.`
https://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html
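
For reference, a minimal local-R sketch (an editorial illustration, not SparkR code) of the difference the flag makes for the data.frame a user's function returns:
```
# Default behaviour (in R versions before 4.0): character columns become
# factors, which the SparkR serializer rejects with "UnsupportedType".
with_factor <- data.frame(Species = c("setosa", "virginica"),
                          Avg = c(3.4, 3.0))
class(with_factor$Species)   # "factor" under the old default

# With stringsAsFactors = FALSE the column stays a character vector,
# which maps cleanly to a Spark "string" column.
no_factor <- data.frame(Species = c("setosa", "virginica"),
                        Avg = c(3.4, 3.0),
                        stringsAsFactors = FALSE)
class(no_factor$Species)     # "character"
```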


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-07 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217611993
  
Change LGTM. Thanks @NarineK 

cc @sun-rui @felixcheung @davies - any other comments on this ?


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217611333
  
Merged build finished. Test PASSed.


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217611334
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58059/
Test PASSed.


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-07 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217611301
  
**[Test build #58059 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58059/consoleFull)**
 for PR 12966 at commit 
[`057ff9b`](https://github.com/apache/spark/commit/057ff9b30e56de4172c957e467b9b35dd932999a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-07 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217610810
  
**[Test build #58059 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58059/consoleFull)**
 for PR 12966 at commit 
[`057ff9b`](https://github.com/apache/spark/commit/057ff9b30e56de4172c957e467b9b35dd932999a).


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217602308
  
**[Test build #58050 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58050/consoleFull)**
 for PR 12966 at commit 
[`bf3a74d`](https://github.com/apache/spark/commit/bf3a74d34b21eaa6c3d1422c1135658d9be58a8a).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217602310
  
Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217602311
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58050/
Test FAILed.


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217601769
  
**[Test build #58050 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58050/consoleFull)**
 for PR 12966 at commit 
[`bf3a74d`](https://github.com/apache/spark/commit/bf3a74d34b21eaa6c3d1422c1135658d9be58a8a).


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217593702
  
Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217593698
  
**[Test build #58043 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58043/consoleFull)**
 for PR 12966 at commit 
[`9704956`](https://github.com/apache/spark/commit/97049564433607544beef439ddce272f607298d9).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217593705
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58043/
Test FAILed.


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217592573
  
**[Test build #58043 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58043/consoleFull)**
 for PR 12966 at commit 
[`9704956`](https://github.com/apache/spark/commit/97049564433607544beef439ddce272f607298d9).


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217588848
  
Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217588838
  
**[Test build #58041 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58041/consoleFull)**
 for PR 12966 at commit 
[`204a105`](https://github.com/apache/spark/commit/204a1053dabd74d39a25e725276e31bb3a592917).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217588851
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58041/
Test FAILed.


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217587290
  
**[Test build #58041 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58041/consoleFull)**
 for PR 12966 at commit 
[`204a105`](https://github.com/apache/spark/commit/204a1053dabd74d39a25e725276e31bb3a592917).


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread NarineK
Github user NarineK commented on a diff in the pull request:

https://github.com/apache/spark/pull/12966#discussion_r62400586
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1214,6 +1214,77 @@ setMethod("dapply",
 dataFrame(sdf)
   })
 
+#' gapply
+#'
+#' Apply a function to each group of a DataFrame. The group is defined by 
an input
+#' grouping column(s).
+#'
+#' @param x A SparkDataFrame
+#' @param func A function to be applied to each group partition specified 
by grouping
+#' column(s) of the SparkDataFrame.
+#' The output of func is a local R data.frame.
+#' @param schema The schema of the resulting SparkDataFrame after the 
function is applied.
+#'   It must match the output of func.
+#' @family SparkDataFrame functions
+#' @rdname gapply
+#' @name gapply
+#' @export
+#' @examples
+#'
+#' \dontrun{
+#'
+#' Computes the arithmetic mean of `Sepal_Width` by grouping
+#' on `Species`. Output the grouping value and the average.
+#'
+#' df <- createDataFrame (sqlContext, iris)
+#' schema <-  structType(structField("Species", "string"), 
structField("Avg", "double"))
+#' df1 <- gapply(
+#'   df,
+#'   function(x) {
+#' data.frame(x$Species[1], mean(x$Sepal_Width), stringsAsFactors = 
FALSE)
+#'   },
+#'   schema, col=df$"Species")
+#' collect(df1)
+#'
+#' Species  Avg
+#' -
+#' virginica   2.974
+#' versicolor  2.770
+#' setosa  3.428
+#'
+#' Fits linear models on iris dataset by grouping on the `Species` column 
and
+#' using `Sepal_Length` as a target variable, `Sepal_Width`, `Petal_Length`
+#' and `Petal_Width` as training features.
+#'
+#' df <- createDataFrame (sqlContext, iris)
+#' schema <- structType(structField("(Intercept)", "double"),
+#'   structField("Sepal_Width", "double"), structField("Petal_Length", 
"double"),
+#'   structField("Petal_Width", "double"))
+#' df1 <- gapply(
+#'   df,
+#'   function(x) {
+#' model <- suppressWarnings(lm(Sepal_Length ~
+#' Sepal_Width + Petal_Length + Petal_Width, x))
+#' data.frame(t(coef(model)))
+#'   }, schema, df$"Species")
+#' collect(df1)
+#'
+#'Result
+#'-
+#' Model  (Intercept)  Sepal_Width  Petal_Length  Petal_Width
+#' 1      0.699883     0.3303370    0.9455356     -0.1697527
+#' 2      1.895540     0.3868576    0.9083370     -0.6792238
+#' 3      2.351890     0.6548350    0.2375602      0.2521257
+#'
+#'}
+setMethod("gapply",
+  signature(x = "SparkDataFrame", func = "function", schema = 
"structType",
+col = "Column"),
--- End diff --

I might have an issue in the signature, I'll fix it


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217580569
  
**[Test build #58037 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58037/consoleFull)**
 for PR 12966 at commit 
[`30693c2`](https://github.com/apache/spark/commit/30693c2b40ab459a9fe252a2d00595c8190f2094).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217580580
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58037/
Test FAILed.


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217580578
  
Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread NarineK
Github user NarineK commented on a diff in the pull request:

https://github.com/apache/spark/pull/12966#discussion_r62400077
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1214,6 +1214,77 @@ setMethod("dapply",
 dataFrame(sdf)
   })
 
+#' gapply
+#'
+#' Apply a function to each group of a DataFrame. The group is defined by 
an input
+#' grouping column(s).
+#'
+#' @param x A SparkDataFrame
+#' @param func A function to be applied to each group partition specified 
by grouping
+#' column(s) of the SparkDataFrame.
+#' The output of func is a local R data.frame.
+#' @param schema The schema of the resulting SparkDataFrame after the 
function is applied.
+#'   It must match the output of func.
+#' @family SparkDataFrame functions
+#' @rdname gapply
+#' @name gapply
+#' @export
+#' @examples
+#'
+#' \dontrun{
+#'
+#' Computes the arithmetic mean of `Sepal_Width` by grouping
+#' on `Species`. Output the grouping value and the average.
+#'
+#' df <- createDataFrame (sqlContext, iris)
+#' schema <-  structType(structField("Species", "string"), 
structField("Avg", "double"))
+#' df1 <- gapply(
+#'   df,
+#'   function(x) {
+#' data.frame(x$Species[1], mean(x$Sepal_Width), stringsAsFactors = 
FALSE)
+#'   },
+#'   schema, col=df$"Species")
+#' collect(df1)
+#'
+#' Species  Avg
+#' -
+#' virginica   2.974
+#' versicolor  2.770
+#' setosa  3.428
+#'
+#' Fits linear models on iris dataset by grouping on the `Species` column 
and
+#' using `Sepal_Length` as a target variable, `Sepal_Width`, `Petal_Length`
+#' and `Petal_Width` as training features.
+#'
+#' df <- createDataFrame (sqlContext, iris)
+#' schema <- structType(structField("(Intercept)", "double"),
+#'   structField("Sepal_Width", "double"), structField("Petal_Length", 
"double"),
+#'   structField("Petal_Width", "double"))
+#' df1 <- gapply(
+#'   df,
+#'   function(x) {
+#' model <- suppressWarnings(lm(Sepal_Length ~
+#' Sepal_Width + Petal_Length + Petal_Width, x))
+#' data.frame(t(coef(model)))
+#'   }, schema, df$"Species")
+#' collect(df1)
+#'
+#'Result
+#'-
+#' Model  (Intercept)  Sepal_Width  Petal_Length  Petal_Width
+#' 1      0.699883     0.3303370    0.9455356     -0.1697527
+#' 2      1.895540     0.3868576    0.9083370     -0.6792238
+#' 3      2.351890     0.6548350    0.2375602      0.2521257
+#'
+#'}
+setMethod("gapply",
+  signature(x = "SparkDataFrame", func = "function", schema = 
"structType",
+col = "Column"),
--- End diff --

yes, absolutely!


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/12966#discussion_r62399743
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1214,6 +1214,77 @@ setMethod("dapply",
 dataFrame(sdf)
   })
 
+#' gapply
+#'
+#' Apply a function to each group of a DataFrame. The group is defined by 
an input
+#' grouping column(s).
+#'
+#' @param x A SparkDataFrame
+#' @param func A function to be applied to each group partition specified 
by grouping
+#' column(s) of the SparkDataFrame.
+#' The output of func is a local R data.frame.
+#' @param schema The schema of the resulting SparkDataFrame after the 
function is applied.
+#'   It must match the output of func.
+#' @family SparkDataFrame functions
+#' @rdname gapply
+#' @name gapply
+#' @export
+#' @examples
+#'
+#' \dontrun{
+#'
+#' Computes the arithmetic mean of `Sepal_Width` by grouping
+#' on `Species`. Output the grouping value and the average.
+#'
+#' df <- createDataFrame (sqlContext, iris)
+#' schema <-  structType(structField("Species", "string"), 
structField("Avg", "double"))
+#' df1 <- gapply(
+#'   df,
+#'   function(x) {
+#' data.frame(x$Species[1], mean(x$Sepal_Width), stringsAsFactors = 
FALSE)
+#'   },
+#'   schema, col=df$"Species")
+#' collect(df1)
+#'
+#' Species  Avg
+#' -
+#' virginica   2.974
+#' versicolor  2.770
+#' setosa  3.428
+#'
+#' Fits linear models on iris dataset by grouping on the `Species` column 
and
+#' using `Sepal_Length` as a target variable, `Sepal_Width`, `Petal_Length`
+#' and `Petal_Width` as training features.
+#'
+#' df <- createDataFrame (sqlContext, iris)
+#' schema <- structType(structField("(Intercept)", "double"),
+#'   structField("Sepal_Width", "double"), structField("Petal_Length", 
"double"),
+#'   structField("Petal_Width", "double"))
+#' df1 <- gapply(
+#'   df,
+#'   function(x) {
+#' model <- suppressWarnings(lm(Sepal_Length ~
+#' Sepal_Width + Petal_Length + Petal_Width, x))
+#' data.frame(t(coef(model)))
+#'   }, schema, df$"Species")
+#' collect(df1)
+#'
+#'Result
+#'-
+#' Model  (Intercept)  Sepal_Width  Petal_Length  Petal_Width
+#' 1      0.699883     0.3303370    0.9455356     -0.1697527
+#' 2      1.895540     0.3868576    0.9083370     -0.6792238
+#' 3      2.351890     0.6548350    0.2375602      0.2521257
+#'
+#'}
+setMethod("gapply",
+  signature(x = "SparkDataFrame", func = "function", schema = 
"structType",
+col = "Column"),
--- End diff --

can we handle multiple columns now that we are using repartition ?


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217578915
  
**[Test build #58035 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58035/consoleFull)**
 for PR 12966 at commit 
[`be5de6a`](https://github.com/apache/spark/commit/be5de6a42e50c42b4af15b87623bc7b49aecb353).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217578934
  
Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217578938
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58035/
Test FAILed.


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217577994
  
**[Test build #58037 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58037/consoleFull)**
 for PR 12966 at commit 
[`30693c2`](https://github.com/apache/spark/commit/30693c2b40ab459a9fe252a2d00595c8190f2094).


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12966#issuecomment-217575956
  
**[Test build #58035 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58035/consoleFull)**
 for PR 12966 at commit 
[`be5de6a`](https://github.com/apache/spark/commit/be5de6a42e50c42b4af15b87623bc7b49aecb353).


[GitHub] spark pull request: [SPARK-15196][SparkR] Add a wrapper for dapply...

2016-05-06 Thread NarineK
GitHub user NarineK opened a pull request:

https://github.com/apache/spark/pull/12966

[SPARK-15196][SparkR] Add a wrapper for dapply(repartition(col, ...), ...)

## What changes were proposed in this pull request?

As mentioned in
https://github.com/apache/spark/pull/12836#issuecomment-217338855,
we would like to create a wrapper for dapply(repartition(col, ...), ...).
This will allow running aggregate functions on groups that are identified 
by a list of grouping columns.

I called the wrapper method gapply. We can rename it if we want to call it 
differently.
We could also have
`setMethod("dapply", signature(x = "SparkDataFrame", func = "function", 
schema = "structType", col = "Column"),`; however, dapply already has many 
examples in the doc, and adding new examples for aggregate functions would 
make the documentation longer and less clear.
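
A rough sketch (an editorial illustration, not the code in this patch; the helper name `gapply_sketch` is hypothetical) of how such a wrapper could be composed from the existing `repartition` and `dapply` primitives:
```
gapply_sketch <- function(x, func, schema, col) {
  # Co-locate rows that share the grouping column value(s), then let
  # dapply run `func` on each partition-local data.frame.
  # Note: as discussed elsewhere in the thread, a partition may still
  # hold more than one group.
  dapply(repartition(x, col = col), func, schema)
}

# Usage, with the iris example used throughout the thread:
# df <- createDataFrame(sqlContext, iris)
# schema <- structType(structField("Species", "string"), structField("Avg", "double"))
# df1 <- gapply_sketch(df, function(x) {
#   data.frame(x$Species[1], mean(x$Sepal_Width), stringsAsFactors = FALSE)
# }, schema, df$"Species")
# collect(df1)
```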

## How was this patch tested?
Unit tests:
1. Group by a column and compute the mean.
2. Group by a column and train a linear model.





You can merge this pull request into a Git repository by running:

$ git pull https://github.com/NarineK/spark repartitionWithDapply

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12966.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12966


commit be5de6a42e50c42b4af15b87623bc7b49aecb353
Author: NarineK 
Date:   2016-05-06T21:51:35Z

Add a wrapper for dapply(repartiition(col,...), ... )



