[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-29 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/20295
  
@HyukjinKwon Thanks for the comment. I will continue with the current 
approach unless objection raises. I will work on comments and refinements in 
the next day or two.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20295
  
For https://github.com/apache/spark/pull/20295#issuecomment-360297123, I am 
fine without new serialization protocol actually. I didn't have a strong 
preference there because I wasn't sure if it's worth - complexity vs actual 
gain vaguely and seems that's now clarified there. I am okay with the current 
approach.

@BryanCutler, I think the intention here is to follow other few APIs and 
`gapply` in R. I guess you meant the length and metadata stuff by "an optional 
kwargs to each pandas_udf to deal with 0-param udfs" if I remember correctly. I 
think that's slightly different because here the motivation is to provide 
consistent support similarly with other APIs vs the `kwargs` thing sounds 
pretty few concept to me.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20295
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86651/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20295
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20295
  
**[Test build #86651 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86651/testReport)**
 for PR 20295 at commit 
[`6b8fdbc`](https://github.com/apache/spark/commit/6b8fdbc61ce8ef03a5ac54d65babc29c0697c9c4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20295
  
**[Test build #86651 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86651/testReport)**
 for PR 20295 at commit 
[`6b8fdbc`](https://github.com/apache/spark/commit/6b8fdbc61ce8ef03a5ac54d65babc29c0697c9c4).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20295
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/241/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20295
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20295
  
Build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20295
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86604/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20295
  
**[Test build #86604 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86604/testReport)**
 for PR 20295 at commit 
[`c7fccde`](https://github.com/apache/spark/commit/c7fccde701723c8808b095586e987ad02fc171c7).
 * This patch **fails PySpark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20295
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86606/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20295
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20295
  
**[Test build #86606 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86606/testReport)**
 for PR 20295 at commit 
[`cda97f1`](https://github.com/apache/spark/commit/cda97f142285bc01a2da96765d3703d37b4cb802).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-24 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/20295
  
Hi all,

I did some digging and I think adding a serialization form that serialize a 
key object along with a Arrow record batch is quite complicated because we are 
using ArrowStreamReader/Writer for sending batches and send extra key data 
would have to use a lower level Arrow API for sending/receiving batches.

I did two things to convince myself the current approach is fine:
* I add logic to de duplicate grouping key they are already in data 
columns. i.e., if a user calls
```
df.groupby('id').apply(foo_udf)
```
We will not send extra grouping columns because those are already part of 
data columns. Instead, we will just use the corresponding data column to get 
grouping key to pass to user function. However, if user calls:
```
df.groupby(df.id % 2).apply(foo_udf)
```
then an extra column `df.id % 2` will be created and sent to python worker. 
But I think this is an uncommon case.

* I did some benchmark to see the impact of sending extra grouping column. 
I used a Spark DataFrame of a single column to maximize the effect of the extra 
grouping column (basically sending extra grouping column will double the data 
to be sent to python in the benchmark, however in real use cases the effect of 
sending extra grouping columns should be far less).
Even with the setting of the benchmark, I have not observed performance 
diffs when sending extra grouping columns, I think this is because the total 
time is dominated by other parts of the computation. [micro 
benchmark](https://gist.github.com/icexelloss/88f6c6fdaf04aac39d68d74cd0942c07)

I'd like to leave the work for more flexible arrow serialization as future 
work because it doesn't seems to affect performance of this patch and proceed 
with the current patch based on the two points above. What do you guys think?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20295
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/206/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20295
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20295
  
**[Test build #86606 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86606/testReport)**
 for PR 20295 at commit 
[`cda97f1`](https://github.com/apache/spark/commit/cda97f142285bc01a2da96765d3703d37b4cb802).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20295
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/205/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20295
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20295
  
**[Test build #86604 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86604/testReport)**
 for PR 20295 at commit 
[`c7fccde`](https://github.com/apache/spark/commit/c7fccde701723c8808b095586e987ad02fc171c7).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20295
  
Build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20295
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/203/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-22 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/20295
  
Let me experiment with new serialization approach. Will update here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20295
  
To me, seems roughly fine.

> Alternatively, we can implement a new serialization protocol for 
GROUP_MAP eval type, i.e, instead of sending an arrow batch, we could send a 
group row and then an arrow batch.

I don't have a strong preference on this.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-19 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/20295
  
Yep, that's correct.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-18 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20295
  
How do we turn a single group column to a series? just repeat the group 
column?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-18 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/20295
  
@cloud-fan Currently I sent group columns along with the extra data column. 
For example, if the original DataFrame has `id, v` and group column is `id`, 
the current implementation in this PR will send three series `id, id, v` to the 
python worker and send an `argOffsets` of `[1, 2]` to specify the data columns 
are `id, v`. The first value of the group column is used as the group key, 
because values in a group column are equal.

I implemented it this way because it doesn't change the existing 
serialization protocol. Alternatively, we can implement a new serialization 
protocol for GROUP_MAP eval type, i.e, instead of sending an arrow batch, we 
could send a group row and then an arrow batch. What do you think?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-17 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20295
  
How are you going to send the group columns? For a group we have only one 
group row and a bunch of data rows.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20295
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20295
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86290/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20295
  
**[Test build #86290 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86290/testReport)**
 for PR 20295 at commit 
[`7ce3fa7`](https://github.com/apache/spark/commit/7ce3fa79f8f63d4320c386ed81fa0b57d6bb7a91).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-17 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/20295
  
cc @ueshin @HyukjinKwon @cloud-fan @viirya 

This PR implements discussion here 
https://github.com/apache/spark/pull/20211#pullrequestreview-87657832. There 
are more refinement needs to be done but I'd like to get some early feedback 
whether this approach looks good  in general.

The general idea is to pass grouping columns as extra columns to the python 
worker and to use `argOffsets` to specify which columns are grouping columns. 
Finally, we convert a grouping columns to a single tuple before enter user 
function. This is slightly inefficient because grouping columns always have the 
same value for each group. But I think this is OK because grouping columns 
should be relatively small comparing to the entire DataFrame.

I can also implement some kind of de duplicate logic in 
`FlatMapGroupsInPandasExec`, however that would require creating another 
`UnsafeProjection`, I am not sure if it's worth it performance wise. 

WDYT?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...

2018-01-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20295
  
**[Test build #86290 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86290/testReport)**
 for PR 20295 at commit 
[`7ce3fa7`](https://github.com/apache/spark/commit/7ce3fa79f8f63d4320c386ed81fa0b57d6bb7a91).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: wip: [SPARK-23011] Support alternative function form wit...

2018-01-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20295
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86286/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: wip: [SPARK-23011] Support alternative function form wit...

2018-01-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20295
  
**[Test build #86286 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86286/testReport)**
 for PR 20295 at commit 
[`38195ac`](https://github.com/apache/spark/commit/38195ac7f0b9e1227cfc1e407de47e276b3fc43f).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: wip: [SPARK-23011] Support alternative function form wit...

2018-01-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20295
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20295: wip: [SPARK-23011] Support alternative function form wit...

2018-01-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20295
  
**[Test build #86286 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86286/testReport)**
 for PR 20295 at commit 
[`38195ac`](https://github.com/apache/spark/commit/38195ac7f0b9e1227cfc1e407de47e276b3fc43f).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org