Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18902
@MLnick Thanks for pinging me.
I went through this quickly. The basic idea is the same: performing the
operations on multiple input columns in a single Dataset/DataFrame operation.
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/18902
cc @viirya on the multi-column generation issue - could there be a general
solution similar to #17819?
---
-
To unsubscribe, e-mail:
Github user yanboliang commented on the issue:
https://github.com/apache/spark/pull/18902
Merged into master. Thanks, all.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/18902
Any more comments on this PR? It has been about one month since the last
modification.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18902
Sure. I will create a JIRA once this perf gap is confirmed.
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/18902
Seems fine to me to use the DF version even though it's slower. But we
should open a JIRA issue to track where the gap is on the SQL side of things
and try to improve the performance.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18902
Hmm... that's interesting. So I found a performance gap between DataFrame
codegen aggregation and simple RDD aggregation. I will discuss this with the
SQL team later. Thanks!
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/18902
@WeichenXu123 No, I only cache the DataFrame. The RDD version is
[here](https://github.com/apache/spark/pull/18902/commits/8daffc9007c65f04e005ffe5dcfbeca634480465).
I use the same
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18902
+1 for using the DataFrame-based version.
@zhengruifeng One thing I want to confirm: I checked your testing
code, and both the RDD-based version and the DataFrame-based version will
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/18902
@yanboliang Although disappointed by DF's performance, I also approve the
choice of DF, if only because it needs less code.
---
If your project is set up for it, you can reply to this email and have your
reply
Github user yanboliang commented on the issue:
https://github.com/apache/spark/pull/18902
@zhengruifeng It is a known issue that DataFrame-based operations are 2~3x
slower than RDD-based operations, because of the deserialization cost. If we
switch to the RDD-based method, we need to implement our
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/18902
@yanboliang The RDD-based impl is in the
[former commit](https://github.com/apache/spark/pull/18902/commits/8daffc9007c65f04e005ffe5dcfbeca634480465)
Github user yanboliang commented on the issue:
https://github.com/apache/spark/pull/18902
@zhengruifeng What does _the RDD-based one_ mean? Is it the code on master or
the code in your former commit? Thanks
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/18902
@MLnick @yanboliang I updated the performance comparison.
The DF-based impl is a little slower than the RDD-based one when the number of
columns is small.
When the number of columns is large (100),
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18902
**[Test build #80780 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80780/testReport)**
for PR 18902 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18902
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18902
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80780/
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/18902
@zhengruifeng Could you verify & compare the performance of this new
DF-based approach vs your original RDD-based one?
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18902
**[Test build #80780 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80780/testReport)**
for PR 18902 at commit
Github user yanboliang commented on the issue:
https://github.com/apache/spark/pull/18902
@hhbyyh @zhengruifeng I'm OK with the _convert to null_ method. I think
there is no extra pass over the data if we handle it this way, and the
DataFrame/RDD functions to compute _mean/median_
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/18902
I tested on DataFrames containing `null`: both `avg` and
`stat.approxQuantile` ignore `null`. And if a column contains only
`null`, `null` and `Array.empty[Double]` will be returned
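Editorial sketch: the null-skipping behaviour reported in this comment can be modelled in a few lines of plain Python. This is not Spark code and not the PR's implementation; `to_null` and `column_means` are hypothetical names, and `None` stands in for SQL `null`. The sketch assumes NaN and a configured missing value are first mapped to `None`, after which a single pass over the rows accumulates the mean of every column while skipping `None`. A column holding only `None` gets a `None` mean, mirroring the `null` result for `avg` described above.

```python
import math
from typing import Dict, List, Optional, Sequence


def to_null(value: Optional[float], missing_value: float) -> Optional[float]:
    """Map NaN and the configured missing value to None; keep other values."""
    if value is None or math.isnan(value) or value == missing_value:
        return None
    return value


def column_means(
    rows: List[Dict[str, Optional[float]]],
    columns: Sequence[str],
    missing_value: float,
) -> Dict[str, Optional[float]]:
    """Compute the mean of each column in one pass, ignoring None entries."""
    sums = {c: 0.0 for c in columns}
    counts = {c: 0 for c in columns}
    for row in rows:  # a single pass over the data covers all columns
        for c in columns:
            v = to_null(row.get(c), missing_value)
            if v is not None:
                sums[c] += v
                counts[c] += 1
    # A column with no usable values has no mean, so report None for it.
    return {c: (sums[c] / counts[c] if counts[c] else None) for c in columns}


rows = [
    {"a": 1.0, "b": float("nan"), "c": -1.0},
    {"a": 3.0, "b": 4.0, "c": float("nan")},
    {"a": -1.0, "b": 6.0, "c": None},
]
means = column_means(rows, ["a", "b", "c"], missing_value=-1.0)
print(means)
```

Here column `c` contains only missing entries after conversion, so its mean is `None` rather than an error or `0.0`.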
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/18902
Thanks for the quick update. The implementation could be improved in some
details, but first I'd like to confirm that the "convert to null" method does
not have any defects.
@MLnick @srowen @yanboliang
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/18902
@hhbyyh I rewrote the impl, and now all `NaN` and `missingValue` values are
first transformed to `null`, then the current methods are used.
For columns only containing `null`, `null` is
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18902
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18902
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80675/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18902
**[Test build #80675 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80675/testReport)**
for PR 18902 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18902
**[Test build #80675 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80675/testReport)**
for PR 18902 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18902
Merged build finished. Test PASSed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18902
**[Test build #80667 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80667/testReport)**
for PR 18902 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18902
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18902
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80667/
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18902
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80666/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18902
**[Test build #80666 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80666/testReport)**
for PR 18902 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18902
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18902
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80663/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18902
**[Test build #80663 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80663/testReport)**
for PR 18902 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18902
**[Test build #80667 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80667/testReport)**
for PR 18902 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18902
**[Test build #80666 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80666/testReport)**
for PR 18902 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18902
**[Test build #80663 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80663/testReport)**
for PR 18902 at commit
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/18902
Jenkins, retest this please
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18902
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80660/
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18902
Merged build finished. Test FAILed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18902
**[Test build #80660 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80660/testReport)**
for PR 18902 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18902
**[Test build #80660 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80660/testReport)**
for PR 18902 at commit
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/18902
@hhbyyh Good idea! We can also use this trick to compute the median, because
method
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/18902
Eh, I meant that it may be possible to get the mean values purely using the
DataFrame API (converting missingValue/NaN to null) in one pass, so we may need
to check the performance comparison. But I guess
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18902
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18902
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80653/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18902
**[Test build #80653 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80653/testReport)**
for PR 18902 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18902
**[Test build #80653 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80653/testReport)**
for PR 18902 at commit
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/18902
I tested the performance on a small dataset; the values in the following table
are the average durations in seconds:
|numColumns| Old Mean | Old Median | New Mean | New Median |
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/18902
@hhbyyh Yes, I will test the performance.
Btw, the median computation by calling `stat.approxQuantile` will also
transform the df into an rdd before aggregation. See
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/18902
Hi @zhengruifeng Thanks for the idea and implementation. Definitely
something worth exploring.
As I understand, the new implementation improves the locality yet it
leverages RDD API
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18902
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80479/
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18902
Merged build finished. Test PASSed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18902
**[Test build #80479 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80479/testReport)**
for PR 18902 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18902
**[Test build #80479 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80479/testReport)**
for PR 18902 at commit
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/18902
Jenkins, retest this please
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18902
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80478/
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18902
Merged build finished. Test FAILed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18902
**[Test build #80478 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80478/testReport)**
for PR 18902 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18902
**[Test build #80477 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80477/testReport)**
for PR 18902 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18902
**[Test build #80478 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80478/testReport)**
for PR 18902 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18902
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80477/
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18902
Merged build finished. Test FAILed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18902
**[Test build #80477 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80477/testReport)**
for PR 18902 at commit