Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
Thanks @gatorsmile @WeichenXu123 @zhengruifeng
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/19229
Thanks! Merged to master.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail:
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/19229
LGTM
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.ap
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19229
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82369/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19229
Merged build finished. Test PASSed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional comma
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19229
**[Test build #82369 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82369/testReport)**
for PR 19229 at commit
[`1292ce0`](https://github.com/apache/spark/commit/1
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19229
**[Test build #82369 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82369/testReport)**
for PR 19229 at commit
[`1292ce0`](https://github.com/apache/spark/commit/12
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
@gatorsmile `withColumn` is reimplemented now. Please take a look when you
have time. Thanks.
---
-
To unsubscribe, e-mail: revie
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19229
**[Test build #82347 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82347/testReport)**
for PR 19229 at commit
[`21048a8`](https://github.com/apache/spark/commit/2
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19229
Merged build finished. Test PASSed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional comma
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19229
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82347/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19229
**[Test build #82347 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82347/testReport)**
for PR 19229 at commit
[`21048a8`](https://github.com/apache/spark/commit/21
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
retest this please.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19229
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82344/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19229
**[Test build #82344 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82344/testReport)**
for PR 19229 at commit
[`21048a8`](https://github.com/apache/spark/commit/2
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19229
Merged build finished. Test FAILed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional comma
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19229
**[Test build #82344 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82344/testReport)**
for PR 19229 at commit
[`21048a8`](https://github.com/apache/spark/commit/21
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
also cc @jkbradley and @MLnick for final check of the ML change. Thanks.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.a
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
ping @gatorsmile Can you take a quick look? Thanks.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additio
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
ping @gatorsmile for the SQL part.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-m
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
@gatorsmile Added the check for case sensitivity. Please take a look again.
Thanks.
---
-
To unsubscribe, e-mail: reviews-unsubsc
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19229
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82141/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19229
Merged build finished. Test PASSed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional comma
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19229
**[Test build #82141 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82141/testReport)**
for PR 19229 at commit
[`07dec0f`](https://github.com/apache/spark/commit/0
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19229
**[Test build #82141 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82141/testReport)**
for PR 19229 at commit
[`07dec0f`](https://github.com/apache/spark/commit/07
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
@WeichenXu123 Have any more comments on this? Thanks. I think the ML part
is straightforward.
---
-
To unsubscribe, e-mail: revie
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19229
@viirya Yeah the perf gap I only focus on `mean` which can take advantage
of codegen.
---
-
To unsubscribe, e-mail: reviews
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
Yeah, I think that fix should work for the strategy `Imputer.mean` because
`Imputer.mean` aggregates many columns at once now and that can be a too large
gen'd code for aggregation.
For the
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19229
The performance gap issue (compared with RDD version), I create a separated
JIRA to track:
https://issues.apache.org/jira/browse/SPARK-22105
As the result of discussion with @cloud-fan ,
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/19229
I am not familiar with SQL source, but I think it's great to transform all
columns at a time
---
-
To unsubscribe, e-mail:
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
ping @zhengruifeng @WeichenXu123 Any more comments on this? Thanks.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19229
Merged build finished. Test PASSed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional comma
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19229
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81964/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19229
**[Test build #81964 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81964/testReport)**
for PR 19229 at commit
[`2086900`](https://github.com/apache/spark/commit/2
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19229
@viirya Thanks very much! Although the perf gap exists (when numCols is
large), it won't block this PR. I will create a JIRA to track this.
---
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
numColums | RDD Mean | RDD Median | DataFrame Mean | DataFrame Median
-- | -- | -- | -- | --
1 | 0.1642173481 | 0.199774305 | 0.4260180671006 | 0.2025112919
10 | 0.3713707549 | 0.529010
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
I ran the test codes to benchmark RDD-version and DataFrame version with
this `ImputerModel` change:
import org.apache.spark.ml.feature._
import org.apache.spark.sql.{DataFrame, R
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19229
**[Test build #81964 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81964/testReport)**
for PR 19229 at commit
[`2086900`](https://github.com/apache/spark/commit/20
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19229
Oh. That's what have done in the old PR #18902 .(Because the RDD version
(not in master branch, only personal impl here (sorry for put wrong link, the
code link is here:
https://github.com/apa
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
@WeichenXu123 I'm not sure I understand it correctly. This change only
replaces the chain of `withColumn` to a pass of `withColumns`. We don't have
RDD version for this, so I'm not sure what version
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19229
@viirya No, keep the dataframe version code. But I only want to confirm how
much performance gap between this and RDD version. (for possible improvements
in the future, because in similar test
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
@WeichenXu123 Thanks for verifying that.
Do you mean using ApproxQuantiles to compute mean and median? But I think
this change is not intended to improve this part.
---
---
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19229
@viirya I run the code, you're right, most of time cost on the executedPlan
generation (The old version code). thanks!
But can you append benchmark comparison with `RDD.aggregate` version?
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19229
Looks not the reason. maybe issues somewhere else. Let me run test later.
Thanks!
But there is some small issues in test:
Don't include gen data time:
```
val start = System.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
Btw, I don't see any hint about
`df.withColumn(..).withColumn(..).withColumn(..)` can prevent the shuffle
re-using.
---
-
To uns
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
I don't think re-using shuffle is the reason behind the numbers. If you
looked at the previous comments, you will find that I ran the test before
without `count` after `model.transform`. Namely the d
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
New numbers:
numColums | Old Mean | Old Median | New Mean | New Median
-- | -- | -- | -- | --
1 | 0.1727832915997 | 0.1537169693 | 0.1687325048997 | 0.1521283075
10 | 0.325
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19229
Great! That's it. thanks!
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
Updated test codes:
import org.apache.spark.ml.feature._
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types._
import spark.implicits._
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19229
@viirya I guess the reason is, the old PR version:
`df.withColumn(..).withColumn(..).withColumn(..)`, the long df chain
prevent the shuffle re-using... but now you merge them into one step.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
@WeichenXu123 Btw, the test is basically re-using the codes from
https://github.com/apache/spark/pull/18902#issuecomment-321727416. Is your
concern is specified for this?
---
-
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
@WeichenXu123 Sure. And I must point out that I ran this benchmark in
spark-shell under local mode. It is great if you can run the benchmark too to
verify the numbers.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19229
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81756/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19229
Merged build finished. Test PASSed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional comma
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19229
**[Test build #81756 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81756/testReport)**
for PR 19229 at commit
[`4b47709`](https://github.com/apache/spark/commit/4
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19229
Merged build finished. Test PASSed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional comma
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19229
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81755/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19229
**[Test build #81755 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81755/testReport)**
for PR 19229 at commit
[`4efb643`](https://github.com/apache/spark/commit/4
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
@zhengruifeng Yeah, it is better. Actually the difference between running
multiple `withColumn` and one `withColumns` is mainly in the cost of query
analysis and plan/dataset initialization. I will
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/19229
In the test code, should we use `model.transform(df).count` instead?
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
FYI, the `withColumns` API was proposed in #17819.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For addition
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
Ran the similar benchmark as
https://github.com/apache/spark/pull/18902#issuecomment-321727416:
numColums | Old Mean | Old Median | New Mean | New Median
-- | -- | -- | -- | --
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19229
**[Test build #81756 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81756/testReport)**
for PR 19229 at commit
[`4b47709`](https://github.com/apache/spark/commit/4b
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19229
cc @MLnick @zhengruifeng @yanboliang
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands,
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19229
**[Test build #81755 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81755/testReport)**
for PR 19229 at commit
[`4efb643`](https://github.com/apache/spark/commit/4e
65 matches
Mail list logo