Github user viirya commented on the issue:
https://github.com/apache/spark/pull/17819
@MLnick That's right. I also have concerns about this. However, keeping the
original single-column Bucketizer and the multiple-column Bucketizer in one
class also seems to produce messy code.
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17819
I will try to take a look soon. My main concern is whether we should really
have a new class - it starts to make things really messy if we introduce
`Multi` versions of everything (e.g. we may want
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/17819
ping @MLnick Do you have more comments on this? Thanks.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/17819
@barrybecker4 The `withColumns` API is first introduced in this PR, so you
won't see it in Spark 2.1.1 or the current codebase. Thanks for letting me know
about SPARK-12225. Yes, it is related.
Github user barrybecker4 commented on the issue:
https://github.com/apache/spark/pull/17819
I don't see support for `withColumns` in Spark 2.1.1. In which version does it
first appear? This work seems related to
https://issues.apache.org/jira/browse/SPARK-12225.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/17819
Note: a `Transformer` might perform other manipulations on the dataset, such
as dropping NaN values, so the idea above won't work in that case.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/17819
The projections will be collapsed during optimization, so they don't
affect query execution. However, every `withColumn` call creates a new
`DataFrame` along with a projection on the previous logical
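To illustrate the point about chained `withColumn` calls, here is a minimal sketch in plain Python. It is a toy model, not Spark's code: the `Relation` and `Project` classes and the helpers `with_column`, `with_columns`, `depth`, and `collapse` are hypothetical names invented for this example. It only shows that adding N columns one at a time stacks N projection nodes before any optimization runs, while a single multi-column call builds one, and that merging adjacent projections yields the same final plan.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Relation:
    """Leaf node: a source table with its column names (toy model)."""
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Project:
    """Projection node that adds `new_columns` on top of `child` (toy model)."""
    new_columns: Tuple[str, ...]
    child: object  # a Relation or another Project


def with_column(plan, name):
    # Each call wraps the previous plan in one more projection node.
    return Project((name,), plan)


def with_columns(plan, names):
    # A single call adds every column in one projection node.
    return Project(tuple(names), plan)


def depth(plan):
    """Count the projection nodes stacked above the leaf relation."""
    count = 0
    while isinstance(plan, Project):
        count += 1
        plan = plan.child
    return count


def collapse(plan):
    """Merge adjacent projections, loosely imitating an optimizer rule."""
    if not isinstance(plan, Project):
        return plan
    child = collapse(plan.child)
    if isinstance(child, Project):
        return Project(child.new_columns + plan.new_columns, child.child)
    return Project(plan.new_columns, child)


base = Relation(("a", "b"))
names = [f"bucket_{i}" for i in range(100)]

# One projection per added column.
chained = base
for name in names:
    chained = with_column(chained, name)

# One projection for all added columns.
single = with_columns(base, names)

print(depth(chained), depth(single), collapse(chained) == single)
```

In this toy model the two plans are identical after collapsing, which matches the claim that execution is unaffected; the difference lies in how many intermediate nodes are built and traversed before that step.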
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17819
Thanks. Result does look good.
So the improvement is really coming from the new `withColumns` that avoids
a bunch of projections in the plan in favor of one (more or less)? So the same
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/17819
@MLnick I've done a benchmark using the test dataset provided in JIRA
SPARK-20392 (blockbuster.csv).
The ML pipeline includes 2 `StringIndexer`s and 1 `MultipleBucketizer` or
137
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17819
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17819
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76406/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17819
**[Test build #76406 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76406/testReport)**
for PR 17819 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17819
**[Test build #76406 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76406/testReport)**
for PR 17819 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17819
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76379/
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17819
Merged build finished. Test PASSed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17819
**[Test build #76379 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76379/testReport)**
for PR 17819 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17819
**[Test build #76379 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76379/testReport)**
for PR 17819 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/17819
@MLnick Ok. Let me prepare the comparisons.
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17819
@viirya can you post some performance comparisons for this?