[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

2017-05-18 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17819 @MLnick That's right. I have the same concern. However, keeping the original single-column Bucketizer and the multiple-column Bucketizer in one class also seems to produce messy code.
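
The trade-off in this exchange (a separate `Multi` class versus one class that handles both cases) can be sketched in plain Python. This is a hypothetical illustration, not the code in the PR and not Spark's API: `SketchBucketizer`, `bucket_index`, and the `_bucket` output suffix are names invented here. The only Spark behaviour assumed is the usual Bucketizer convention that bucket `i` covers `[splits[i], splits[i+1])` and the last bucket also includes its upper bound.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence


def bucket_index(value: float, splits: Sequence[float]) -> int:
    """Return the bucket for value; the last bucket includes its upper bound."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the range covered by the splits")
    if value == splits[-1]:
        return len(splits) - 2
    return bisect_right(splits, value) - 1


class SketchBucketizer:
    """Hypothetical single class: one column is just a one-entry mapping."""

    def __init__(self, splits_by_column: Dict[str, Sequence[float]]):
        self.splits_by_column = dict(splits_by_column)

    @classmethod
    def single(cls, column: str, splits: Sequence[float]) -> "SketchBucketizer":
        # Single-column use goes through the same code path as multi-column use.
        return cls({column: splits})

    def transform(self, rows: List[Dict[str, float]]) -> List[Dict[str, float]]:
        out = []
        for row in rows:
            new_row = dict(row)
            for column, splits in self.splits_by_column.items():
                new_row[column + "_bucket"] = bucket_index(row[column], splits)
            out.append(new_row)
        return out


rows = [{"x": 5.0, "y": 20.0}]
multi = SketchBucketizer({"x": [0.0, 10.0, 20.0], "y": [0.0, 10.0, 20.0]})
one = SketchBucketizer.single("x", [0.0, 10.0, 20.0])
print(multi.transform(rows))
print(one.transform(rows))
```

In this sketch the single-column case costs only one extra constructor. The messiness discussed above comes from keeping two parallel sets of parameters on a real `Transformer`, which this toy does not model.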

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

2017-05-18 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17819 I will try to take a look soon. My main concern is whether we should really have a new class - it starts to make things really messy if we introduce `Multi` versions of everything (e.g. we may want

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

2017-05-18 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17819 ping @MLnick Do you have more comments on this? Thanks.

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

2017-05-10 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17819 @barrybecker4 The `withColumns` API is first introduced in this PR, so you won't see it in Spark 2.1.1 or the current codebase. Thanks for pointing me to SPARK-12225. Yes, it is related.

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

2017-05-09 Thread barrybecker4
Github user barrybecker4 commented on the issue: https://github.com/apache/spark/pull/17819 I don't see support for `withColumns` in Spark 2.1.1. In which version does it first appear? This work seems related to https://issues.apache.org/jira/browse/SPARK-12225.

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

2017-05-04 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17819 Note: a `Transformer` might apply other manipulations to the dataset, such as dropping NaN values, so the idea above won't work in that case.

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

2017-05-04 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17819 The projections will be collapsed during optimization, so they don't affect query execution. However, every `withColumn` call creates a new `DataFrame` along with a projection on previous logical
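
The point about one projection per `withColumn` call can be illustrated with a toy model in plain Python. This is a sketch under stated assumptions, not Spark code: `Project`, `with_column`, `with_columns`, and `depth` are hypothetical stand-ins for a logical-plan node and the two APIs, and the column count of 100 is arbitrary. It models only the unoptimized plan, before any projections are collapsed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Project:
    """Toy stand-in for a logical-plan projection node (not Spark's class)."""
    columns: Tuple[str, ...]
    child: Optional["Project"] = None


def with_column(plan: Project, name: str) -> Project:
    # One call adds one column and wraps the previous plan in a new projection.
    return Project(plan.columns + (name,), child=plan)


def with_columns(plan: Project, names: Tuple[str, ...]) -> Project:
    # One call adds every column and wraps the previous plan only once.
    return Project(plan.columns + tuple(names), child=plan)


def depth(plan: Optional[Project]) -> int:
    # Number of projection nodes stacked in the unoptimized plan.
    return 0 if plan is None else 1 + depth(plan.child)


base = Project(("a", "b"))
new_names = tuple(f"bucket_{i}" for i in range(100))  # arbitrary count

chained = base
for name in new_names:
    chained = with_column(chained, name)

single = with_columns(base, new_names)

# Same output columns either way; only the number of stacked nodes differs.
print(depth(chained), depth(single))
```

Both plans expose the same columns, but the chained version builds one intermediate plan object per added column, which is the overhead the comment above describes.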

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

2017-05-04 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17819 Thanks. Result does look good. So the improvement is really coming from the new `withColumns` that avoids a bunch of projections in the plan in favor of one (more or less)? So the same

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

2017-05-04 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17819 @MLnick I've done a benchmark using the test dataset provided in JIRA SPARK-20392 (blockbuster.csv). The ML pipeline includes 2 `StringIndexer`s and 1 `MultipleBucketizer` or 137

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

2017-05-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17819 Merged build finished. Test PASSed.

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

2017-05-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17819 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76406/

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

2017-05-02 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17819 **[Test build #76406 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76406/testReport)** for PR 17819 at commit

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

2017-05-02 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17819 **[Test build #76406 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76406/testReport)** for PR 17819 at commit

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

2017-05-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17819 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76379/

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

2017-05-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17819 Merged build finished. Test PASSed.

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

2017-05-02 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17819 **[Test build #76379 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76379/testReport)** for PR 17819 at commit

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

2017-05-02 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17819 **[Test build #76379 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76379/testReport)** for PR 17819 at commit

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

2017-05-02 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17819 @MLnick Ok. Let me prepare the comparisons.

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

2017-05-02 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17819 @viirya can you post some performance comparisons for this?