[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-09-13 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18902 @MLnick Thanks for pinging me. I went through this quickly. The basic idea is the same: performing the operations on multiple input columns in a single Dataset/DataFrame operation.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-09-13 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/18902 cc @viirya on the multi-column generation issue - could there be a similar general solution to #17819?

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-09-13 Thread yanboliang
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/18902 Merged into master. Thanks to all.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-09-12 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 Any more comments on this PR? It has been about one month since the last modification.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18902 Sure. I will create a JIRA after this perf gap is confirmed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-09-04 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/18902 Seems fine to me to use the DF version even though it's slower. But we should open a JIRA issue to track where the gap is on the SQL side of things and try to improve the performance.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18902 hmm... that's interesting. So I found a performance gap between DataFrame codegen aggregation and simple RDD aggregation. I will discuss this with the SQL team later. Thanks!

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-09-03 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 @WeichenXu123 No, I only cache the DataFrame. And the RDD-Version is [here](https://github.com/apache/spark/pull/18902/commits/8daffc9007c65f04e005ffe5dcfbeca634480465). I use the same

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18902 +1 for using the DataFrame-based version. @zhengruifeng One thing I want to confirm: I checked your testing code, and both the RDD-based version and the DataFrame-based version will

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-28 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 @yanboliang Although disappointed by DF's performance, I also approve the choice of DF, if only because it needs less code.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-25 Thread yanboliang
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/18902 @zhengruifeng It is a known issue that DataFrame-based operations are 2~3x slower than RDD-based operations, because of the deserialization cost. If we switch to the RDD-based method, we need to implement our

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-17 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 @yanboliang The RDD-based impl is the one in the [former commit](https://github.com/apache/spark/pull/18902/commits/8daffc9007c65f04e005ffe5dcfbeca634480465)

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-17 Thread yanboliang
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/18902 @zhengruifeng What does _the RDD-based one_ mean? Is it the code on master or the code in your former commit? Thanks

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-17 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 @MLnick @yanboliang I updated the performance comparison. The DF-based impl is a little slower than the RDD-based one when the number of columns is small. When the number of columns is large (100),

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-17 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80780 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80780/testReport)** for PR 18902 at commit

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Merged build finished. Test PASSed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80780/ Test PASSed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-17 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/18902 @zhengruifeng Could you verify & compare the performance of this new DF-based approach vs your original RDD-based one?

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-17 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80780 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80780/testReport)** for PR 18902 at commit

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-17 Thread yanboliang
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/18902 @hhbyyh @zhengruifeng I'm OK with the _convert to null_ method. I think there is no extra pass over the data if we handle it this way, and the DataFrame/RDD functions to compute _mean/median_

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-16 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 I tested on DataFrames containing `null`: both `avg` and `stat.approxQuantile` ignore `null`. And if a column contains only `null`, `null` and `Array.empty[Double]` will be returned
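The null handling reported in this comment can be illustrated with a small plain-Python sketch. It does not call Spark: `mean_ignoring_null` and `median_ignoring_null` are hypothetical stand-ins for `avg` and `stat.approxQuantile`, `None` stands in for `null`, and the median here is an exact lower median rather than an approximate quantile.

```python
from typing import List, Optional, Sequence


def mean_ignoring_null(values: Sequence[Optional[float]]) -> Optional[float]:
    """Average the non-null entries; an all-null column yields None."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def median_ignoring_null(values: Sequence[Optional[float]]) -> List[float]:
    """Return a one-element list holding the lower median of the non-null
    entries, or an empty list when the column holds only nulls."""
    present = sorted(v for v in values if v is not None)
    if not present:
        return []
    return [present[(len(present) - 1) // 2]]


if __name__ == "__main__":
    print(mean_ignoring_null([1.0, None, 3.0]))
    print(median_ignoring_null([1.0, None, 3.0, 2.0]))
    print(mean_ignoring_null([None, None]), median_ignoring_null([None, None]))
```

A caller has to decide what an all-null result means for imputation; this sketch only surfaces the `None` and empty-list cases and does not pick a policy.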

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-16 Thread hhbyyh
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/18902 Thanks for the quick update. The implementation may be improved on some details. But first I'd want to confirm the "convert to null" method does not have any defect. @MLnick @srowen @yanboliang

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 @hhbyyh I rewrote the impl: all `NaN` and `missingValue` entries are now transformed to `null` first, then the current methods are used. For columns containing only `null`, `null` is
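A minimal plain-Python sketch of the "convert to null, then aggregate" idea described in this comment, assuming rows are tuples of floats. It is not the PR's code and does not use Spark; `to_null`, `one_pass_means`, and the `num_cols` parameter are hypothetical names introduced for illustration, with `None` standing in for `null`.

```python
import math
from typing import Iterable, List, Optional, Sequence


def to_null(value: Optional[float], missing_value: float) -> Optional[float]:
    """Map NaN and the missing_value placeholder to None (our stand-in for null)."""
    if value is None or math.isnan(value) or value == missing_value:
        return None
    return value


def one_pass_means(
    rows: Iterable[Sequence[Optional[float]]],
    num_cols: int,
    missing_value: float,
) -> List[Optional[float]]:
    """Compute the mean of every column in a single scan over the rows.

    Each column keeps its own running sum and count, so nulls in one
    column never affect another. A column with no valid entries yields None.
    """
    sums = [0.0] * num_cols
    counts = [0] * num_cols
    for row in rows:
        for i in range(num_cols):
            v = to_null(row[i], missing_value)
            if v is not None:
                sums[i] += v
                counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


if __name__ == "__main__":
    nan = float("nan")
    data = [(1.0, nan, -1.0), (3.0, -1.0, -1.0), (nan, 4.0, -1.0)]
    print(one_pass_means(data, 3, missing_value=-1.0))  # [2.0, 4.0, None]
```

Because the conversion happens inside the same loop as the accumulation, the data is scanned once regardless of how many columns are imputed, which is the property the thread is after.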

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Merged build finished. Test PASSed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80675/ Test PASSed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80675 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80675/testReport)** for PR 18902 at commit

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80675 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80675/testReport)** for PR 18902 at commit

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Merged build finished. Test PASSed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80667 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80667/testReport)** for PR 18902 at commit

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Merged build finished. Test PASSed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80667/ Test PASSed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80666/ Test PASSed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80666 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80666/testReport)** for PR 18902 at commit

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Merged build finished. Test PASSed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80663/ Test PASSed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80663 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80663/testReport)** for PR 18902 at commit

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80667 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80667/testReport)** for PR 18902 at commit

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80666 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80666/testReport)** for PR 18902 at commit

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80663 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80663/testReport)** for PR 18902 at commit

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 Jenkins, retest this please

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80660/ Test FAILed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Merged build finished. Test FAILed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80660 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80660/testReport)** for PR 18902 at commit

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80660 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80660/testReport)** for PR 18902 at commit

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 @hhbyyh Good Idea! We can also use this trick to compute median, because method

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-14 Thread hhbyyh
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/18902 Eh, I meant that it may be possible to get the mean values purely using the DataFrame API (convert missingValue/NaN to null) in one pass, so we may need to check the performance comparison. But I guess

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Merged build finished. Test PASSed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80653/ Test PASSed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-14 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80653 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80653/testReport)** for PR 18902 at commit

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-14 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80653 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80653/testReport)** for PR 18902 at commit

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-10 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 I tested the performance on a small dataset; the values in the following table are average durations in seconds: |numColumns| Old Mean | Old Median | New Mean | New Median |

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-10 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 @hhbyyh Yes, I will test the performance. Btw, the median computation via `stat.approxQuantile` will also transform the df into an rdd before aggregation. See

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-10 Thread hhbyyh
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/18902 Hi @zhengruifeng Thanks for the idea and implementation. Definitely something worth exploring. As I understand, the new implementation improves the locality yet it leverages RDD API

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80479/ Test PASSed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Merged build finished. Test PASSed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-10 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80479 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80479/testReport)** for PR 18902 at commit

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-10 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80479 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80479/testReport)** for PR 18902 at commit

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-10 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 Jenkins, retest this please

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80478/ Test FAILed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Merged build finished. Test FAILed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-10 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80478 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80478/testReport)** for PR 18902 at commit

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-10 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80477 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80477/testReport)** for PR 18902 at commit

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-10 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80478 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80478/testReport)** for PR 18902 at commit

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80477/ Test FAILed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Merged build finished. Test FAILed.

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-10 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80477 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80477/testReport)** for PR 18902 at commit