[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-30 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556 OK, weight has been removed when calculating. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-29 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/17556 The bucketing is trying to bucket into buckets of equal P(x). It's a condition on P(y | x). That said the right point isn't knowable from the training data, and splitting to balance P(x) on

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-28 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556 By the way, it's safe to use the mean value, as it matches the other libraries. If requested, I'd like to modify the PR.

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-28 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556 For a (training) sample of a continuous series, say {x0, x1, x2, x3, ..., x100}: Spark currently selects quantiles as split points. Suppose 10-quantiles are used, and x2 is the 1st quantile, and x10 is the 2nd
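
As a rough illustration of quantile-based split candidates like those described in the comment above, here is a minimal Python sketch. It is not Spark's actual implementation: the function name `quantile_candidates`, the exact (unsampled) sort, and the nearest-rank indexing rule are all assumptions made for this example.

```python
def quantile_candidates(values, num_bins):
    """Return candidate split values taken at the q-quantile boundaries.

    Hypothetical helper for illustration only; a real system may sample
    the data or use approximate quantiles instead of a full sort.
    """
    if num_bins < 2:
        raise ValueError("num_bins must be at least 2")
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []

    candidates = []
    for k in range(1, num_bins):
        # Nearest-rank style index of the k-th boundary (an assumption).
        idx = min((k * n) // num_bins, n - 1)
        value = ordered[idx]
        # Skip repeated values so heavily duplicated data yields fewer candidates.
        if not candidates or candidates[-1] != value:
            candidates.append(value)
    return candidates


# 101 evenly spaced observations with 10 bins give 9 boundaries.
print(quantile_candidates(list(range(101)), 10))
# A feature with many repeated values collapses to fewer distinct candidates.
print(quantile_candidates([1] * 50 + [2] * 50, 4))
```

The second call shows why repeated values matter in this discussion: when many observations share the same value, several quantile boundaries land on it, so the choice of threshold between two adjacent distinct values becomes the interesting question.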

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-28 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/17556 Ah OK I should think about this more first. Say you have a continuous predictor x and binary output y. Say the optimal split is found to be between 0.1 and 0.2, with 1 observation of 0.1 and 99 of

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-28 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/17556 @sethah what's the issue there ... train/test ought to be from the same distribution, in theory. The empirical distribution of the test data will of course be a little different, but what is the

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-27 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17556 I don't mind the weighted midpoints. However, if for a continuous feature we find that many points have the exact same value, we are assuming we may find data points in the test set that are close

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-26 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17556 **[Test build #3677 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3677/testReport)** for PR 17556 at commit

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-26 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17556 **[Test build #3677 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3677/testReport)** for PR 17556 at commit

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-25 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556 Fixed the failed case, please retest it.

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17556 **[Test build #3673 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3673/testReport)** for PR 17556 at commit

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17556 **[Test build #3673 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3673/testReport)** for PR 17556 at commit

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-23 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556 I scanned the split criteria of sklearn and xgboost. 1. sklearn counts all continuous values and splits at the mean value. commit 5147fd09c6a063188efde444f47bd006fa5f95f0

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-23 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/17556 That's good info. It's a tough call -- matching a known package is always nice. However I agree that a weighted split is a little more theoretically sound (don't have a reference on that though).

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-22 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556 Hi, I have checked R GBM's code and found that R's gbm uses the mean value $(x + y) / 2$, not the weighted mean $(c_x * x + c_y * y) / (c_x + c_y)$ described in [JIRA
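
To make the difference between the two formulas in the comment above concrete, here is a small Python sketch. The function names and the sample counts are illustrative assumptions, not code from Spark, R's gbm, or the PR; `x` and `y` are the two adjacent distinct feature values and `c_x`, `c_y` their observation counts.

```python
def midpoint(x, y):
    """Plain mean of two adjacent feature values: (x + y) / 2."""
    return (x + y) / 2.0


def weighted_midpoint(x, c_x, y, c_y):
    """Count-weighted mean: (c_x * x + c_y * y) / (c_x + c_y)."""
    if c_x <= 0 or c_y <= 0:
        raise ValueError("counts must be positive")
    return (c_x * x + c_y * y) / (c_x + c_y)


# Illustrative numbers only: 1 observation at 0.1 and 99 observations at 0.2.
print(midpoint(0.1, 0.2))                   # approximately 0.15
print(weighted_midpoint(0.1, 1, 0.2, 99))   # approximately 0.199
```

With equal counts the two formulas agree; with unequal counts the weighted formula as written moves the threshold toward the value that has more observations.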

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-14 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556 @sethah Perhaps it's hard to compare R's behavior with Spark's, since many factors are involved. I'd like to read R GBM's code and verify that both sides' designs are consistent on split criteria. Is it OK?

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-13 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17556 Seems like a reasonable change. Just left some minor comments.

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-13 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17556 If we are attempting to match R GBM, it would be great to show, at least on the PR, that we get the same results.

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-13 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556 many thanks, @srowen

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-13 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/17556 It's looking good, and the R tests pass. I'll also ask @mengxr or maybe @dbtsai if they have any concerns about this change?

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-13 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17556 **[Test build #3662 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3662/testReport)** for PR 17556 at commit

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-13 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17556 **[Test build #3662 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3662/testReport)** for PR 17556 at commit

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-12 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556 I have run all the MLlib unit test cases in Python. However, I am not familiar with R, and I don't want to waste too much time on deploying R's environment. Could CI retest the PR? We can

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-11 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/17556 http://spark.apache.org/docs/latest/building-spark.html

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-10 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556 @srowen Hi, I forgot the unit tests in Python and R. Where can I find documentation about setting up a development environment? Thanks.

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17556 **[Test build #3655 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3655/testReport)** for PR 17556 at commit

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17556 **[Test build #3655 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3655/testReport)** for PR 17556 at commit

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17556 **[Test build #3654 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3654/testReport)** for PR 17556 at commit

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17556 **[Test build #3654 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3654/testReport)** for PR 17556 at commit

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-10 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/17556 Just a flaky test. Can't be related

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-10 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556 ``` Test Result (1 failure / +1) org.apache.spark.storage.TopologyAwareBlockReplicationPolicyBehavior.Peers in 2 racks ``` Does anyone know what this is?

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-10 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556 Is there something wrong with Spark CI?

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-09 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17556 **[Test build #3652 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3652/testReport)** for PR 17556 at commit

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-07 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/17556 It seems OK to me but @sethah or @jkbradley might be good as a second set of eyes. It does slightly alter behavior, but, it does seem like something that should work better in general.

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17556 Can one of the admins verify this patch?