[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-20 Thread asfgit
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/2780 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enab

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-20 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-59831875 Merged into master. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have t

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-59826477 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/404/consoleFull) for PR 2780 at commit [`18d0301`](https://github.com/

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-20 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-59816923 @chouqin Thanks for the update! LGTM once the tests pass. @manishamde At some point, I hope the histogram functionality can be part of mllib/statistics/ espe

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-59816513 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/404/consoleFull) for PR 2780 at commit [`18d0301`](https://github.com/a

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-19 Thread chouqin
Github user chouqin commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-59676055 @jkbradley I updated unit test to check splits returned by `findSplitsForContinuousFeature` are distinct. I have run the unit test for it and it passed. @mengxr

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-19 Thread manishamde
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-59664847 @jkbradley I read the paper by Sanku et al and other papers but they required a custom implementation. The sort method has worked OK so far but I was hoping somebody w

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-19 Thread manishamde
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-59664449 @chouqin LGTM. :+1: --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-19 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-59661579 @chouqin Thanks for the updates! The updates look good. One more small comment: Could you please add explicit checks in the unit tests to make sure the returne

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-18 Thread chouqin
Github user chouqin commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-59637234 Jekins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have th

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-18 Thread chouqin
Github user chouqin commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-59637054 @manishamde @jkbradley thanks for your comments, I have changed my code now. Do you have any more suggestions? --- If your project is set up for it, you can reply to thi

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-18 Thread chouqin
Github user chouqin commented on a diff in the pull request: https://github.com/apache/spark/pull/2780#discussion_r19057399 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -1011,4 +1014,99 @@ object DecisionTree extends Serializable with Loggin

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-18 Thread chouqin
Github user chouqin commented on a diff in the pull request: https://github.com/apache/spark/pull/2780#discussion_r19057305 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -1011,4 +1014,99 @@ object DecisionTree extends Serializable with Loggin

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-18 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-59603338 @chouqin Thanks for this PR! This method should be a real improvement. I added some small comments inline. My main concern right now is testing edge cases, l

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2780#discussion_r19052550 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -1011,4 +1014,99 @@ object DecisionTree extends Serializable with Logg

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2780#discussion_r19052553 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -1011,4 +1014,99 @@ object DecisionTree extends Serializable with Logg

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2780#discussion_r19052548 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -912,16 +913,18 @@ object DecisionTree extends Serializable with Loggin

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2780#discussion_r19052546 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -912,16 +913,18 @@ object DecisionTree extends Serializable with Loggin

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2780#discussion_r19052555 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -1011,4 +1014,99 @@ object DecisionTree extends Serializable with Logg

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2780#discussion_r19052556 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -1011,4 +1014,99 @@ object DecisionTree extends Serializable with Logg

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2780#discussion_r19052549 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -1011,4 +1014,99 @@ object DecisionTree extends Serializable with Logg

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2780#discussion_r19052547 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -912,16 +913,18 @@ object DecisionTree extends Serializable with Loggin

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2780#discussion_r19052551 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -1011,4 +1014,99 @@ object DecisionTree extends Serializable with Logg

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2780#discussion_r19052554 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -1011,4 +1014,99 @@ object DecisionTree extends Serializable with Logg

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2780#discussion_r19052544 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -36,6 +36,7 @@ import org.apache.spark.mllib.tree.model._ import o

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-17 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-59590801 @chouqin Sorry for the slow response! About the RandomForestSuite failure: The change to fix the failure (maxBins) is OK with me. It is a somewhat brittle tes

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-14 Thread manishamde
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-59064525 @chouqin Good point. The sort operation is the bottleneck, not the value count. It might be good to check numbers with a test run to verify whether the performance hit

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-14 Thread chouqin
Github user chouqin commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-59007084 @manishamde thanks for your comments. I will adjust my code after #2785 gets merged. As for performance, yes, this is slower than the current implementation, bu

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread manishamde
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-58991243 @chouqin Thanks for the PR. It looks good to me in general. If I understand correctly, we are histogramming (value counting) each feature instead of just simpl

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread manishamde
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/2780#discussion_r18809250 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -1011,4 +1014,99 @@ object DecisionTree extends Serializable with Log

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread manishamde
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/2780#discussion_r18809178 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala --- @@ -102,6 +102,37 @@ class DecisionTreeSuite extends FunSuite with

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread manishamde
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/2780#discussion_r18809123 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -1011,4 +1014,99 @@ object DecisionTree extends Serializable with Log

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread manishamde
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/2780#discussion_r18808821 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -1011,4 +1014,99 @@ object DecisionTree extends Serializable with Log

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread manishamde
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/2780#discussion_r18808756 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -1011,4 +1014,99 @@ object DecisionTree extends Serializable with Log

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread manishamde
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/2780#discussion_r18808651 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/RandomForestSuite.scala --- @@ -93,7 +93,8 @@ class RandomForestSuite extends FunSuite with Loc

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-58883501 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21686/consoleFull) for PR 2780 at commit [`d353596`](https://github.com/a

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-58883509 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-58876511 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21686/consoleFull) for PR 2780 at commit [`d353596`](https://github.com/ap

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-58872495 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21685/consoleFull) for PR 2780 at commit [`9e64699`](https://github.com/a

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-58872500 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-58872046 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21684/consoleFull) for PR 2780 at commit [`9e64699`](https://github.com/a

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-58872054 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread chouqin
Github user chouqin commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-58865951 @jkbradley, RandomForestSuite fails because original splits are better fit for the training data(for example, 899.5 is a split threshold, which is close to 900.) I think

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-58865777 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21685/consoleFull) for PR 2780 at commit [`9e64699`](https://github.com/ap

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-58865274 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21684/consoleFull) for PR 2780 at commit [`9e64699`](https://github.com/ap

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread chouqin
Github user chouqin commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-58865080 Jekins, retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-58857601 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-58857589 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21680/consoleFull) for PR 2780 at commit [`092efcb`](https://github.com/a

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-58853271 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21680/consoleFull) for PR 2780 at commit [`092efcb`](https://github.com/ap

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-58853083 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your pro

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-12 Thread chouqin
GitHub user chouqin opened a pull request: https://github.com/apache/spark/pull/2780 [SPARK-3207][MLLIB]Choose splits for continuous features in DecisionTree more adaptively DecisionTree splits on continuous features by choosing an array of values from a subsample of the data.