[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread codedeft
GitHub user codedeft opened a pull request: https://github.com/apache/spark/pull/2868 [SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees. @jkbradley @mengxr @chouquin Please review this.

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59869002 Can one of the admins verify this patch?

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread dbtsai
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59871504 Jenkins, please start the test!

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59877524 test this please

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59877519 Jenkins, add to whitelist.

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59877748 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21959/consoleFull) for PR 2868 at commit [`9ea76df`](https://github.com/ap

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59877812 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59877810 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21959/consoleFull) for PR 2868 at commit [`9ea76df`](https://github.com/a

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59878898 Seems like there are lots of "line too long" messages. Will address this.

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59879666 test this please

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59879975 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21965/consoleFull) for PR 2868 at commit [`6b05af0`](https://github.com/ap

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59880277 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21966/consoleFull) for PR 2868 at commit [`13585e8`](https://github.com/ap

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59884328 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21965/consoleFull) for PR 2868 at commit [`6b05af0`](https://github.com/a

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59884334 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59884734 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21966/consoleFull) for PR 2868 at commit [`13585e8`](https://github.com/a

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59884738 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread chouqin
Github user chouqin commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19132675 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Founda

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread chouqin
Github user chouqin commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19132973 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Founda

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread chouqin
Github user chouqin commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59893259 @codedeft Thanks for your nice work. I have added some comments inline. Here are some high-level comments: 1. Have you tested the performance after this change? As

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59974518 @chouqin Checkpointing is helpful since it is more persistent than persist(). Checkpointing stores data to HDFS (with replication), so that the RDD is stored even if
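
The comment above is cut off in this archive, so the following is only a toy illustration of the general distinction it draws between persist() and checkpointing. It does not use Spark: `ToyRDD`, `lose_executor`, and `recover` are hypothetical names invented for this sketch, and "reliable storage" is modelled as a plain attribute. The assumption is that a persisted copy disappears with the executor holding it, while a checkpointed copy can be read back without replaying the lineage.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD derived from an optional parent."""

    def __init__(self, parent=None):
        self.parent = parent
        self.memory_copy = None      # set by persist(); lost if the executor dies
        self.checkpoint_copy = None  # set by checkpoint(); modelled as reliable storage

    def _derive(self, stats):
        # Recompute this dataset from its parent (one lineage step).
        stats["recomputed"] += 1
        base = self.parent.get(stats) if self.parent else [0, 0, 0]
        return [x + 1 for x in base]

    def get(self, stats):
        if self.memory_copy is not None:
            return self.memory_copy
        if self.checkpoint_copy is not None:
            return self.checkpoint_copy
        return self._derive(stats)

    def persist(self, stats):
        self.memory_copy = self.get(stats)
        return self

    def checkpoint(self, stats):
        self.checkpoint_copy = list(self.get(stats))
        return self

    def lose_executor(self):
        self.memory_copy = None


def recover(length, checkpoint_last):
    """Build a chain of persisted datasets, drop all in-memory copies,
    then return (final data, lineage steps recomputed to get it back)."""
    stats = {"recomputed": 0}
    chain, rdd = [], None
    for _ in range(length):
        rdd = ToyRDD(rdd).persist(stats)
        chain.append(rdd)
    if checkpoint_last:
        rdd.checkpoint(stats)
    for r in chain:
        r.lose_executor()
    stats["recomputed"] = 0
    return rdd.get(stats), stats["recomputed"]


without_checkpoint = recover(5, checkpoint_last=False)
with_checkpoint = recover(5, checkpoint_last=True)
```

In this toy model, losing the in-memory copies forces every step of the chain to be recomputed unless the last dataset was checkpointed. That is the kind of cost an iterative algorithm with a long lineage would want to avoid.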

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59975425 @codedeft Thanks for the PR! I'll make a pass now.

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19173742 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19174472 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19174918 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19175188 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19176539 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19176858 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19177555 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -515,6 +523,34 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19177832 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -515,6 +523,34 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19177914 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -553,7 +589,26 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19178079 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -553,7 +589,26 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19178115 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -553,7 +589,26 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19178228 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -584,6 +642,9 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19178539 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -629,6 +699,10 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59998408 @codedeft Done with a pass. It's looking quite good. My main comments are about code duplication and simplification; I like the general approach. Let me know when I

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-60013991 Can one of the admins verify this patch?

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19195461 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19195486 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19195515 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19195544 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19195587 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19195595 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -584,6 +642,9 @@ object DecisionTree extends Serializable with Logging {

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-60040201 Thanks for all the comments guys. I'll address them and resubmit.

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19195598 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -629,6 +699,10 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19195610 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -515,6 +523,34 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19195671 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -553,7 +589,26 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-22 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19247637 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-22 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19247666 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-22 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19247743 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -553,7 +589,26 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-22 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19249614 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19300415 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-23 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19306308 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-27 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-60634290 CC: @manishamde If you have time to take a look!

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-27 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-60707307 [Test build #22332 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22332/consoleFull) for PR 2868 at commit [`e08ef62`](https://githu

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-27 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-60711275 [Test build #22332 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22332/consoleFull) for PR 2868 at commit [`e08ef62`](https://gith

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-60711278 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-27 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-60712109 Currently doing some performance testing.

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-27 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-60713165 Here's one number. But this requires constantly re-caching new node Id caches and unpersisting old ones, which is not reflected in the code yet. I'm not sure if fr

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-60721861 [Test build #22351 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22351/consoleFull) for PR 2868 at commit [`58a7b3e`](https://githu

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-60723953 I have submitted updated code that, at every iteration, persists new cache values while unpersisting old ones.
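
The persist-new, unpersist-old rotation described in this comment can be sketched roughly as below. This is not the code from the pull request. `ToyCache` is a hypothetical stand-in that only tracks which cache versions are currently held, and the doubling of node ids is a placeholder for whatever update the real cache performs.

```python
class ToyCache:
    """Hypothetical stand-in for a persisted node Id cache."""

    live = set()  # versions currently held in (simulated) storage
    peak = 0      # largest number of versions held at the same time

    def __init__(self, version, node_ids):
        self.version = version
        self.node_ids = node_ids

    def persist(self):
        ToyCache.live.add(self.version)
        ToyCache.peak = max(ToyCache.peak, len(ToyCache.live))
        return self

    def unpersist(self):
        ToyCache.live.discard(self.version)


current = ToyCache(0, [1, 1, 1]).persist()
for iteration in range(1, 4):
    # Derive the new cache while the old one is still available...
    updated = ToyCache(iteration, [2 * n for n in current.node_ids]).persist()
    # ...then release the old one so stale versions do not pile up.
    current.unpersist()
    current = updated
```

With this ordering at most two versions are held at once, and only the latest one remains after the loop. Releasing the old version before the new one is available would instead leave a window with no usable cache.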

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-60729378 [Test build #22351 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22351/consoleFull) for PR 2868 at commit [`58a7b3e`](https://gith

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-60729383 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19508499 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -584,6 +648,13 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19508498 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -613,6 +684,14 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19508647 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19508975 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19509528 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19509616 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19509685 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19509769 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-60845877 @codedeft Thanks for the updates! I added some small comments above. Feel free to ignore the OpenHashMap suggestion, unless you find a problem in your tests. After

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19510109 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -613,6 +684,14 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19510115 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19510120 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19510132 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19510480 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19510465 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19510500 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -584,6 +648,13 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19512071 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19513979 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-60863634 Another thought: This checkpointing logic seems like it will be useful for a bunch of algorithms in the future. It would be nice to abstract it into some class which h

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19518756 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foundat

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19518754 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foundat

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-29 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61026533 I've been testing the node Id cache on a larger dataset (8 million rows with 784 features), and I don't think that the node Id cache will do much for shallow trees. I'm

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-29 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61037009 I agree local sub-tree training will be needed to train deep trees. That should probably be the next priority. I'm running some tests now and will see if I see differ

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-30 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61153266 I got similar results on 16 nodes using MNIST8m; basically no change in runtime (+/- a few percent at most). But those tests were for shallow trees. I worry that this

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-30 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61155725 Yeah, I'm trying to run depth 30 tests, but I got failures (both with and without the node Id cache) that seem to happen often in our clusters when using TorrentBroadcast. Tr

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-30 Thread manishamde
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61156969 @codedeft @jkbradley I have not followed the discussion very closely (apologies!) but at a high level, could we add local training support along with this PR possibl

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-30 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61165010 I too hope caching will be useful later on. One last thing I'm trying is running locally (on a beefier machine than my laptop). If it helps in local mode, it might be

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-30 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61167709 I ran some local tests but did not see any speedups. This was trying to mimic your earlier test: * original MNIST dataset * depths 5, 10, 20, and 30 * 1 com

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-30 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61170031 Hm, I see. I'll try testing again on the small MNIST dataset, but my previous test was on a cluster with 8 executors. However, I realize now that it probably only utilized 2 our

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-30 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61173555 I agree it probably only used 2 executors since there were only 2 partitions for the data. (I think reduceByKey uses the same partitioner by default.) I think

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-30 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61179636 [Test build #22565 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22565/consoleFull) for PR 2868 at commit [`54656c5`](https://githu

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-30 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61188389 [Test build #22565 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22565/consoleFull) for PR 2868 at commit [`54656c5`](https://gith

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61188396 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-30 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61189986 Ok, my performance test on the small MNIST dataset is still consistent (100 trees, depth limit 30). I think that the big reason for this is that when it's actually running in a

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-30 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61190259 I've addressed the comments. Please review at your convenience. I'll publish some big data results once they are actually done. Thanks!

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-30 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61212783 Your test analysis is pretty convincing! Keeping the PR sounds good.

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-31 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19686161 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,203 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-31 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19686166 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -102,6 +105,15 @@ object DecisionTreeRunner {
