GitHub user codedeft opened a pull request:
https://github.com/apache/spark/pull/2868
[SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees.
@jkbradley @mengxr @chouqin Please review this.
You can merge this pull request into a Git repository by
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-59869002
Can one of the admins verify this patch?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your pro
Github user dbtsai commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-59871504
Jenkins, please start the test!
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-59877524
test this please
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-59877519
Jenkins, add to whitelist.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-59877748
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21959/consoleFull)
for PR 2868 at commit
[`9ea76df`](https://github.com/ap
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-59877812
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-59877810
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21959/consoleFull)
for PR 2868 at commit
[`9ea76df`](https://github.com/a
Github user codedeft commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-59878898
Seems like there are lots of "line too long" messages. I will address this.
Github user codedeft commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-59879666
test this please
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-59879975
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21965/consoleFull)
for PR 2868 at commit
[`6b05af0`](https://github.com/ap
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-59880277
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21966/consoleFull)
for PR 2868 at commit
[`13585e8`](https://github.com/ap
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-59884328
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21965/consoleFull)
for PR 2868 at commit
[`6b05af0`](https://github.com/a
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-59884334
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-59884734
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21966/consoleFull)
for PR 2868 at commit
[`13585e8`](https://github.com/a
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-59884738
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21
Github user chouqin commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19132675
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Founda
Github user chouqin commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19132973
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Founda
Github user chouqin commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-59893259
@codedeft Thanks for your nice work. I have added some comments inline.
Here are some high level comments:
1. Have you tested the performance after this change? As
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-59974518
@chouqin Checkpointing is helpful since it is more persistent than
persist(). Checkpointing stores data to HDFS (with replication), so that the
RDD is stored even if
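The difference between persist() and checkpointing described above can be sketched roughly as follows. This is only an illustration, not code from the pull request: the object name, the local temporary directory standing in for HDFS, and the toy data are all assumptions, and Spark must be on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical sketch; names and values are illustrative only.
object PersistVsCheckpointSketch {
  // Persist and checkpoint the same RDD, then report whether the checkpoint was written.
  def run(sc: SparkContext): Boolean = {
    val data: RDD[Int] = sc.parallelize(1 to 100, 4).map(_ * 2)

    // persist(): blocks are kept by the executors; a lost block is recomputed from lineage.
    data.persist(StorageLevel.MEMORY_ONLY)

    // checkpoint(): marks the RDD to be written to the checkpoint directory.
    // It must be called before the first action on this RDD.
    data.checkpoint()

    // The first action materialises the RDD and triggers the checkpoint write.
    data.count()
    data.isCheckpointed
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("persist-vs-checkpoint").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Assumption: a local temp directory; on a cluster this would be an HDFS path.
    sc.setCheckpointDir(java.nio.file.Files.createTempDirectory("ckpt").toString)
    println(s"isCheckpointed = ${run(sc)}")
    sc.stop()
  }
}
```

Persisting before checkpointing avoids recomputing the RDD when the checkpoint files are written.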
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-59975425
@codedeft Thanks for the PR! I'll make a pass now.
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19173742
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foun
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19174472
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foun
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19174918
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foun
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19175188
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foun
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19176539
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foun
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19176858
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foun
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19177555
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -515,6 +523,34 @@ object DecisionTree extends Serializable with Logging
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19177832
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -515,6 +523,34 @@ object DecisionTree extends Serializable with Logging
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19177914
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -553,7 +589,26 @@ object DecisionTree extends Serializable with Logging
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19178079
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -553,7 +589,26 @@ object DecisionTree extends Serializable with Logging
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19178115
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -553,7 +589,26 @@ object DecisionTree extends Serializable with Logging
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19178228
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -584,6 +642,9 @@ object DecisionTree extends Serializable with Logging
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19178539
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -629,6 +699,10 @@ object DecisionTree extends Serializable with Logging
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-59998408
@codedeft Done with a pass. It's looking quite good. My main comments
are about code duplication and simplification; I like the general approach.
Let me know when I
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-60013991
Can one of the admins verify this patch?
Github user codedeft commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19195461
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Found
Github user codedeft commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19195486
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Found
Github user codedeft commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19195515
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Found
Github user codedeft commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19195544
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Found
Github user codedeft commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19195587
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Found
Github user codedeft commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19195595
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -584,6 +642,9 @@ object DecisionTree extends Serializable with Logging {
Github user codedeft commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-60040201
Thanks for all the comments guys. I'll address them and resubmit.
Github user codedeft commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19195598
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -629,6 +699,10 @@ object DecisionTree extends Serializable with Logging
Github user codedeft commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19195610
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -515,6 +523,34 @@ object DecisionTree extends Serializable with Logging
Github user codedeft commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19195671
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -553,7 +589,26 @@ object DecisionTree extends Serializable with Logging
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19247637
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foun
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19247666
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foun
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19247743
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -553,7 +589,26 @@ object DecisionTree extends Serializable with Logging
Github user codedeft commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19249614
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Found
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19300415
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foun
Github user codedeft commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19306308
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Found
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-60634290
CC: @manishamde If you have time to take a look!
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-60707307
[Test build #22332 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22332/consoleFull)
for PR 2868 at commit
[`e08ef62`](https://githu
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-60711275
[Test build #22332 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22332/consoleFull)
for PR 2868 at commit
[`e08ef62`](https://gith
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-60711278
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22
Github user codedeft commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-60712109
Currently doing some performance testing.
Github user codedeft commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-60713165
Here's one number. But this requires constantly re-caching new node Id caches
and unpersisting old ones, which is not reflected in the code yet. I'm
not sure if fr
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-60721861
[Test build #22351 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22351/consoleFull)
for PR 2868 at commit
[`58a7b3e`](https://githu
Github user codedeft commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-60723953
I have submitted updated code that, at every iteration, persists the new cache
values while unpersisting the old ones.
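A minimal sketch of that persist-new, unpersist-old rotation, assuming a toy update step in place of the real node Id update. The object name, the `run` helper and the doubling map are hypothetical and are not taken from the pull request; Spark must be on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical sketch; not the code submitted in the pull request.
object CacheRotationSketch {
  // Derive a new cached RDD from the previous one at every iteration.
  def run(sc: SparkContext, iterations: Int): Array[Int] = {
    var ids: RDD[Int] =
      sc.parallelize(Seq.fill(8)(1), 2).persist(StorageLevel.MEMORY_AND_DISK)

    for (_ <- 1 to iterations) {
      val prev = ids
      // Stand-in for the real per-iteration update of the cached values.
      ids = prev.map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK)
      // Materialise the new cache while the old one is still available...
      ids.count()
      // ...then release the old one so cached copies do not pile up.
      prev.unpersist()
    }
    ids.collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cache-rotation-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    println(run(sc, 3).mkString(","))
    sc.stop()
  }
}
```

Forcing an action on the new RDD before unpersisting the old one means the new values are computed from the cached copy rather than from the full lineage.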
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-60729378
[Test build #22351 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22351/consoleFull)
for PR 2868 at commit
[`58a7b3e`](https://gith
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-60729383
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19508499
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -584,6 +648,13 @@ object DecisionTree extends Serializable with Logging
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19508498
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -613,6 +684,14 @@ object DecisionTree extends Serializable with Logging
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19508647
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foun
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19508975
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foun
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19509528
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foun
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19509616
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foun
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19509685
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foun
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19509769
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foun
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-60845877
@codedeft Thanks for the updates! I added some small comments above.
Feel free to ignore the OpenHashMap suggestion, unless you find a problem in
your tests. After
Github user codedeft commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19510109
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -613,6 +684,14 @@ object DecisionTree extends Serializable with Logging
Github user codedeft commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19510115
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Found
Github user codedeft commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19510120
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Found
Github user codedeft commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19510132
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Found
Github user codedeft commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19510480
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Found
Github user codedeft commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19510465
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Found
Github user codedeft commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19510500
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -584,6 +648,13 @@ object DecisionTree extends Serializable with Logging
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19512071
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foun
Github user codedeft commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19513979
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Found
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-60863634
Another thought: This checkpointing logic seems like it will be useful for
a bunch of algorithms in the future. It would be nice to abstract it into some
class which h
Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19518756
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundat
Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19518754
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundat
Github user codedeft commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-61026533
I've been doing some larger dataset (8 million rows with 784 features)
testing on node Id cache and I don't think that node Id cache will do much for
shallow trees. I'm
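One way to see why a node Id cache might matter little for shallow trees is the plain-Scala sketch below. It is a simplified illustration, not the NodeIdCache.scala under review: the `Split` type, the heap-style node numbering (root 1, children 2n and 2n+1) and the toy rows are assumptions made for the example.

```scala
// Hypothetical, simplified sketch; not the implementation in the pull request.
object NodeIdCacheSketch {
  // Assumption: internal nodes are keyed by id; ids missing from the map are leaves.
  final case class Split(feature: Int, threshold: Double)

  // Without a cache: route a row from the root, so the cost grows with tree depth.
  def predictNodeId(splits: Map[Int, Split], row: Array[Double]): Int = {
    var id = 1
    while (splits.contains(id)) {
      val s = splits(id)
      id = if (row(s.feature) <= s.threshold) 2 * id else 2 * id + 1
    }
    id
  }

  // With a cache: move each row down one level using only the newly chosen splits.
  def updateCache(cache: Array[Int],
                  rows: Array[Array[Double]],
                  newSplits: Map[Int, Split]): Array[Int] =
    cache.zip(rows).map { case (id, row) =>
      newSplits.get(id) match {
        case Some(s) => if (row(s.feature) <= s.threshold) 2 * id else 2 * id + 1
        case None    => id // this node was not split in this iteration
      }
    }

  def main(args: Array[String]): Unit = {
    val rows = Array(Array(0.2, 5.0), Array(0.9, 1.0), Array(0.7, 8.0))
    val level1 = Map(1 -> Split(0, 0.5))
    val level2 = Map(3 -> Split(1, 4.0))

    var cache = Array.fill(rows.length)(1) // every row starts at the root
    cache = updateCache(cache, rows, level1)
    cache = updateCache(cache, rows, level2)

    println(cache.mkString(","))                                          // prints 2,6,7
    println(rows.map(predictNodeId(level1 ++ level2, _)).mkString(","))   // prints 2,6,7
  }
}
```

Both routes give the same node ids; the cached version does constant work per row per iteration, whereas routing from the root costs one comparison per level, which is already cheap when the tree is only a few levels deep.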
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-61037009
I agree local sub-tree training will be needed to train deep trees. That
should probably be the next priority. I'm running some tests now and will see
if I see differ
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-61153266
I got similar results on 16 nodes using MNIST8m; basically no change in
runtime (+/- a few percent at most). But those tests were for shallow trees.
I worry that this
Github user codedeft commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-61155725
Yea, I'm trying to run depth 30 tests, but I got failures (both without and
with node Id cache) that seem to happen often in our clusters when using
TorrentBroadcast. Tr
Github user manishamde commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-61156969
@codedeft @jkbradley I have not followed the discussion very closely
(apologies!) but at the high level, could we add local training support along
with this PR possibl
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-61165010
I too hope caching will be useful later on. One last thing I'm trying is
running locally (on a beefier machine than my laptop). If it helps in local
mode, it might be
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-61167709
I ran some local tests but did not see any speedups. This was trying to
mimic your earlier test:
* original mnist dataset
* depths 5, 10, 20, and 30
* 1 com
Github user codedeft commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-61170031
Hm, I see. I'll try testing again on the small mnist but my previous test
was on a cluster with 8 executors. However, I realize now that it probably only
utilized 2 our
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-61173555
I agree it probably only used 2 executors since there were only 2
partitions for the data. (I think reduceByKey uses the same partitioner by
default.)
I think
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-61179636
[Test build #22565 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22565/consoleFull)
for PR 2868 at commit
[`54656c5`](https://githu
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-61188389
[Test build #22565 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22565/consoleFull)
for PR 2868 at commit
[`54656c5`](https://gith
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-61188396
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22
Github user codedeft commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-61189986
Ok, my performance test on the small mnist is still consistent (100 trees,
30 depth limit). I think that the big reason for this is that when it's
actually running in a
Github user codedeft commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-61190259
I've addressed the comments. Please review at your convenience. I'll
publish some big data results once they are actually done.
Thanks!
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-61212783
Your test analysis is pretty convincing! Keeping the PR sounds good.
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19686161
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala ---
@@ -0,0 +1,203 @@
+/*
+ * Licensed to the Apache Software Foun
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2868#discussion_r19686166
--- Diff:
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
---
@@ -102,6 +105,15 @@ object DecisionTreeRunner {