[GitHub] spark pull request: [SPARK-1892][MLLIB] Adding OWL-QN optimizer fo...

2015-03-05 Thread codedeft
Github user codedeft closed the pull request at: https://github.com/apache/spark/pull/840 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is

[GitHub] spark pull request: [SPARK-4197] [mllib] GradientBoosting API clea...

2014-11-04 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/3094#issuecomment-61769695 @jkbradley @manishamde @mengxr This is probably not the right place to communicate this. But FYI, I created a separate story for refining tree predictions for GB

[GitHub] spark pull request: [SPARK-4197] [mllib] GradientBoosting API clea...

2014-11-04 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/3094#issuecomment-61762230 Sounds good. I'll create a story for this. In addition to using internal formats for more efficiency, perhaps there are also some minor things su

[GitHub] spark pull request: [mllib] GradientBoosting API cleanup and examp...

2014-11-04 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/3094#issuecomment-61753980 @jkbradley @manishamde Is there a story for TreeBoost improvement for Gradient Boosting? TreeBoosting basically improves the gradient estimation at each iteration by re

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-11-02 Thread codedeft
Github user codedeft closed the pull request at: https://github.com/apache/spark/pull/2868

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-11-01 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61375497 It finally finished. 10 Trees, 30 depth limit. mnist8m, 20 executors: 15 hours with node Id cache. 21 hours without node Id cache.

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-31 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61358798 @mengxr @jkbradley Can you merge this? This is the only way you can effectively train 10 large trees with the mnist8m dataset. With node Id cache, it took a

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-31 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61358267 The conflict is caused by the GBoosting check-in. I'm taking a look.

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-31 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61335866 Yea, I'm also getting Yarn compilation failure on my machine after doing the latest pull. Is this happening everywhere?

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-30 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61190259 I've addressed the comments. Please review at your convenience. I'll publish some big data results once they are actually done. Thanks!

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-30 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61189986 Ok, my performance test on the small mnist is still consistent (100 trees, 30 depth limit). I think that the big reason for this is that when it's actually running

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-30 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61170031 Hm, I see. I'll try testing again on the small mnist but my previous test was on a cluster with 8 executors. However, I realize now that it probably only utilized

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-30 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61155725 Yea, I'm trying to run depth 30 tests, but I got failures (both without and with node Id cache) that seem to happen often in our clusters when using TorrentBroa

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-29 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-61026533 I've been doing some larger dataset (8 million rows with 784 features) testing on node Id cache and I don't think that node Id cache will do much for shallow

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2607#discussion_r19570062 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -26,7 +26,7 @@ import

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2607#discussion_r19569497 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala --- @@ -0,0 +1,433 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2607#discussion_r19567808 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala --- @@ -0,0 +1,433 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2607#discussion_r19567364 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -26,7 +26,7 @@ import

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2607#discussion_r19563689 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -26,7 +26,7 @@ import

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2607#discussion_r19563516 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala --- @@ -0,0 +1,433 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19513979 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19510500 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -584,6 +648,13 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19510465 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19510480 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19510132 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19510120 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19510115 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19510109 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -613,6 +684,14 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-28 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-60723953 I've submitted updated code that, at every iteration, persists the new cache values while unpersisting the old ones.

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-27 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-60713165 Here's one number. But this requires constant re-caching new node Id caches and unpersisting old node Id caches that is not reflected in the code yet. I'm n

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-27 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-60712109 Currently doing some performance testing.

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-23 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19306308 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-22 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19249614 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19195671 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -553,7 +589,26 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19195610 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -515,6 +523,34 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19195598 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -629,6 +699,10 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-60040201 Thanks for all the comments guys. I'll address them and resubmit.

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19195595 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -584,6 +642,9 @@ object DecisionTree extends Serializable with Logging

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19195587 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19195544 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19195515 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19195486 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-21 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2868#discussion_r19195461 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala --- @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59879666 test this please

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59878898 Seems like there are lots of "line too long" messages. Will address this.

[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread codedeft
GitHub user codedeft opened a pull request: https://github.com/apache/spark/pull/2868 [SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees. @jkbradley @mengxr @chouquin Please review this. You can merge this pull request into a Git repository by

[GitHub] spark pull request: [SPARK-1892][MLLIB] Adding OWL-QN optimizer fo...

2014-09-30 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/840#issuecomment-57415563 @debasish83 Yes. Or at least back when I tested it 4 months ago ;(

[GitHub] spark pull request: [SPARK-1892][MLLIB] Adding OWL-QN optimizer fo...

2014-09-30 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/840#issuecomment-57413615 @debasish83 We fixed the previously broken Breeze OWLQN in Breeze 0.8 and we know that the new Breeze OWLQN works as expected. However, this particular PR does not

[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests

2014-09-24 Thread codedeft
Github user codedeft commented on a diff in the pull request: https://github.com/apache/spark/pull/2435#discussion_r18009800 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala --- @@ -128,13 +139,34 @@ private[tree] object

[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests

2014-09-17 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2435#issuecomment-55976071 Additionally, I suppose allowing the actual size for feature subset as an input would be useful in model-search later on.

[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests

2014-09-17 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2435#issuecomment-55975519 @jkbradley I guess that I don't have a particular preference, (either fraction or the actual number). The actual number seems a bit better to me since you are not

[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests

2014-09-17 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2435#issuecomment-55974486 @jkbradley Thanks Joseph. It makes sense. It looks good upon very rough browsing. Some minor things: * Would be nice to have support for without

[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests

2014-09-17 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2435#issuecomment-55971575 @jkbradley I don't quite get what different columns in result numbers mean. Do you mean that you are still training exactly the same single tree (to depth

[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests

2014-09-17 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/2435#issuecomment-55967377 Hi Joseph, I'll take a look when I can, but this is a massive PR, so I'm not sure if I'll have time to go through this thoroughly. *

[GitHub] spark pull request: Adding OWL-QN optimizer for L1 regularizations...

2014-05-23 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/840#issuecomment-44043910 Done!

[GitHub] spark pull request: Adding OWL-QN optimizer for L1 regularizations...

2014-05-22 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/840#issuecomment-43957905 Breeze has been updated to 0.8. This should now work.

[GitHub] spark pull request: Adding OWL-QN optimizer for L1 regularizations...

2014-05-20 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/840#issuecomment-43667582 I'll try to get David to publish the latest breeze and change the project file to reference the latest breeze.

[GitHub] spark pull request: Adding OWL-QN optimizer for L1 regularizations...

2014-05-20 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/840#issuecomment-43666271 To clarify - it requires the latest breeze. The OWL-QN in breeze had bugs, which I fixed. I'm not sure if David's published an official release yet but i

[GitHub] spark pull request: Adding OWL-QN optimizer for L1 regularizations...

2014-05-20 Thread codedeft
Github user codedeft commented on the pull request: https://github.com/apache/spark/pull/840#issuecomment-43665097 jira link : https://issues.apache.org/jira/browse/SPARK-1892

[GitHub] spark pull request: Adding OWL-QN optimizer for L1 regularizations...

2014-05-20 Thread codedeft
GitHub user codedeft opened a pull request: https://github.com/apache/spark/pull/840 Adding OWL-QN optimizer for L1 regularizations. It can also handle L2 re... Adding OWL-QN optimizer for L1 regularizations. It can also handle L2 and L1 regularizations together (balanced with