[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-10-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/8246


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-10-07 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-146383382
  
Merging with master.  Would you be able to make a similar change for the 
spark.ml implementation?  Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-10-07 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-146366058
  
  [Test build #1855 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1855/console)
 for   PR 8246 at commit 
[`abdcc47`](https://github.com/apache/spark/commit/abdcc47bd69624643908fb1924b94f22659e78f5).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-10-07 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-146353045
  
  [Test build #1855 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1855/consoleFull)
 for   PR 8246 at commit 
[`abdcc47`](https://github.com/apache/spark/commit/abdcc47bd69624643908fb1924b94f22659e78f5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-10-07 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-146352937
  
LGTM pending tests


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-10-07 Thread NathanHowell
Github user NathanHowell commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-146348263
  
There were already tests for the returned split lengths, so I just removed 
the metadata checks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-10-06 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-146073328
  
OK thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-10-06 Thread NathanHowell
Github user NathanHowell commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-146009085
  
I'll have time tomorrow


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-10-06 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-146007438
  
Would you mind updating this?  If you don't have time, then let me know, 
and I'll send a PR to your PR.  Thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-16 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-140873775
  
I think it's because I asked you to remove the metadata update from that 
method which now runs on the worker.  The test could be modified to look at the 
length of the returned splits.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-12 Thread NathanHowell
Github user NathanHowell commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-139792166
  
I tend to rebase out of habit to prevent merge-build failures. I'll look at 
the test failure on Monday, they were all passing at one point.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-139610433
  
  [Test build #1743 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1743/console)
 for   PR 8246 at commit 
[`93c9e71`](https://github.com/apache/spark/commit/93c9e71f6c62b122647c4bdd2f9dead710ae360f).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-139603660
  
  [Test build #1743 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1743/consoleFull)
 for   PR 8246 at commit 
[`93c9e71`](https://github.com/apache/spark/commit/93c9e71f6c62b122647c4bdd2f9dead710ae360f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-11 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-139603364
  
Did you have to rebase b/c of merge conflicts?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-11 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-139603240
  
ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-10 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request:

https://github.com/apache/spark/pull/8246#discussion_r39216747
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -1056,6 +988,70 @@ object DecisionTree extends Serializable with Logging 
{
 }
   }
 
+  private def findSplitsBinsBySorting(
+  input: RDD[LabeledPoint],
+  metadata: DecisionTreeMetadata,
+  continuousFeatures: IndexedSeq[Int]): (Array[Array[Split]], 
Array[Array[Bin]]) = {
+def findSplits(
+featureIndex: Int,
+featureSamples: Iterable[Double]): (Int, (Array[Split], 
Array[Bin])) = {
+  val splits = {
+val featureSplits = findSplitsForContinuousFeature(
+  featureSamples.toArray,
+  metadata,
+  featureIndex)
+logDebug(s"featureIndex = $featureIndex, numSplits = 
${featureSplits.length}")
+
+featureSplits.map(threshold => new Split(featureIndex, threshold, 
Continuous, Nil))
+  }
+
+  val bins = {
+val lowSplit = new DummyLowSplit(featureIndex, Continuous)
+val highSplit = new DummyHighSplit(featureIndex, Continuous)
+(lowSplit +: splits.toSeq :+ highSplit).sliding(2).map {
+  case Seq(lhs, right) => new Bin(lhs, right, Continuous, 
Double.MinValue)
+}.toArray
+  }
+
+  (featureIndex, (splits, bins))
+}
+
+val continuousSplits = input
+  .flatMap(point => continuousFeatures.map(idx => (idx, 
point.features(idx
+  .groupByKey(numPartitions = math.min(continuousFeatures.length, 
input.partitions.length))
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-10 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request:

https://github.com/apache/spark/pull/8246#discussion_r39216763
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -1056,6 +988,70 @@ object DecisionTree extends Serializable with Logging 
{
 }
   }
 
+  private def findSplitsBinsBySorting(
+  input: RDD[LabeledPoint],
+  metadata: DecisionTreeMetadata,
+  continuousFeatures: IndexedSeq[Int]): (Array[Array[Split]], 
Array[Array[Bin]]) = {
+def findSplits(
+featureIndex: Int,
+featureSamples: Iterable[Double]): (Int, (Array[Split], 
Array[Bin])) = {
+  val splits = {
+val featureSplits = findSplitsForContinuousFeature(
+  featureSamples.toArray,
+  metadata,
+  featureIndex)
+logDebug(s"featureIndex = $featureIndex, numSplits = 
${featureSplits.length}")
+
+featureSplits.map(threshold => new Split(featureIndex, threshold, 
Continuous, Nil))
+  }
+
+  val bins = {
+val lowSplit = new DummyLowSplit(featureIndex, Continuous)
+val highSplit = new DummyHighSplit(featureIndex, Continuous)
+(lowSplit +: splits.toSeq :+ highSplit).sliding(2).map {
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-10 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request:

https://github.com/apache/spark/pull/8246#discussion_r39216729
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -1056,6 +988,70 @@ object DecisionTree extends Serializable with Logging 
{
 }
   }
 
+  private def findSplitsBinsBySorting(
+  input: RDD[LabeledPoint],
+  metadata: DecisionTreeMetadata,
+  continuousFeatures: IndexedSeq[Int]): (Array[Array[Split]], 
Array[Array[Bin]]) = {
+def findSplits(
+featureIndex: Int,
+featureSamples: Iterable[Double]): (Int, (Array[Split], 
Array[Bin])) = {
+  val splits = {
+val featureSplits = findSplitsForContinuousFeature(
+  featureSamples.toArray,
+  metadata,
+  featureIndex)
+logDebug(s"featureIndex = $featureIndex, numSplits = 
${featureSplits.length}")
+
+featureSplits.map(threshold => new Split(featureIndex, threshold, 
Continuous, Nil))
+  }
+
+  val bins = {
+val lowSplit = new DummyLowSplit(featureIndex, Continuous)
+val highSplit = new DummyHighSplit(featureIndex, Continuous)
+(lowSplit +: splits.toSeq :+ highSplit).sliding(2).map {
+  case Seq(lhs, right) => new Bin(lhs, right, Continuous, 
Double.MinValue)
+}.toArray
+  }
+
+  (featureIndex, (splits, bins))
+}
+
+val continuousSplits = input
+  .flatMap(point => continuousFeatures.map(idx => (idx, 
point.features(idx
+  .groupByKey(numPartitions = math.min(continuousFeatures.length, 
input.partitions.length))
+  .map { case (k, v) => findSplits(k, v) }
+  .collectAsMap()
+
+val numFeatures = metadata.numFeatures
+val (splits, bins) = Range(0, numFeatures).unzip {
+  case i if metadata.isContinuous(i) =>
+val (split, bin) = continuousSplits(i)
+metadata.setNumSplits(i, split.length)
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-10 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-139350246
  
I see.  I guess this is the issue: 
[https://issues.apache.org/jira/browse/SPARK-8418]  We can discuss more there 
if it's helpful.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-10 Thread NathanHowell
Github user NathanHowell commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-139349170
  
@jkbradley i was looking at `Bucketizer`, which seemed to be closest to 
what i wanted (cumulative density transform), and i certainly could have been 
doing something wrong... it's not relevant to this ticket though.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-10 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-139346952
  
You should be able to assemble all of the features into one feature vector 
(using VectorAssembler) and then applying Normalizer, StandardScaler, etc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-10 Thread NathanHowell
Github user NathanHowell commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-139345239
  
@jkbradley it's not that uncommon to use sparse features for time series 
classification (sparse interval features) with RF.

training a linear model with spark.ml's pipeline is (was?) prohibitively 
expensive for this dataset. normalizing features using the spark.ml's 
transforms generates two stages per feature... rather than two stages total 
(the first for density estimation and a second for the bucket transform) but 
that sort of defeats the purpose of using `ml` instead of `mllib`..


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-10 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-139341684
  
Considering sparse representations would be nice, though I think that's a 
less common use case for trees (vs GLMs).  The other (bigger) issue is 
communication (which is what my new implementation will address).  The main 
difference between MLlib and PLANET right now is that MLlib does not switch to 
local learning for deep layers of the tree.

For comparing, I'm mainly curious about the total number of nodes in the 
tree, just to make sure they are fairly similar.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-10 Thread NathanHowell
Github user NathanHowell commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-139339009
  
@jkbradley yeah, I figured out as much. a lot of the internals also assume 
dense feature space, which isn't very common with a large number of features.

i tried it with `useNodeIdCache` on and off and it didn't make much of a 
difference. checkpointing was also enabled... the dataset fits in memory though 
(8TB was allocated to RDD storage), i don't know how much that changes things.

the lower levels of the trees (near the roots) look similar but i haven't 
done a thorough comparison of the two.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-10 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-139337078
  
@NathanHowell  The real problem for MLlib is having a large number of 
features; I'd say a few thousand is a reasonable limit right now.  I'm working 
on a new implementation which should be much faster for more features, 
scheduled for 1.6.

Curious: When you ran MLlib, did you try useNodeIdCache, with checkpointing 
turned on?  That should help a little.

Also, do you know if the 2 implementations learned similarly sized trees?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-10 Thread NathanHowell
Github user NathanHowell commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-139334977
  
@manishamde yes, same parameters. this dataset is about 100m examples, not 
sure offhand on the exact number of features but probably about 5k categorical 
features (typically with single digit arity) and 100-200k continuous features.

the binning slowdown can also be avoided by using a one-hot encoding but 
it's not convenient. having per feature bin counts would be nice, and it can 
(probably?) be ignored for categorical features altogether.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-10 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-139331864
  
@NathanHowell @jkbradley We should consider making bins per feature and 
sample sizes configurable to avoid the side-effects mentioned above.

Did the brushfire implementation also have similar number of splits and 
tree depth? Also, could you describe the training data a little more -- number 
of samples, number of categorical/continuous features, etc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-10 Thread NathanHowell
Github user NathanHowell commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-139323929
  
@jkbradley yes, this fixed a real problem. the problem stemmed from 
attempting to use categorical features with high arity (250+), which has a side 
effect of increasing the number of bins used for continuous features and 
massively increasing the sample size. the sampling and bin calculation would 
take multiple hours. after the patch it completes in ~2 minutes.

training itself would run okay but also wasn't terribly fast. I've had to 
switch to a different RF implementation (https://github.com/stripe/brushfire) 
in order to have training complete in a reasonable amount of time. the mllib 
implementation would train a single tree in 3-4 days on 2000 executors, whereas 
brushfire does the same in about 6 hours with per-split bin calculation and 
about 45 minutes with log bucketing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-10 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/8246#discussion_r39185342
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -1056,6 +988,70 @@ object DecisionTree extends Serializable with Logging 
{
 }
   }
 
+  private def findSplitsBinsBySorting(
+  input: RDD[LabeledPoint],
+  metadata: DecisionTreeMetadata,
+  continuousFeatures: IndexedSeq[Int]): (Array[Array[Split]], 
Array[Array[Bin]]) = {
+def findSplits(
+featureIndex: Int,
+featureSamples: Iterable[Double]): (Int, (Array[Split], 
Array[Bin])) = {
+  val splits = {
+val featureSplits = findSplitsForContinuousFeature(
+  featureSamples.toArray,
+  metadata,
+  featureIndex)
+logDebug(s"featureIndex = $featureIndex, numSplits = 
${featureSplits.length}")
+
+featureSplits.map(threshold => new Split(featureIndex, threshold, 
Continuous, Nil))
+  }
+
+  val bins = {
+val lowSplit = new DummyLowSplit(featureIndex, Continuous)
+val highSplit = new DummyHighSplit(featureIndex, Continuous)
+(lowSplit +: splits.toSeq :+ highSplit).sliding(2).map {
+  case Seq(lhs, right) => new Bin(lhs, right, Continuous, 
Double.MinValue)
+}.toArray
+  }
+
+  (featureIndex, (splits, bins))
+}
+
+val continuousSplits = input
+  .flatMap(point => continuousFeatures.map(idx => (idx, 
point.features(idx
+  .groupByKey(numPartitions = math.min(continuousFeatures.length, 
input.partitions.length))
+  .map { case (k, v) => findSplits(k, v) }
+  .collectAsMap()
+
+val numFeatures = metadata.numFeatures
+val (splits, bins) = Range(0, numFeatures).unzip {
+  case i if metadata.isContinuous(i) =>
+val (split, bin) = continuousSplits(i)
+metadata.setNumSplits(i, split.length)
--- End diff --

Can you please remove the similar call to this in 
findSplitsForContinuousFeature since that will now be run on workers?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-10 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-139311498
  
@NathanHowell I just looked this over, and it seems fine.  A few questions:

Did you encounter this problem in practice?  I would have thought that 
choosing splits would take much less time than training itself.  Was training 
itself fast for you?

Note: There are currently 2 tree implementations, one in spark.mllib (which 
you're modifying) and one in spark.ml.  I eventually want to remove the 
spark.mllib one and wrap the spark.ml one 
[https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala].
  We can commit this here and then copy it over to spark.ml in another PR.

Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-08-28 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/8246#discussion_r38249912
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -1056,6 +988,70 @@ object DecisionTree extends Serializable with Logging 
{
 }
   }
 
+  private def findSplitsBinsBySorting(
+  input: RDD[LabeledPoint],
+  metadata: DecisionTreeMetadata,
+  continuousFeatures: IndexedSeq[Int]): (Array[Array[Split]], 
Array[Array[Bin]]) = {
+def findSplits(
+featureIndex: Int,
+featureSamples: Iterable[Double]): (Int, (Array[Split], 
Array[Bin])) = {
+  val splits = {
+val featureSplits = findSplitsForContinuousFeature(
+  featureSamples.toArray,
+  metadata,
+  featureIndex)
+logDebug(s"featureIndex = $featureIndex, numSplits = 
${featureSplits.length}")
+
+featureSplits.map(threshold => new Split(featureIndex, threshold, 
Continuous, Nil))
+  }
+
+  val bins = {
+val lowSplit = new DummyLowSplit(featureIndex, Continuous)
+val highSplit = new DummyHighSplit(featureIndex, Continuous)
+(lowSplit +: splits.toSeq :+ highSplit).sliding(2).map {
+  case Seq(lhs, right) => new Bin(lhs, right, Continuous, 
Double.MinValue)
+}.toArray
+  }
+
+  (featureIndex, (splits, bins))
+}
+
+val continuousSplits = input
+  .flatMap(point => continuousFeatures.map(idx => (idx, 
point.features(idx
+  .groupByKey(numPartitions = math.min(continuousFeatures.length, 
input.partitions.length))
--- End diff --

numPartitions logic could be extracted out in a separate line and a small 
explanation would help.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-08-28 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/8246#discussion_r38249715
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -1056,6 +988,70 @@ object DecisionTree extends Serializable with Logging 
{
 }
   }
 
+  private def findSplitsBinsBySorting(
+  input: RDD[LabeledPoint],
+  metadata: DecisionTreeMetadata,
+  continuousFeatures: IndexedSeq[Int]): (Array[Array[Split]], 
Array[Array[Bin]]) = {
+def findSplits(
+featureIndex: Int,
+featureSamples: Iterable[Double]): (Int, (Array[Split], 
Array[Bin])) = {
+  val splits = {
+val featureSplits = findSplitsForContinuousFeature(
+  featureSamples.toArray,
+  metadata,
+  featureIndex)
+logDebug(s"featureIndex = $featureIndex, numSplits = 
${featureSplits.length}")
+
+featureSplits.map(threshold => new Split(featureIndex, threshold, 
Continuous, Nil))
+  }
+
+  val bins = {
+val lowSplit = new DummyLowSplit(featureIndex, Continuous)
+val highSplit = new DummyHighSplit(featureIndex, Continuous)
+(lowSplit +: splits.toSeq :+ highSplit).sliding(2).map {
--- End diff --

A comment here will be helpful.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-08-28 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-135866073
  
Thanks @NathanHowell Sorry for not responding earlier. Will try to review 
soon.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-08-18 Thread NathanHowell
Github user NathanHowell commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-132293317
  
Will do, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-08-18 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-132291972
  
This sounds great, thanks!  I'll need to finish up with QA for 1.5 before 
taking a look, but please ping me if I don't return to review before long.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-08-18 Thread NathanHowell
Github user NathanHowell commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-132283376
  
cc/ @jkbradley @manishamde @mengxr


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8246#issuecomment-131931086
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-08-17 Thread NathanHowell
GitHub user NathanHowell opened a pull request:

https://github.com/apache/spark/pull/8246

[SPARK-10064] [ML] Parallelize decision tree bin split calculations

Reimplement `DecisionTree.findSplitsBins` via `RDD` to parallelize bin 
calculation.

With large feature spaces the current implementation is very slow. This 
change limits the features that are distributed (or collected) to just the 
continuous features, and performs the split calculations in parallel. It 
completes on a real multi terabyte dataset in less than a minute instead of 
multiple hours.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/NathanHowell/spark SPARK-10064

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/8246.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #8246


commit 93c9e71f6c62b122647c4bdd2f9dead710ae360f
Author: Nathan Howell 
Date:   2015-08-14T23:39:20Z

[SPARK-10064] [ML] Parallelize decision tree bin split calculations




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org