[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/8246 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-146383382 Merging with master. Would you be able to make a similar change for the spark.ml implementation? Thank you! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-146366058 [Test build #1855 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1855/console) for PR 8246 at commit [`abdcc47`](https://github.com/apache/spark/commit/abdcc47bd69624643908fb1924b94f22659e78f5). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-146353045 [Test build #1855 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1855/consoleFull) for PR 8246 at commit [`abdcc47`](https://github.com/apache/spark/commit/abdcc47bd69624643908fb1924b94f22659e78f5). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-146352937 LGTM pending tests --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user NathanHowell commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-146348263 There were already tests for the returned split lengths, so I just removed the metadata checks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-146073328 OK thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user NathanHowell commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-146009085 I'll have time tomorrow --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-146007438 Would you mind updating this? If you don't have time, then let me know, and I'll send a PR to your PR. Thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-140873775 I think it's because I asked you to remove the metadata update from that method which now runs on the worker. The test could be modified to look at the length of the returned splits. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user NathanHowell commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-139792166 I tend to rebase out of habit to prevent merge-build failures. I'll look at the test failure on Monday, they were all passing at one point. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-139610433 [Test build #1743 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1743/console) for PR 8246 at commit [`93c9e71`](https://github.com/apache/spark/commit/93c9e71f6c62b122647c4bdd2f9dead710ae360f). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-139603660 [Test build #1743 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1743/consoleFull) for PR 8246 at commit [`93c9e71`](https://github.com/apache/spark/commit/93c9e71f6c62b122647c4bdd2f9dead710ae360f). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-139603364 Did you have to rebase b/c of merge conflicts? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-139603240 ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/8246#discussion_r39216747 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -1056,6 +988,70 @@ object DecisionTree extends Serializable with Logging { } } + private def findSplitsBinsBySorting( + input: RDD[LabeledPoint], + metadata: DecisionTreeMetadata, + continuousFeatures: IndexedSeq[Int]): (Array[Array[Split]], Array[Array[Bin]]) = { +def findSplits( +featureIndex: Int, +featureSamples: Iterable[Double]): (Int, (Array[Split], Array[Bin])) = { + val splits = { +val featureSplits = findSplitsForContinuousFeature( + featureSamples.toArray, + metadata, + featureIndex) +logDebug(s"featureIndex = $featureIndex, numSplits = ${featureSplits.length}") + +featureSplits.map(threshold => new Split(featureIndex, threshold, Continuous, Nil)) + } + + val bins = { +val lowSplit = new DummyLowSplit(featureIndex, Continuous) +val highSplit = new DummyHighSplit(featureIndex, Continuous) +(lowSplit +: splits.toSeq :+ highSplit).sliding(2).map { + case Seq(lhs, right) => new Bin(lhs, right, Continuous, Double.MinValue) +}.toArray + } + + (featureIndex, (splits, bins)) +} + +val continuousSplits = input + .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx + .groupByKey(numPartitions = math.min(continuousFeatures.length, input.partitions.length)) --- End diff -- Done. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/8246#discussion_r39216763 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -1056,6 +988,70 @@ object DecisionTree extends Serializable with Logging { } } + private def findSplitsBinsBySorting( + input: RDD[LabeledPoint], + metadata: DecisionTreeMetadata, + continuousFeatures: IndexedSeq[Int]): (Array[Array[Split]], Array[Array[Bin]]) = { +def findSplits( +featureIndex: Int, +featureSamples: Iterable[Double]): (Int, (Array[Split], Array[Bin])) = { + val splits = { +val featureSplits = findSplitsForContinuousFeature( + featureSamples.toArray, + metadata, + featureIndex) +logDebug(s"featureIndex = $featureIndex, numSplits = ${featureSplits.length}") + +featureSplits.map(threshold => new Split(featureIndex, threshold, Continuous, Nil)) + } + + val bins = { +val lowSplit = new DummyLowSplit(featureIndex, Continuous) +val highSplit = new DummyHighSplit(featureIndex, Continuous) +(lowSplit +: splits.toSeq :+ highSplit).sliding(2).map { --- End diff -- Done. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/8246#discussion_r39216729 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -1056,6 +988,70 @@ object DecisionTree extends Serializable with Logging { } } + private def findSplitsBinsBySorting( + input: RDD[LabeledPoint], + metadata: DecisionTreeMetadata, + continuousFeatures: IndexedSeq[Int]): (Array[Array[Split]], Array[Array[Bin]]) = { +def findSplits( +featureIndex: Int, +featureSamples: Iterable[Double]): (Int, (Array[Split], Array[Bin])) = { + val splits = { +val featureSplits = findSplitsForContinuousFeature( + featureSamples.toArray, + metadata, + featureIndex) +logDebug(s"featureIndex = $featureIndex, numSplits = ${featureSplits.length}") + +featureSplits.map(threshold => new Split(featureIndex, threshold, Continuous, Nil)) + } + + val bins = { +val lowSplit = new DummyLowSplit(featureIndex, Continuous) +val highSplit = new DummyHighSplit(featureIndex, Continuous) +(lowSplit +: splits.toSeq :+ highSplit).sliding(2).map { + case Seq(lhs, right) => new Bin(lhs, right, Continuous, Double.MinValue) +}.toArray + } + + (featureIndex, (splits, bins)) +} + +val continuousSplits = input + .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx + .groupByKey(numPartitions = math.min(continuousFeatures.length, input.partitions.length)) + .map { case (k, v) => findSplits(k, v) } + .collectAsMap() + +val numFeatures = metadata.numFeatures +val (splits, bins) = Range(0, numFeatures).unzip { + case i if metadata.isContinuous(i) => +val (split, bin) = continuousSplits(i) +metadata.setNumSplits(i, split.length) --- End diff -- Done. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-139350246 I see. I guess this is the issue: [https://issues.apache.org/jira/browse/SPARK-8418] We can discuss more there if it's helpful. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user NathanHowell commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-139349170 @jkbradley i was looking at `Bucketizer`, which seemed to be closest to what i wanted (cumulative density transform), and i certainly could have been doing something wrong... it's not relevant to this ticket though. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-139346952 You should be able to assemble all of the features into one feature vector (using VectorAssembler) and then applying Normalizer, StandardScaler, etc. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user NathanHowell commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-139345239 @jkbradley it's not that uncommon to use sparse features for time series classification (sparse interval features) with RF. training a linear model with spark.ml's pipeline is (was?) prohibitively expensive for this dataset. normalizing features using the spark.ml's transforms generates two stages per feature... rather than two stages total (the first for density estimation and a second for the bucket transform) but that sort of defeats the purpose of using `ml` instead of `mllib`.. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-139341684 Considering sparse representations would be nice, though I think that's a less common use case for trees (vs GLMs). The other (bigger) issue is communication (which is what my new implementation will address). The main difference between MLlib and PLANET right now is that MLlib does not switch to local learning for deep layers of the tree. For comparing, I'm mainly curious about the total number of nodes in the tree, just to make sure they are fairly similar. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user NathanHowell commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-139339009 @jkbradley yeah, I figured out as much. a lot of the internals also assume dense feature space, which isn't very common with a large number of features. i tried it with `useNodeIdCache` on and off and it didn't make much of a difference. checkpointing was also enabled... the dataset fits in memory though (8TB was allocated to RDD storage), i don't know how much that changes things. the lower levels of the trees (near the roots) look similar but i haven't done a thorough comparison of the two. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-139337078 @NathanHowell The real problem for MLlib is having a large number of features; I'd say a few thousand is a reasonable limit right now. I'm working on a new implementation which should be much faster for more features, scheduled for 1.6. Curious: When you ran MLlib, did you try useNodeIdCache, with checkpointing turned on? That should help a little. Also, do you know if the 2 implementations learned similarly sized trees? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user NathanHowell commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-139334977 @manishamde yes, same parameters. this dataset is about 100m examples, not sure offhand on the exact number of features but probably about 5k categorical features (typically with single digit arity) and 100-200k continuous features. the binning slowdown can also be avoided by using a one-hot encoding but it's not convenient. having per feature bin counts would be nice, and it can (probably?) be ignored for categorical features altogether. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-139331864 @NathanHowell @jkbradley We should consider making bins per feature and sample sizes configurable to avoid the side-effects mentioned above. Did the brushfire implementation also have similar number of splits and tree depth? Also, could you describe the training data a little more -- number of samples, number of categorical/continuous features, etc. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user NathanHowell commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-139323929 @jkbradley yes, this fixed a real problem. the problem stemmed from attempting to use categorical features with high arity (250+), which has a side effect of increasing the number of bins used for continuous features and massively increasing the sample size. the sampling and bin calculation would take multiple hours. after the patch it completes in ~2 minutes. training itself would run okay but also wasn't terribly fast. I've had to switch to a different RF implementation (https://github.com/stripe/brushfire) in order to have training complete in a reasonable amount of time. the mllib implementation would train a single tree in 3-4 days on 2000 executors, whereas brushfire does the same in about 6 hours with per-split bin calculation and about 45 minutes with log bucketing. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/8246#discussion_r39185342 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -1056,6 +988,70 @@ object DecisionTree extends Serializable with Logging { } } + private def findSplitsBinsBySorting( + input: RDD[LabeledPoint], + metadata: DecisionTreeMetadata, + continuousFeatures: IndexedSeq[Int]): (Array[Array[Split]], Array[Array[Bin]]) = { +def findSplits( +featureIndex: Int, +featureSamples: Iterable[Double]): (Int, (Array[Split], Array[Bin])) = { + val splits = { +val featureSplits = findSplitsForContinuousFeature( + featureSamples.toArray, + metadata, + featureIndex) +logDebug(s"featureIndex = $featureIndex, numSplits = ${featureSplits.length}") + +featureSplits.map(threshold => new Split(featureIndex, threshold, Continuous, Nil)) + } + + val bins = { +val lowSplit = new DummyLowSplit(featureIndex, Continuous) +val highSplit = new DummyHighSplit(featureIndex, Continuous) +(lowSplit +: splits.toSeq :+ highSplit).sliding(2).map { + case Seq(lhs, right) => new Bin(lhs, right, Continuous, Double.MinValue) +}.toArray + } + + (featureIndex, (splits, bins)) +} + +val continuousSplits = input + .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx + .groupByKey(numPartitions = math.min(continuousFeatures.length, input.partitions.length)) + .map { case (k, v) => findSplits(k, v) } + .collectAsMap() + +val numFeatures = metadata.numFeatures +val (splits, bins) = Range(0, numFeatures).unzip { + case i if metadata.isContinuous(i) => +val (split, bin) = continuousSplits(i) +metadata.setNumSplits(i, split.length) --- End diff -- Can you please remove the similar call to this in findSplitsForContinuousFeature since that will now be run on workers? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-139311498 @NathanHowell I just looked this over, and it seems fine. A few questions: Did you encounter this problem in practice? I would have thought that choosing splits would take much less time than training itself. Was training itself fast for you? Note: There are currently 2 tree implementations, one in spark.mllib (which you're modifying) and one in spark.ml. I eventually want to remove the spark.mllib one and wrap the spark.ml one [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala]. We can commit this here and then copy it over to spark.ml in another PR. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/8246#discussion_r38249912 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -1056,6 +988,70 @@ object DecisionTree extends Serializable with Logging { } } + private def findSplitsBinsBySorting( + input: RDD[LabeledPoint], + metadata: DecisionTreeMetadata, + continuousFeatures: IndexedSeq[Int]): (Array[Array[Split]], Array[Array[Bin]]) = { +def findSplits( +featureIndex: Int, +featureSamples: Iterable[Double]): (Int, (Array[Split], Array[Bin])) = { + val splits = { +val featureSplits = findSplitsForContinuousFeature( + featureSamples.toArray, + metadata, + featureIndex) +logDebug(s"featureIndex = $featureIndex, numSplits = ${featureSplits.length}") + +featureSplits.map(threshold => new Split(featureIndex, threshold, Continuous, Nil)) + } + + val bins = { +val lowSplit = new DummyLowSplit(featureIndex, Continuous) +val highSplit = new DummyHighSplit(featureIndex, Continuous) +(lowSplit +: splits.toSeq :+ highSplit).sliding(2).map { + case Seq(lhs, right) => new Bin(lhs, right, Continuous, Double.MinValue) +}.toArray + } + + (featureIndex, (splits, bins)) +} + +val continuousSplits = input + .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx + .groupByKey(numPartitions = math.min(continuousFeatures.length, input.partitions.length)) --- End diff -- numPartitions logic could be extracted out in a separate line and a small explanation would help. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/8246#discussion_r38249715 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -1056,6 +988,70 @@ object DecisionTree extends Serializable with Logging { } } + private def findSplitsBinsBySorting( + input: RDD[LabeledPoint], + metadata: DecisionTreeMetadata, + continuousFeatures: IndexedSeq[Int]): (Array[Array[Split]], Array[Array[Bin]]) = { +def findSplits( +featureIndex: Int, +featureSamples: Iterable[Double]): (Int, (Array[Split], Array[Bin])) = { + val splits = { +val featureSplits = findSplitsForContinuousFeature( + featureSamples.toArray, + metadata, + featureIndex) +logDebug(s"featureIndex = $featureIndex, numSplits = ${featureSplits.length}") + +featureSplits.map(threshold => new Split(featureIndex, threshold, Continuous, Nil)) + } + + val bins = { +val lowSplit = new DummyLowSplit(featureIndex, Continuous) +val highSplit = new DummyHighSplit(featureIndex, Continuous) +(lowSplit +: splits.toSeq :+ highSplit).sliding(2).map { --- End diff -- A comment here will be helpful. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-135866073 Thanks @NathanHowell Sorry for not responding earlier. Will try to review soon. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user NathanHowell commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-132293317 Will do, thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-132291972 This sounds great, thanks! I'll need to finish up with QA for 1.5 before taking a look, but please ping me if I don't return to review before long. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user NathanHowell commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-132283376 cc/ @jkbradley @manishamde @mengxr --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-131931086 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...
GitHub user NathanHowell opened a pull request: https://github.com/apache/spark/pull/8246 [SPARK-10064] [ML] Parallelize decision tree bin split calculations Reimplement `DecisionTree.findSplitsBins` via `RDD` to parallelize bin calculation. With large feature spaces the current implementation is very slow. This change limits the features that are distributed (or collected) to just the continuous features, and performs the split calculations in parallel. It completes on a real multi terabyte dataset in less than a minute instead of multiple hours. You can merge this pull request into a Git repository by running: $ git pull https://github.com/NathanHowell/spark SPARK-10064 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8246.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8246 commit 93c9e71f6c62b122647c4bdd2f9dead710ae360f Author: Nathan Howell Date: 2015-08-14T23:39:20Z [SPARK-10064] [ML] Parallelize decision tree bin split calculations --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org