[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/19666

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user tengpeng commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r175646373

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/DecisionTreeMetadata.scala ---
@@ -152,15 +152,13 @@ private[spark] object DecisionTreeMetadata extends Logging {
       // TODO(SPARK-9957): Handle this properly by filtering out those features.
       if (numCategories > 1) {
         // Decide if some categorical features should be treated as unordered features,
--- End diff --

Change "," to "."
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149567340

--- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -631,6 +614,42 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
     val expected = Map(0 -> 1.0 / 3.0, 2 -> 2.0 / 3.0)
     assert(mapToVec(map.toMap) ~== mapToVec(expected) relTol 0.01)
   }
+
+  test("traverseUnorderedSplits") {
+
--- End diff --

So how do we test all possible splits to make sure the generated splits are all correct? Once a tree is generated, only the best split remains.
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149561550

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -741,17 +678,43 @@ private[spark] object RandomForest extends Logging {
           (splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats)
         } else if (binAggregates.metadata.isUnordered(featureIndex)) {
           // Unordered categorical feature
-          val leftChildOffset = binAggregates.getFeatureOffset(featureIndexIdx)
-          val (bestFeatureSplitIndex, bestFeatureGainStats) =
-            Range(0, numSplits).map { splitIndex =>
-              val leftChildStats = binAggregates.getImpurityCalculator(leftChildOffset, splitIndex)
-              val rightChildStats = binAggregates.getParentImpurityCalculator()
-                .subtract(leftChildStats)
+          val numBins = binAggregates.metadata.numBins(featureIndex)
+          val featureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
+
+          val binStatsArray = Array.tabulate(numBins) { binIndex =>
+            binAggregates.getImpurityCalculator(featureOffset, binIndex)
+          }
+          val parentStats = binAggregates.getParentImpurityCalculator()
+
+          var bestGain = Double.NegativeInfinity
+          var bestSet: BitSet = null
+          var bestLeftChildStats: ImpurityCalculator = null
+          var bestRightChildStats: ImpurityCalculator = null
+
+          traverseUnorderedSplits[ImpurityCalculator](numBins, null,
+            (stats, binIndex) => {
+              val binStats = binStatsArray(binIndex)
+              if (stats == null) {
+                binStats
+              } else {
+                stats.copy.add(binStats)
+              }
+            },
+            (set, leftChildStats) => {
+              val rightChildStats = parentStats.copy.subtract(leftChildStats)
               gainAndImpurityStats = calculateImpurityStats(gainAndImpurityStats,
                 leftChildStats, rightChildStats, binAggregates.metadata)
-              (splitIndex, gainAndImpurityStats)
-            }.maxBy(_._2.gain)
-          (splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats)
+              if (gainAndImpurityStats.gain > bestGain) {
+                bestGain = gainAndImpurityStats.gain
+                bestSet = set | new BitSet(numBins) // copy set
--- End diff --

The class does not support `copy`.
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149313427

--- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -631,6 +614,42 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
     val expected = Map(0 -> 1.0 / 3.0, 2 -> 2.0 / 3.0)
     assert(mapToVec(map.toMap) ~== mapToVec(expected) relTol 0.01)
   }
+
+  test("traverseUnorderedSplits") {
+
--- End diff --

Since `traverseUnorderedSplits` is a private method, I wonder whether we can check the unordered splits on DecisionTree directly? For example, create a tiny dataset and generate a shallow tree (depth = 1?). I know the test case is difficult (maybe impossible) to design; however, it focuses on behavior instead of implementation.
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149274660

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -976,6 +930,44 @@ private[spark] object RandomForest extends Logging {
     categories
   }

+  private[tree] def traverseUnorderedSplits[T](
+      arity: Int,
+      zeroStats: T,
+      seqOp: (T, Int) => T,
+      finalizer: (BitSet, T) => Unit): Unit = {
+    assert(arity > 1)
+
+    // numSplits = (1 << arity - 1) - 1
+    val numSplits = DecisionTreeMetadata.numUnorderedSplits(arity)
+    val subSet: BitSet = new BitSet(arity)
+
+    // dfs traverse
+    // binIndex: [0, arity)
+    def dfs(binIndex: Int, combNumber: Int, stats: T): Unit = {
+      if (binIndex == arity) {
+        // recursion exit when binIndex == arity
+        if (combNumber > 0) {
+          // we get an available unordered split, saved in subSet.
+          finalizer(subSet, stats)
+        }
+      } else {
+        subSet.set(binIndex)
+        val leftChildCombNumber = combNumber + (1 << binIndex)
+        // pruning: only need combNumber satisfy: 1 <= combNumber <= numSplits
--- End diff --

Yes. For example, "00101" and "11010" are equivalent splits, so we should traverse only one of them. Here I use the condition `1 <= combNumber <= numSplits` to do the pruning; it simply filters out the other half of the splits.
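The pruned depth-first traversal discussed above can be sketched standalone. This is a hypothetical simplification (not the Spark implementation): it drops the statistics accumulator and just enumerates each unordered split exactly once, keeping only left-child subsets whose binary encoding lies in `[1, numSplits]` so that a split and its complement are never both visited.

```scala
import scala.collection.mutable

// Sketch of the pruned DFS over unordered splits of a categorical feature.
// A split assigns each of `arity` categories to the left or right child; a
// split and its complement are equivalent, leaving 2^(arity-1) - 1 distinct
// splits. Encoding the left subset as a bitmask and only recursing while the
// mask stays <= numSplits visits each split exactly once.
object UnorderedSplitTraversal {
  def traverse(arity: Int)(visit: Set[Int] => Unit): Unit = {
    require(arity > 1)
    val numSplits = (1 << (arity - 1)) - 1
    val subset = mutable.BitSet.empty

    // binIndex walks the categories [0, arity); combNumber is the bitmask of
    // categories currently assigned to the left child.
    def dfs(binIndex: Int, combNumber: Int): Unit = {
      if (binIndex == arity) {
        // A complete assignment; skip the empty left side.
        if (combNumber > 0) visit(subset.toSet)
      } else {
        // Branch 1: put this category on the left (only if still unpruned).
        subset += binIndex
        val withBin = combNumber + (1 << binIndex)
        if (withBin <= numSplits) dfs(binIndex + 1, withBin)
        subset -= binIndex
        // Branch 2: leave this category on the right.
        dfs(binIndex + 1, combNumber)
      }
    }
    dfs(0, 0)
  }
}
```

For `arity = 3` this visits the three splits with left sides {0}, {1}, and {0, 1}; the complements {1, 2}, {0, 2}, and {2} are pruned away.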
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149238531

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -976,6 +930,44 @@ private[spark] object RandomForest extends Logging {
     categories
   }

+  private[tree] def traverseUnorderedSplits[T](
--- End diff --

Could you please add a docstring for this method, since it's a bit complicated?
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149238185

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -741,17 +678,43 @@ private[spark] object RandomForest extends Logging {
           (splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats)
         } else if (binAggregates.metadata.isUnordered(featureIndex)) {
           // Unordered categorical feature
-          val leftChildOffset = binAggregates.getFeatureOffset(featureIndexIdx)
-          val (bestFeatureSplitIndex, bestFeatureGainStats) =
-            Range(0, numSplits).map { splitIndex =>
-              val leftChildStats = binAggregates.getImpurityCalculator(leftChildOffset, splitIndex)
-              val rightChildStats = binAggregates.getParentImpurityCalculator()
-                .subtract(leftChildStats)
+          val numBins = binAggregates.metadata.numBins(featureIndex)
+          val featureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
+
+          val binStatsArray = Array.tabulate(numBins) { binIndex =>
+            binAggregates.getImpurityCalculator(featureOffset, binIndex)
+          }
+          val parentStats = binAggregates.getParentImpurityCalculator()
+
+          var bestGain = Double.NegativeInfinity
+          var bestSet: BitSet = null
+          var bestLeftChildStats: ImpurityCalculator = null
+          var bestRightChildStats: ImpurityCalculator = null
+
+          traverseUnorderedSplits[ImpurityCalculator](numBins, null,
--- End diff --

Could you please add a comment explaining what this does? E.g.: `// Computes the best split for the current feature, storing the result across the vars above`
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149237212

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -741,17 +678,43 @@ private[spark] object RandomForest extends Logging {
           (splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats)
         } else if (binAggregates.metadata.isUnordered(featureIndex)) {
           // Unordered categorical feature
-          val leftChildOffset = binAggregates.getFeatureOffset(featureIndexIdx)
-          val (bestFeatureSplitIndex, bestFeatureGainStats) =
-            Range(0, numSplits).map { splitIndex =>
-              val leftChildStats = binAggregates.getImpurityCalculator(leftChildOffset, splitIndex)
-              val rightChildStats = binAggregates.getParentImpurityCalculator()
-                .subtract(leftChildStats)
+          val numBins = binAggregates.metadata.numBins(featureIndex)
+          val featureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
+
+          val binStatsArray = Array.tabulate(numBins) { binIndex =>
--- End diff --

Could you please add a comment explaining what this is? E.g.: `// Each element of binStatsArray stores pre-computed label statistics for a single bin of the current feature`
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149242575

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -976,6 +930,44 @@ private[spark] object RandomForest extends Logging {
     categories
   }

+  private[tree] def traverseUnorderedSplits[T](
--- End diff --

Also, does `traverseUnorderedSplits` need to take a type parameter / two different closures as method arguments? AFAICT the use of a type parameter/closures here allows us to unit test this functionality on a simple example, but I wonder if we could simplify this somehow.
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149238123

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -741,17 +678,43 @@ private[spark] object RandomForest extends Logging {
           (splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats)
         } else if (binAggregates.metadata.isUnordered(featureIndex)) {
           // Unordered categorical feature
-          val leftChildOffset = binAggregates.getFeatureOffset(featureIndexIdx)
-          val (bestFeatureSplitIndex, bestFeatureGainStats) =
-            Range(0, numSplits).map { splitIndex =>
-              val leftChildStats = binAggregates.getImpurityCalculator(leftChildOffset, splitIndex)
-              val rightChildStats = binAggregates.getParentImpurityCalculator()
-                .subtract(leftChildStats)
+          val numBins = binAggregates.metadata.numBins(featureIndex)
+          val featureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
+
+          val binStatsArray = Array.tabulate(numBins) { binIndex =>
+            binAggregates.getImpurityCalculator(featureOffset, binIndex)
+          }
+          val parentStats = binAggregates.getParentImpurityCalculator()
+
+          var bestGain = Double.NegativeInfinity
+          var bestSet: BitSet = null
+          var bestLeftChildStats: ImpurityCalculator = null
+          var bestRightChildStats: ImpurityCalculator = null
+
+          traverseUnorderedSplits[ImpurityCalculator](numBins, null,
+            (stats, binIndex) => {
+              val binStats = binStatsArray(binIndex)
+              if (stats == null) {
+                binStats
+              } else {
+                stats.copy.add(binStats)
+              }
+            },
+            (set, leftChildStats) => {
+              val rightChildStats = parentStats.copy.subtract(leftChildStats)
               gainAndImpurityStats = calculateImpurityStats(gainAndImpurityStats,
                 leftChildStats, rightChildStats, binAggregates.metadata)
-              (splitIndex, gainAndImpurityStats)
-            }.maxBy(_._2.gain)
-          (splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats)
+              if (gainAndImpurityStats.gain > bestGain) {
+                bestGain = gainAndImpurityStats.gain
+                bestSet = set | new BitSet(numBins) // copy set
--- End diff --

Why not use `set.copy()`?
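The `set | new BitSet(numBins)` idiom questioned here can be illustrated with the standard library. This is a hypothetical stand-in using `scala.collection.mutable.BitSet` rather than Spark's `org.apache.spark.util.collection.BitSet`: when a bit-set class exposes no copy method, OR-ing with an empty set yields a fresh, independent copy.

```scala
import scala.collection.mutable

// Copy-by-union: `|` on a mutable BitSet returns a new BitSet, so taking the
// union with an empty set snapshots the current contents.
object BitSetCopy {
  def copyByUnion(set: mutable.BitSet): mutable.BitSet =
    set | mutable.BitSet.empty

  def main(args: Array[String]): Unit = {
    val original = mutable.BitSet(0, 2, 3)
    val snapshot = copyByUnion(original)
    original += 5                 // mutate the original afterwards
    println(snapshot)             // snapshot is unaffected by the mutation
  }
}
```

This matters in the surrounding diff because `subSet` is mutated in place during the DFS, so the best split found so far must be snapshotted rather than aliased.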
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149241680

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -976,6 +930,44 @@ private[spark] object RandomForest extends Logging {
     categories
   }

+  private[tree] def traverseUnorderedSplits[T](
+      arity: Int,
+      zeroStats: T,
+      seqOp: (T, Int) => T,
+      finalizer: (BitSet, T) => Unit): Unit = {
+    assert(arity > 1)
+
+    // numSplits = (1 << arity - 1) - 1
+    val numSplits = DecisionTreeMetadata.numUnorderedSplits(arity)
+    val subSet: BitSet = new BitSet(arity)
+
+    // dfs traverse
+    // binIndex: [0, arity)
+    def dfs(binIndex: Int, combNumber: Int, stats: T): Unit = {
+      if (binIndex == arity) {
+        // recursion exit when binIndex == arity
+        if (combNumber > 0) {
+          // we get an available unordered split, saved in subSet.
+          finalizer(subSet, stats)
+        }
+      } else {
+        subSet.set(binIndex)
+        val leftChildCombNumber = combNumber + (1 << binIndex)
+        // pruning: only need combNumber satisfy: 1 <= combNumber <= numSplits
--- End diff --

If I understand correctly, the check `if (leftChildCombNumber <= numSplits)` helps us ensure that we consider each split only once, right?
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19666

[SPARK-22451][ML] Reduce decision tree aggregate size for unordered features from O(2^numCategories) to O(numCategories)

## What changes were proposed in this pull request?

We do not need to generate all possible splits for unordered features before aggregation:
- Change `mixedBinSeqOp` (which runs on executors) so that each unordered feature is aggregated the same way as an ordered feature. As a result, each unordered feature needs only O(numCategories) space in the feature statistics.
- After the driver receives the aggregate result, generate all possible split combinations and compute the best split.

This reduces the decision tree aggregate size for each unordered feature from O(2^numCategories) to O(numCategories), where numCategories is the arity of the unordered feature. It also reduces the CPU cost on the executor side: the time complexity for each unordered feature drops from O(numPoints * 2^numCategories) to O(numPoints), without increasing the time complexity of the driver-side best-split computation.

## How was this patch tested?

UT added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark improve_decision_tree

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19666.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19666

commit 93c7e0fdf8a16a4e14adfe1891b0a4dc018f6de8
Author: WeichenXu
Date: 2017-11-05T13:35:45Z

    init pr

commit e79abfd2a231ff29abc4be1723002542c85189df
Author: WeichenXu
Date: 2017-11-06T09:41:04Z

    add unit test

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
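The size reduction claimed above can be made concrete with a small illustration (not Spark code; `numUnorderedSplits` here is a local re-derivation of the formula, not the Spark method of the same name). Before this change, the executor-side aggregate held one statistics bucket per possible unordered split, i.e. 2^(M-1) - 1 buckets for a feature of arity M; afterwards it holds one bucket per category, i.e. M buckets.

```scala
// Compare per-feature aggregate sizes before and after the change.
object UnorderedAggregateSize {
  // Number of distinct unordered splits of a categorical feature with
  // numCategories values: each split is a non-trivial subset of categories,
  // and a subset and its complement describe the same split.
  def numUnorderedSplits(numCategories: Int): Int =
    (1 << (numCategories - 1)) - 1

  def main(args: Array[String]): Unit = {
    for (m <- Seq(4, 8, 16)) {
      println(s"arity=$m: per-split buckets=${numUnorderedSplits(m)}, " +
        s"per-category buckets=$m")
    }
  }
}
```

For arity 16 this is 32767 buckets versus 16, which is the O(2^numCategories) to O(numCategories) reduction the PR title refers to.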