[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2018-07-09 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at:

https://github.com/apache/spark/pull/19666


---




[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2018-03-19 Thread tengpeng
Github user tengpeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/19666#discussion_r175646373
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/DecisionTreeMetadata.scala ---
@@ -152,15 +152,13 @@ private[spark] object DecisionTreeMetadata extends Logging {
     // TODO(SPARK-9957): Handle this properly by filtering out those features.
     if (numCategories > 1) {
       // Decide if some categorical features should be treated as unordered features,
--- End diff --

Change `,` to `.`


---




[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2018-03-19 Thread tengpeng
Github user tengpeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/19666#discussion_r175646335
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/DecisionTreeMetadata.scala ---
@@ -152,15 +152,13 @@ private[spark] object DecisionTreeMetadata extends Logging {
     // TODO(SPARK-9957): Handle this properly by filtering out those features.
     if (numCategories > 1) {
       // Decide if some categorical features should be treated as unordered features,
--- End diff --

Change `,` to `.`


---




[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-07 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19666#discussion_r149567340
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -631,6 +614,42 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
     val expected = Map(0 -> 1.0 / 3.0, 2 -> 2.0 / 3.0)
     assert(mapToVec(map.toMap) ~== mapToVec(expected) relTol 0.01)
   }
+
+  test("traverseUnorderedSplits") {
+
--- End diff --

So how can we test all possible splits to make sure the generated splits are all correct? Once a tree is trained, only the best split is retained.
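
One option (a sketch, not the PR's actual test, relying only on the closure-based signature of `traverseUnorderedSplits` shown in the diff) is to instantiate the type parameter with `Set[Int]` and collect every visited split:

```scala
import scala.collection.mutable
import org.apache.spark.util.collection.BitSet

val arity = 4
val visited = mutable.Set.empty[Set[Int]]

// T = Set[Int]: the accumulated "stats" are simply the bin indices placed
// in the left child, so the generated splits can be checked directly.
RandomForest.traverseUnorderedSplits[Set[Int]](arity, Set.empty[Int],
  (stats, binIndex) => stats + binIndex,
  (set, stats) => {
    // The BitSet and the accumulated stats must describe the same subset.
    assert((0 until arity).filter(set.get).toSet == stats)
    visited += stats
  })

// Exactly 2^(arity-1) - 1 distinct splits are visited...
assert(visited.size == (1 << (arity - 1)) - 1)
// ...and no split appears together with its equivalent complement.
val full = (0 until arity).toSet
visited.foreach(s => assert(!visited.contains(full -- s)))
```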


---




[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-07 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19666#discussion_r149561550
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -741,17 +678,43 @@ private[spark] object RandomForest extends Logging {
           (splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats)
         } else if (binAggregates.metadata.isUnordered(featureIndex)) {
           // Unordered categorical feature
-          val leftChildOffset = binAggregates.getFeatureOffset(featureIndexIdx)
-          val (bestFeatureSplitIndex, bestFeatureGainStats) =
-            Range(0, numSplits).map { splitIndex =>
-              val leftChildStats = binAggregates.getImpurityCalculator(leftChildOffset, splitIndex)
-              val rightChildStats = binAggregates.getParentImpurityCalculator()
-                .subtract(leftChildStats)
+          val numBins = binAggregates.metadata.numBins(featureIndex)
+          val featureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
+
+          val binStatsArray = Array.tabulate(numBins) { binIndex =>
+            binAggregates.getImpurityCalculator(featureOffset, binIndex)
+          }
+          val parentStats = binAggregates.getParentImpurityCalculator()
+
+          var bestGain = Double.NegativeInfinity
+          var bestSet: BitSet = null
+          var bestLeftChildStats: ImpurityCalculator = null
+          var bestRightChildStats: ImpurityCalculator = null
+
+          traverseUnorderedSplits[ImpurityCalculator](numBins, null,
+            (stats, binIndex) => {
+              val binStats = binStatsArray(binIndex)
+              if (stats == null) {
+                binStats
+              } else {
+                stats.copy.add(binStats)
+              }
+            },
+            (set, leftChildStats) => {
+              val rightChildStats = parentStats.copy.subtract(leftChildStats)
               gainAndImpurityStats = calculateImpurityStats(gainAndImpurityStats,
                 leftChildStats, rightChildStats, binAggregates.metadata)
-              (splitIndex, gainAndImpurityStats)
-            }.maxBy(_._2.gain)
-          (splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats)
+              if (gainAndImpurityStats.gain > bestGain) {
+                bestGain = gainAndImpurityStats.gain
+                bestSet = set | new BitSet(numBins) // copy set
--- End diff --

The class does not support `copy`.
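
For reference, this is the idiom used above (a sketch; `|` on Spark's `org.apache.spark.util.collection.BitSet` allocates and returns a new instance, so OR-ing with an empty set of the same capacity behaves like a copy):

```scala
import org.apache.spark.util.collection.BitSet

// BitSet has no copy() method, so copy by OR-ing with an empty set of
// the same capacity; `|` returns a freshly allocated BitSet.
def copyBitSet(set: BitSet, capacity: Int): BitSet = set | new BitSet(capacity)
```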


---




[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-07 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19666#discussion_r149313427
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -631,6 +614,42 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
     val expected = Map(0 -> 1.0 / 3.0, 2 -> 2.0 / 3.0)
     assert(mapToVec(map.toMap) ~== mapToVec(expected) relTol 0.01)
   }
+
+  test("traverseUnorderedSplits") {
+
--- End diff --

Since `traverseUnorderedSplits` is a private method, I wonder whether we can check the unordered splits on DecisionTree directly? For example, create a tiny dataset and generate a shallow tree (depth = 1?). I know such a test case is difficult (maybe impossible) to design, but it focuses on behavior instead of implementation.
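
Something along these lines, perhaps (a rough sketch via the old `mllib` API, assuming `sc` from `MLlibTestSparkContext`; the data and expected grouping are illustrative assumptions, not the PR's test). With `numClasses = 3`, a 3-ary categorical feature is treated as unordered:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

// Categories 0 and 2 always carry label 0.0; category 1 carries label 1.0.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0)),
  LabeledPoint(0.0, Vectors.dense(2.0)),
  LabeledPoint(1.0, Vectors.dense(1.0)),
  LabeledPoint(1.0, Vectors.dense(1.0))))

val model = DecisionTree.trainClassifier(data, numClasses = 3,
  categoricalFeaturesInfo = Map(0 -> 3), impurity = "gini",
  maxDepth = 1, maxBins = 32)

// The depth-1 tree's root split should group categories {0, 2} on one
// side and {1} on the other.
val leftCategories = model.topNode.split.get.categories.toSet
assert(leftCategories == Set(0.0, 2.0) || leftCategories == Set(1.0))
```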


---




[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-06 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19666#discussion_r149274660
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -976,6 +930,44 @@ private[spark] object RandomForest extends Logging {
     categories
   }
 
+  private[tree] def traverseUnorderedSplits[T](
+      arity: Int,
+      zeroStats: T,
+      seqOp: (T, Int) => T,
+      finalizer: (BitSet, T) => Unit): Unit = {
+    assert(arity > 1)
+
+    // numSplits = (1 << arity - 1) - 1
+    val numSplits = DecisionTreeMetadata.numUnorderedSplits(arity)
+    val subSet: BitSet = new BitSet(arity)
+
+    // dfs traverse
+    // binIndex: [0, arity)
+    def dfs(binIndex: Int, combNumber: Int, stats: T): Unit = {
+      if (binIndex == arity) {
+        // recursion exit when binIndex == arity
+        if (combNumber > 0) {
+          // we get an available unordered split, saved in subSet.
+          finalizer(subSet, stats)
+        }
+      } else {
+        subSet.set(binIndex)
+        val leftChildCombNumber = combNumber + (1 << binIndex)
+        // pruning: only need combNumber satisfy: 1 <= combNumber <= numSplits
--- End diff --

Yes. For example, "00101" and "11010" are equivalent splits, so we should traverse only one of them. Here I use the condition `1 <= combNumber <= numSplits` to do the pruning; it simply filters out the other half of the splits.
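
A tiny worked example of the encoding, for `arity = 3` (a sketch, not PR code):

```scala
val arity = 3
// A subset is encoded as combNumber = sum of (1 << binIndex) over its bins.
val numSplits = (1 << (arity - 1)) - 1  // = 3
val kept = (1 to (1 << arity) - 1).filter(_ <= numSplits)
// kept == Seq(1, 2, 3): binary 001 = {0}, 010 = {1}, 011 = {0, 1}.
// The complements 110 = {1, 2}, 101 = {0, 2}, 100 = {2} encode the same
// three splits and are skipped: combNumber <= numSplits holds exactly when
// bin (arity - 1) is excluded, so one member of each pair survives.
```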


---




[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-06 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19666#discussion_r149238531
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -976,6 +930,44 @@ private[spark] object RandomForest extends Logging {
     categories
   }
 
+  private[tree] def traverseUnorderedSplits[T](
--- End diff --

Could you please add a docstring for this method, since it's a bit 
complicated?


---




[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-06 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19666#discussion_r149238185
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -741,17 +678,43 @@ private[spark] object RandomForest extends Logging {
           (splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats)
         } else if (binAggregates.metadata.isUnordered(featureIndex)) {
           // Unordered categorical feature
-          val leftChildOffset = binAggregates.getFeatureOffset(featureIndexIdx)
-          val (bestFeatureSplitIndex, bestFeatureGainStats) =
-            Range(0, numSplits).map { splitIndex =>
-              val leftChildStats = binAggregates.getImpurityCalculator(leftChildOffset, splitIndex)
-              val rightChildStats = binAggregates.getParentImpurityCalculator()
-                .subtract(leftChildStats)
+          val numBins = binAggregates.metadata.numBins(featureIndex)
+          val featureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
+
+          val binStatsArray = Array.tabulate(numBins) { binIndex =>
+            binAggregates.getImpurityCalculator(featureOffset, binIndex)
+          }
+          val parentStats = binAggregates.getParentImpurityCalculator()
+
+          var bestGain = Double.NegativeInfinity
+          var bestSet: BitSet = null
+          var bestLeftChildStats: ImpurityCalculator = null
+          var bestRightChildStats: ImpurityCalculator = null
+
+          traverseUnorderedSplits[ImpurityCalculator](numBins, null,
--- End diff --

Could you please add a comment explaining what this does? E.g.:
`// Computes the best split for the current feature, storing the result across the vars above`


---




[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-06 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19666#discussion_r149237212
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -741,17 +678,43 @@ private[spark] object RandomForest extends Logging {
           (splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats)
         } else if (binAggregates.metadata.isUnordered(featureIndex)) {
           // Unordered categorical feature
-          val leftChildOffset = binAggregates.getFeatureOffset(featureIndexIdx)
-          val (bestFeatureSplitIndex, bestFeatureGainStats) =
-            Range(0, numSplits).map { splitIndex =>
-              val leftChildStats = binAggregates.getImpurityCalculator(leftChildOffset, splitIndex)
-              val rightChildStats = binAggregates.getParentImpurityCalculator()
-                .subtract(leftChildStats)
+          val numBins = binAggregates.metadata.numBins(featureIndex)
+          val featureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
+
+          val binStatsArray = Array.tabulate(numBins) { binIndex =>
--- End diff --

Could you please add a comment explaining what this is? E.g.:
`// Each element of binStatsArray stores pre-computed label statistics for a single bin of the current feature`


---




[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-06 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19666#discussion_r149242575
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -976,6 +930,44 @@ private[spark] object RandomForest extends Logging {
     categories
   }
 
+  private[tree] def traverseUnorderedSplits[T](
--- End diff --

Also, does `traverseUnorderedSplits` need to take a type parameter / two different closures as method arguments? AFAICT the use of a type parameter/closures here allows us to unit test this functionality on a simple example, but I wonder if we could simplify this somehow.


---




[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-06 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19666#discussion_r149238123
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -741,17 +678,43 @@ private[spark] object RandomForest extends Logging {
           (splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats)
         } else if (binAggregates.metadata.isUnordered(featureIndex)) {
           // Unordered categorical feature
-          val leftChildOffset = binAggregates.getFeatureOffset(featureIndexIdx)
-          val (bestFeatureSplitIndex, bestFeatureGainStats) =
-            Range(0, numSplits).map { splitIndex =>
-              val leftChildStats = binAggregates.getImpurityCalculator(leftChildOffset, splitIndex)
-              val rightChildStats = binAggregates.getParentImpurityCalculator()
-                .subtract(leftChildStats)
+          val numBins = binAggregates.metadata.numBins(featureIndex)
+          val featureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
+
+          val binStatsArray = Array.tabulate(numBins) { binIndex =>
+            binAggregates.getImpurityCalculator(featureOffset, binIndex)
+          }
+          val parentStats = binAggregates.getParentImpurityCalculator()
+
+          var bestGain = Double.NegativeInfinity
+          var bestSet: BitSet = null
+          var bestLeftChildStats: ImpurityCalculator = null
+          var bestRightChildStats: ImpurityCalculator = null
+
+          traverseUnorderedSplits[ImpurityCalculator](numBins, null,
+            (stats, binIndex) => {
+              val binStats = binStatsArray(binIndex)
+              if (stats == null) {
+                binStats
+              } else {
+                stats.copy.add(binStats)
+              }
+            },
+            (set, leftChildStats) => {
+              val rightChildStats = parentStats.copy.subtract(leftChildStats)
               gainAndImpurityStats = calculateImpurityStats(gainAndImpurityStats,
                 leftChildStats, rightChildStats, binAggregates.metadata)
-              (splitIndex, gainAndImpurityStats)
-            }.maxBy(_._2.gain)
-          (splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats)
+              if (gainAndImpurityStats.gain > bestGain) {
+                bestGain = gainAndImpurityStats.gain
+                bestSet = set | new BitSet(numBins) // copy set
--- End diff --

Why not use `set.copy()`?


---




[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-06 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19666#discussion_r149241680
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -976,6 +930,44 @@ private[spark] object RandomForest extends Logging {
     categories
   }
 
+  private[tree] def traverseUnorderedSplits[T](
+      arity: Int,
+      zeroStats: T,
+      seqOp: (T, Int) => T,
+      finalizer: (BitSet, T) => Unit): Unit = {
+    assert(arity > 1)
+
+    // numSplits = (1 << arity - 1) - 1
+    val numSplits = DecisionTreeMetadata.numUnorderedSplits(arity)
+    val subSet: BitSet = new BitSet(arity)
+
+    // dfs traverse
+    // binIndex: [0, arity)
+    def dfs(binIndex: Int, combNumber: Int, stats: T): Unit = {
+      if (binIndex == arity) {
+        // recursion exit when binIndex == arity
+        if (combNumber > 0) {
+          // we get an available unordered split, saved in subSet.
+          finalizer(subSet, stats)
+        }
+      } else {
+        subSet.set(binIndex)
+        val leftChildCombNumber = combNumber + (1 << binIndex)
+        // pruning: only need combNumber satisfy: 1 <= combNumber <= numSplits
--- End diff --

If I understand correctly, the check `if (leftChildCombNumber <= 
numSplits)` helps us ensure that we consider each split only once, right?


---




[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-06 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/19666

[SPARK-22451][ML] Reduce decision tree aggregate size for unordered 
features from O(2^numCategories) to O(numCategories)

## What changes were proposed in this pull request?

We do not need to generate all possible splits for unordered features before aggregation:

- Change `mixedBinSeqOp` (which runs on the executors) so that each unordered feature is aggregated the same way as an ordered feature. For unordered features we then need only O(numCategories) space per feature for this stat.

- After the driver side gets the aggregate result, generate all possible split combinations and compute the best split.

This reduces the decision tree aggregate size for each unordered feature from `O(2^numCategories)` to `O(numCategories)`, where `numCategories` is the arity of the unordered feature.

This also reduces the CPU cost on the executor side: the time complexity for each unordered feature drops from `O(numPoints * 2^numCategories)` to `O(numPoints)`.

It does not increase the time complexity of computing the best split for unordered features on the driver side.
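
Schematically, the idea is as follows (a minimal sketch with plain label-count arrays standing in for Spark's `ImpurityCalculator`s; all names and data here are illustrative, not the PR's code):

```scala
// Toy data: (categoryIndex, labelIndex) pairs for one unordered feature.
val arity = 3
val numLabels = 2
val points = Seq((0, 0), (1, 1), (2, 0), (1, 1), (0, 0))

// Executor side: one label-count vector per category -- O(arity) space and
// O(1) work per point, instead of updating 2^(arity-1) - 1 split bins.
val statsPerCategory = Array.fill(arity)(Array.fill(numLabels)(0.0))
points.foreach { case (cat, label) => statsPerCategory(cat)(label) += 1 }

// Driver side: rebuild each candidate split's left-child stats by summing
// the per-category vectors over the split's subset of categories.
val numSplits = (1 << (arity - 1)) - 1
val splitStats = (1 to numSplits).map { comb =>
  val leftCats = (0 until arity).filter(i => (comb & (1 << i)) != 0)
  val left = Array.fill(numLabels)(0.0)
  for (c <- leftCats; l <- 0 until numLabels) left(l) += statsPerCategory(c)(l)
  (leftCats, left)
}
// splitStats now holds the same left-child statistics the old code
// aggregated directly on the executors, ready for gain computation.
```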

## How was this patch tested?

UT added.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark improve_decision_tree

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19666.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19666


commit 93c7e0fdf8a16a4e14adfe1891b0a4dc018f6de8
Author: WeichenXu 
Date:   2017-11-05T13:35:45Z

init pr

commit e79abfd2a231ff29abc4be1723002542c85189df
Author: WeichenXu 
Date:   2017-11-06T09:41:04Z

add unit test




---
