[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...
Github user lucio-yz commented on the issue: https://github.com/apache/spark/pull/20472

@srowen Any other problems?

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user lucio-yz commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r171184500

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
       } else {
         val numSplits = metadata.numSplits(featureIndex)
-        // get count for each distinct value
-        val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+        // get count for each distinct value except zero value
+        val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
--- End diff --

have fixed it.
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user lucio-yz commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r171182634

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -931,7 +925,8 @@ private[spark] object RandomForest extends Logging {
       val numFeatures = metadata.numFeatures
       val splits: Array[Array[Split]] = Array.tabulate(numFeatures) {
         case i if metadata.isContinuous(i) =>
-          val split = continuousSplits(i)
+          // some features may contain only zero, so continuousSplits will not have a record
+          val split = if (continuousSplits.contains(i)) continuousSplits(i) else Array.empty[Split]
--- End diff --

have fixed it
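The guard in the diff above can be sketched in isolation. This is a hypothetical standalone version, not the actual Spark source: `Split` here is a stand-in case class for `org.apache.spark.ml.tree.Split`, and `SplitLookup` is a name invented for illustration. The point is that once zero values are no longer shuffled, a feature whose samples are all zero never gets a key in `continuousSplits`, so the lookup must be defensive.

```scala
object SplitLookup {
  // Stand-in for ml.tree.Split; the real class carries more than an index and threshold.
  case class Split(featureIndex: Int, threshold: Double)

  // Mirror of the patched lookup: a feature absent from the map yields no split candidates.
  def splitsFor(continuousSplits: Map[Int, Array[Split]], featureIndex: Int): Array[Split] =
    if (continuousSplits.contains(featureIndex)) continuousSplits(featureIndex)
    else Array.empty[Split] // all-zero feature: no record was ever produced
}
```

With `Map(0 -> Array(Split(0, 0.5)))`, looking up feature 0 returns one split, while looking up feature 1 (all zeros, never shuffled) returns an empty array instead of throwing `NoSuchElementException`, which is exactly the failure the patch prevents.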
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user lucio-yz commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r171182291

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -931,7 +925,8 @@ private[spark] object RandomForest extends Logging {
       val numFeatures = metadata.numFeatures
       val splits: Array[Array[Split]] = Array.tabulate(numFeatures) {
         case i if metadata.isContinuous(i) =>
-          val split = continuousSplits(i)
+          // some features may contain only zero, so continuousSplits will not have a record
--- End diff --

got it
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user lucio-yz commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r171160692

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
       } else {
         val numSplits = metadata.numSplits(featureIndex)
-        // get count for each distinct value
-        val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+        // get count for each distinct value except zero value
+        val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
           case ((m, cnt), x) => (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
         }
+
+        // Calculate the number of samples for finding splits
+        val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt
--- End diff --

I have read the note on the _sample_ function: _sample_ does not guarantee to return exactly the given fraction of the RDD's count. Requiring _numSamples - partNumSamples_ to be non-negative seems a more efficient choice than triggering a _count_. The degree of approximation here depends on the approximation of _sample_, so yes, the splits will be slightly inaccurate.
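The zero-count inference being discussed can be sketched with plain collections standing in for the RDD iterator. `ZeroCountInference` and `valueCounts` are names invented for this sketch; in the actual patch the fold lives inside `findSplitsForContinuousFeature`. The non-zero samples are counted directly, and the number of zeros is recovered as `numSamples - partNumSamples`, clamped at 0 because the upstream `sample()` is approximate and the fold may see more samples than the estimate.

```scala
object ZeroCountInference {
  // Count distinct non-zero values, then infer how many zeros were never materialized.
  def valueCounts(nonZeroSamples: Iterable[Double], numSamples: Int): (Map[Double, Int], Int) = {
    // Same fold as the diff: per-value counts plus a running total of non-zero samples.
    val (partValueCountMap, partNumSamples) =
      nonZeroSamples.foldLeft((Map.empty[Double, Int], 0)) {
        case ((m, cnt), x) => (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
      }
    // Clamp at zero: sample() gives only an approximate fraction, so the
    // estimated total can undershoot the samples actually seen.
    val numZeros = math.max(numSamples - partNumSamples, 0)
    val valueCountMap =
      if (numZeros > 0) partValueCountMap + ((0.0, numZeros)) else partValueCountMap
    (valueCountMap, partNumSamples + numZeros)
  }
}
```

For example, `valueCounts(Seq(1.0, 1.0, 2.0), 10)` yields the map `Map(1.0 -> 2, 2.0 -> 1, 0.0 -> 7)`: seven zeros are accounted for without a single zero ever having been shuffled.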
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user lucio-yz commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r166501185

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +996,19 @@ private[spark] object RandomForest extends Logging {
       } else {
         val numSplits = metadata.numSplits(featureIndex)
-        // get count for each distinct value
-        val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+        // get count for each distinct value except zero value
+        val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
           case ((m, cnt), x) => (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
         }
+
+        // Calculate the number of samples for finding splits
+        val requiredSamples: Long = (samplesFractionForFindSplits(metadata) * metadata.numExamples.toDouble).toLong
--- End diff --

Thanks for your comments, and I'm sorry for my carelessness. I have fixed them all.
[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...
Github user lucio-yz commented on the issue: https://github.com/apache/spark/pull/20472

I tested on 2 datasets:
1. _rcv1.binary_, which has 47,236 dimensions. Before the improvement, the shuffle write size in _findSplitsBySorting_ was 1 GB. After the improvement, the shuffle size is 7.7 MB.
2. _news20.binary_, which has 1,355,191 dimensions. Before the improvement, the shuffle write size in _findSplitsBySorting_ was 51 GB. After the improvement, the shuffle size is 24.1 MB.

P.S. I tested on a cluster of 10 nodes.
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user lucio-yz commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r165872418

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +1002,22 @@ private[spark] object RandomForest extends Logging {
       } else {
         val numSplits = metadata.numSplits(featureIndex)
-        // get count for each distinct value
-        val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+        // get count for each distinct value except zero value
+        val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
           case ((m, cnt), x) => (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
         }
+
+        // Calculate the number of samples for finding splits
+        var requiredSamples: Long = math.max(metadata.maxBins * metadata.maxBins, 1)
--- End diff --

I defined a private method _samplesFractionForFindSplits_ in RandomForest.scala.
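A minimal sketch of what such a helper could look like, inferred from the diff context above. This is an assumption, not the actual Spark method: the real `samplesFractionForFindSplits` is private and takes a `DecisionTreeMetadata`, which is replaced here by plain `maxBins` and `numExamples` arguments, and the `math.max(..., 1)` floor follows the diff shown in this comment.

```scala
object SamplingFraction {
  // Fraction of the input to sample when searching for split thresholds:
  // roughly maxBins^2 samples are wanted, capped at the full dataset.
  def samplesFractionForFindSplits(maxBins: Int, numExamples: Long): Double = {
    val requiredSamples = math.max(maxBins.toLong * maxBins, 1L)
    if (requiredSamples < numExamples) requiredSamples.toDouble / numExamples else 1.0
  }
}
```

For instance, with `maxBins = 32` and 2,048 examples, 1,024 samples are required, giving a fraction of 0.5; with a tiny dataset the fraction saturates at 1.0.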
[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...
Github user lucio-yz commented on the issue: https://github.com/apache/spark/pull/20472

The previous problems have been solved.
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user lucio-yz commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r165595281

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1002,10 +1008,14 @@ private[spark] object RandomForest extends Logging {
       val numSplits = metadata.numSplits(featureIndex)
       // get count for each distinct value
-      val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+      var (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
--- End diff --

The problem has been solved.
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
GitHub user lucio-yz opened a pull request: https://github.com/apache/spark/pull/20472

[SPARK-22751][ML] Improve ML RandomForest shuffle performance

## What changes were proposed in this pull request?

As I mentioned in [SPARK-22751](https://issues.apache.org/jira/browse/SPARK-22751?jql=project%20%3D%20SPARK%20AND%20component%20%3D%20ML%20AND%20text%20~%20randomforest), there is a shuffle performance problem in ML RandomForest when training a RF on high-dimensional data. The reason is that, in org.apache.spark.ml.tree.impl.RandomForest, the function findSplitsBySorting will actually flatMap a sparse vector into a dense vector, so the groupByKey that follows produces a huge shuffle write. To avoid this, we can add a filter after the flatMap to filter out zero values. In findSplitsForContinuousFeature, we can then infer the number of zero values by passing a parameter numInput, the number of samples, to the function. In addition, if a feature contains only zero values, continuousSplits will not have a key for that feature id, so I added a check when using continuousSplits.

## How was this patch tested?

Ran the model locally using spark-submit.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lucio-yz/spark master

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20472.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20472

commit 50cb173dd34dc353c243b97f2686a8c545a03909
Author: lucio <576632108@...>
Date: 2018-02-01T09:47:52Z

    fix mllib randomforest shuffle issue
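The core change described in the PR body can be sketched with plain Scala collections standing in for RDDs. `SparseFindSplits` and `groupedNonZero` are names invented for this sketch, not the Spark API; a sparse row is modeled as a sequence of `(featureIndex, value)` pairs. The essential move is the `filter` between the flatten and the groupBy: zero values never enter the (shuffled) grouping, which is what shrinks the shuffle write from gigabytes to megabytes on sparse data.

```scala
object SparseFindSplits {
  // Sparse row: (featureIndex, value) pairs; zeros may still appear explicitly.
  type Row = Seq[(Int, Double)]

  // Group per-feature sample values, dropping zeros before the group step.
  def groupedNonZero(rows: Seq[Row]): Map[Int, Seq[Double]] =
    rows.flatten                                // analogous to the flatMap in findSplitsBySorting
      .filter { case (_, v) => v != 0.0 }       // the added filter: zeros never enter the shuffle
      .groupBy { case (featureIndex, _) => featureIndex } // analogous to groupByKey
      .map { case (featureIndex, pairs) => (featureIndex, pairs.map(_._2)) }
}
```

Note a consequence visible in this sketch: a feature whose values are all zero produces no key in the resulting map at all, which is why the PR also adds a `continuousSplits.contains(i)` check downstream.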