[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

2018-03-05 Thread lucio-yz
Github user lucio-yz commented on the issue:

https://github.com/apache/spark/pull/20472
  
@srowen Any other problems?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-28 Thread lucio-yz
Github user lucio-yz commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r171184500
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
 } else {
   val numSplits = metadata.numSplits(featureIndex)

-  // get count for each distinct value
-  val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+  // get count for each distinct value except zero value
+  val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
--- End diff --

Have fixed it.


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-28 Thread lucio-yz
Github user lucio-yz commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r171182634
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -931,7 +925,8 @@ private[spark] object RandomForest extends Logging {
 val numFeatures = metadata.numFeatures
 val splits: Array[Array[Split]] = Array.tabulate(numFeatures) {
   case i if metadata.isContinuous(i) =>
-val split = continuousSplits(i)
+// some features may only contains zero, so continuousSplits will not have a record
+val split = if (continuousSplits.contains(i)) continuousSplits(i) else Array.empty[Split]
--- End diff --

Have fixed it.


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-28 Thread lucio-yz
Github user lucio-yz commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r171182291
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -931,7 +925,8 @@ private[spark] object RandomForest extends Logging {
 val numFeatures = metadata.numFeatures
 val splits: Array[Array[Split]] = Array.tabulate(numFeatures) {
   case i if metadata.isContinuous(i) =>
-val split = continuousSplits(i)
+// some features may only contains zero, so continuousSplits will not have a record
--- End diff --

Got it.


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-27 Thread lucio-yz
Github user lucio-yz commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r171160692
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
 } else {
   val numSplits = metadata.numSplits(featureIndex)

-  // get count for each distinct value
-  val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+  // get count for each distinct value except zero value
+  val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
 case ((m, cnt), x) =>
   (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
   }
+
+  // Calculate the number of samples for finding splits
+  val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt
--- End diff --

I have read the note on the _sample_ function: _sample_ does not guarantee to return exactly the requested fraction of the RDD's count. Requiring _numSamples - partNumSamples_ to be non-negative seems a more efficient choice than triggering a _count_. The degree of approximation here depends on the approximation of _sample_, and admittedly the splits will be slightly inaccurate.
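To make the trade-off above concrete, here is a hedged, standalone sketch (hypothetical function name, not the actual Spark code): the fold counts each distinct non-zero value, and the zero count is inferred from the estimated sample count and clamped at zero instead of paying for an exact _count_.

```scala
// Sketch: count distinct non-zero values, then infer the zero count.
// `estimatedNumSamples` comes from an approximate sampling fraction, so the
// inferred zero count may come out negative; clamp it rather than count().
def valueCountsWithZeros(
    featureSamples: Iterable[Double],
    estimatedNumSamples: Int): Map[Double, Int] = {
  val (partValueCountMap, partNumSamples) =
    featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
      case ((m, cnt), x) => (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
    }
  val numZeros = math.max(0, estimatedNumSamples - partNumSamples)
  if (numZeros > 0) partValueCountMap + (0.0 -> numZeros) else partValueCountMap
}
```

For example, three non-zero samples against an estimate of five total samples yields two inferred zeros; an underestimate simply produces no zero entry instead of a negative count.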


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-06 Thread lucio-yz
Github user lucio-yz commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r166501185
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +996,19 @@ private[spark] object RandomForest extends Logging {
 } else {
   val numSplits = metadata.numSplits(featureIndex)

-  // get count for each distinct value
-  val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+  // get count for each distinct value except zero value
+  val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
 case ((m, cnt), x) =>
   (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
   }
+
+  // Calculate the number of samples for finding splits
+  val requiredSamples: Long = (samplesFractionForFindSplits(metadata) * metadata.numExamples.toDouble).toLong
--- End diff --

Thanks for your comments, and I'm sorry for my carelessness. I have fixed them all.


---




[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

2018-02-05 Thread lucio-yz
Github user lucio-yz commented on the issue:

https://github.com/apache/spark/pull/20472
  
I tested on 2 datasets:

1. _rcv1.binary_, which has 47,236 dimensions. Before the improvement, the shuffle write size in _findSplitsBySorting_ is 1 GB; after the improvement, it is 7.7 MB.

2. _news20.binary_, which has 1,355,191 dimensions. Before the improvement, the shuffle write size in _findSplitsBySorting_ is 51 GB; after the improvement, it is 24.1 MB.

PS: I tested on a cluster of 10 nodes.


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-04 Thread lucio-yz
Github user lucio-yz commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r165872418
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +1002,22 @@ private[spark] object RandomForest extends Logging {
 } else {
   val numSplits = metadata.numSplits(featureIndex)

-  // get count for each distinct value
-  val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+  // get count for each distinct value except zero value
+  val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
 case ((m, cnt), x) =>
   (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
   }
+
+  // Calculate the number of samples for finding splits
+  var requiredSamples: Long = math.max(metadata.maxBins * metadata.maxBins, 1)
--- End diff --

I defined a private method _samplesFractionForFindSplits_ in RandomForest.scala.
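A sketch of what such a helper could look like, based on the requiredSamples expression in the diff above (the signature is simplified to plain parameters rather than a DecisionTreeMetadata argument, so treat it as illustrative only):

```scala
// Fraction of examples to sample when finding splits: enough samples to
// resolve maxBins * maxBins distinct bin boundaries, capped at the full data.
def samplesFractionForFindSplits(maxBins: Int, numExamples: Long): Double = {
  val requiredSamples: Long = math.max(maxBins.toLong * maxBins, 1L)
  if (requiredSamples < numExamples) requiredSamples.toDouble / numExamples
  else 1.0
}
```

With the default maxBins of 32, one million examples would be sampled at a fraction of 1024 / 1,000,000; small datasets fall back to using every example.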


---




[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

2018-02-02 Thread lucio-yz
Github user lucio-yz commented on the issue:

https://github.com/apache/spark/pull/20472
  
The previous problems have been solved.


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-02 Thread lucio-yz
Github user lucio-yz commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r165595281
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1002,10 +1008,14 @@ private[spark] object RandomForest extends Logging {
   val numSplits = metadata.numSplits(featureIndex)

   // get count for each distinct value
-  val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+  var (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
--- End diff --

The problem has been solved.


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-01 Thread lucio-yz
GitHub user lucio-yz opened a pull request:

https://github.com/apache/spark/pull/20472

[SPARK-22751][ML]Improve ML RandomForest shuffle performance

## What changes were proposed in this pull request?

As I mentioned in [SPARK-22751](https://issues.apache.org/jira/browse/SPARK-22751?jql=project%20%3D%20SPARK%20AND%20component%20%3D%20ML%20AND%20text%20~%20randomforest), there is a shuffle performance problem in ML RandomForest when training an RF on high-dimensional data.

The reason is that, in org.apache.spark.ml.tree.impl.RandomForest, the function findSplitsBySorting will effectively flatMap a sparse vector into a dense vector, so the subsequent groupByKey has a huge shuffle write size.

To avoid this, we can add a filter after the flatMap to filter out zero values. In the function findSplitsForContinuousFeature, we can then infer the number of zero values by passing a parameter numInput (the number of samples) to the function.

In addition, if a feature contains only zero values, continuousSplits will not have a key for that feature id, so I added a check when using continuousSplits.
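The zero-filtering and zero-count inference described above can be sketched as follows. This is a simplified, standalone illustration with hypothetical helper names, not the actual RandomForest implementation; each sample is modeled as a map from feature index to non-zero value, standing in for a sparse vector.

```scala
// Emit only non-zero (featureIndex, value) entries, then count distinct
// values per feature. Zeros never enter the (shuffled) collection.
def countNonZeroPerFeature(
    samples: Seq[Map[Int, Double]]): Map[Int, Map[Double, Int]] = {
  samples
    .flatMap(_.toSeq)
    .filter { case (_, value) => value != 0.0 } // drop any explicit zeros
    .groupBy { case (featureIndex, _) => featureIndex }
    .map { case (featureIndex, pairs) =>
      featureIndex -> pairs.groupBy(_._2).map { case (v, ps) => v -> ps.size }
    }
}

// The zero count per feature is inferred rather than shuffled:
// numZeros = total number of samples - sum of the non-zero counts.
def inferZeroCount(totalNumSamples: Int, valueCounts: Map[Double, Int]): Int =
  math.max(0, totalNumSamples - valueCounts.values.sum)
```

The point of the design is that only non-zero entries ever reach the shuffle; the zero counts are recovered locally from the known sample total, which is why the shuffle write shrinks so dramatically on sparse data.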

## How was this patch tested?
Ran model locally using spark-submit.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lucio-yz/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20472.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20472


commit 50cb173dd34dc353c243b97f2686a8c545a03909
Author: lucio <576632108@...>
Date:   2018-02-01T09:47:52Z

fix mllib randomforest shuffle issue




---
