[ https://issues.apache.org/jira/browse/SPARK-22751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16285858#comment-16285858 ]

lucio35 commented on SPARK-22751:
---------------------------------

The reason I suggest reduceByKey is that we only need valueCountMap in 
findSplitsForContinuousFeature, and reduceByKey can combine outputs that share 
a key on each partition before shuffling the data, when we use (idx, 
point.features(idx)) as the key.
But unfortunately the type of point.features(idx) is Double, so most keys end 
up with a count of 1 (two Double values are rarely equal). 
Thus, besides using reduceByKey, we may need some method to merge part of the 
data without affecting how the splits are found.
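To illustrate the point about map-side combining, here is a minimal sketch (in plain Python, with no Spark dependency) of what reduceByKey does per partition before the shuffle, assuming (featureIndex, featureValue) keys as above. The partition data and helper names are illustrative, not Spark's actual implementation.

```python
from collections import Counter

def combine_locally(partition):
    """Map-side combine: count (feature_index, feature_value) keys
    within one partition, as reduceByKey does before shuffling."""
    return Counter(partition)

def merge(partials):
    """Reduce side: merge the per-partition count maps."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

# Two hypothetical partitions of (idx, point.features(idx)) keys.
p1 = [(0, 1.0), (0, 1.0), (1, 2.5)]
p2 = [(0, 1.0), (1, 2.5), (1, 3.0)]

combined = [combine_locally(p) for p in (p1, p2)]
# Records crossing the shuffle: one per distinct key per partition
# (5 here, instead of 6 raw records with groupByKey).
shuffled_records = sum(len(c) for c in combined)
counts = merge(combined)
print(shuffled_records)   # 5
print(counts[(0, 1.0)])   # 3
```

The catch described above also shows up here: with continuous Double feature values, most keys are distinct, so the per-partition combine saves little unless values are first merged or binned somehow.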

> Improve ML RandomForest shuffle performance
> -------------------------------------------
>
>                 Key: SPARK-22751
>                 URL: https://issues.apache.org/jira/browse/SPARK-22751
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: lucio35
>            Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> When I tried to use ML RandomForest to train a classifier on the dataset 
> news20.binary, which has 19,996 training examples and 1,355,191 features, I 
> found that the shuffle write size (51 GB) of findSplitsBySorting is very 
> large compared with the small data size (133.52 MB). I think it would be 
> useful to replace groupByKey with reduceByKey to improve shuffle performance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
