[jira] [Updated] (SPARK-22751) Improve ML RandomForest shuffle performance

Sean Owen (JIRA) Mon, 11 Dec 2017 04:05:54 -0800

     [ https://issues.apache.org/jira/browse/SPARK-22751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Sean Owen updated SPARK-22751:
------------------------------
    Priority: Minor  (was: Major)

It looks somewhat difficult to change this at first glance, but 
findSplitsForContinuousFeature only ever calls foldLeft and isEmpty on the 
Iterable it gets from groupByKey. This suggests it can be transformed to handle 
one element at a time in reduceByKey with some surgery. I am not sure it helps 
the shuffle size though -- wouldn't it need to shuffle the same info?
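
To make the question above concrete, here is a rough sketch in plain Python 
(not Spark code, and not the actual findSplitsForContinuousFeature logic). It 
assumes the per-feature fold only needs a count of each distinct value; the 
function names and the "number of shuffled records" measure are hypothetical 
stand-ins for illustration only.

```python
from collections import Counter, defaultdict


def group_then_fold(partitions):
    """groupByKey style: every (feature, value) record crosses the shuffle,
    and the per-feature fold runs only after grouping."""
    shuffled = [rec for part in partitions for rec in part]
    grouped = defaultdict(list)
    for feature, value in shuffled:
        grouped[feature].append(value)
    # Fold over each feature's values (a stand-in for foldLeft on the Iterable).
    counts = {feature: Counter(values) for feature, values in grouped.items()}
    return counts, len(shuffled)


def combine_then_merge(partitions):
    """reduceByKey style: fold one element at a time inside each partition,
    shuffle the partial results, then merge them per feature."""
    shuffled = []
    for part in partitions:
        local = defaultdict(Counter)
        for feature, value in part:
            local[feature][value] += 1
        # One (feature, value, count) entry per distinct pair in this partition.
        shuffled.extend(
            (feature, value, count)
            for feature, value_counts in local.items()
            for value, count in value_counts.items()
        )
    counts = defaultdict(Counter)
    for feature, value, count in shuffled:
        counts[feature][value] += count
    return dict(counts), len(shuffled)


# Two toy inputs, each a list of partitions of (feature, value) records.
repeated = [[(0, 1.0), (0, 1.0), (1, 2.0)], [(0, 1.0), (1, 2.0), (1, 2.0)]]
distinct = [[(0, 1.0), (0, 2.0)], [(0, 3.0), (0, 4.0)]]
```

Under these assumptions both functions return the same per-feature counts. The 
combining version ships fewer records only when a (feature, value) pair repeats 
within a partition, as in `repeated`; when every value is distinct, as in 
`distinct`, it ships just as many, which is the concern raised above.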

> Improve ML RandomForest shuffle performance
> -------------------------------------------
>
>                 Key: SPARK-22751
>                 URL: https://issues.apache.org/jira/browse/SPARK-22751
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: lucio35
>            Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> When I try to use ML RandomForest to train a classifier on the dataset 
> news20.binary, which has 19,996 training examples and 1,355,191 features, I 
> found that the shuffle write size (51 GB) of findSplitsBySorting is very large 
> compared with the small data size (133.52 MB). I think it would be useful to 
> replace groupByKey with reduceByKey to improve shuffle performance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
