[ https://issues.apache.org/jira/browse/SPARK-22751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-22751:
------------------------------
    Priority: Minor  (was: Major)

It looks somewhat difficult to change this at first glance, but findSplitsForContinuousFeature only ever calls foldLeft and isEmpty on the Iterable it gets from groupByKey. This suggests it can be transformed to handle one element at a time in reduceByKey with some surgery. I am not sure it helps the shuffle size though -- wouldn't it need to shuffle the same info?

> Improve ML RandomForest shuffle performance
> -------------------------------------------
>
>                 Key: SPARK-22751
>                 URL: https://issues.apache.org/jira/browse/SPARK-22751
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: lucio35
>            Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> When I try to use ML RandomForest to train a classifier with dataset
> news20.binary, which has 19,996 training examples and 1,355,191 features, I
> found that the shuffle write size (51 GB) of findSplitsBySorting is very large
> compared with the small data size (133.52 MB). I think it is useful to
> replace groupByKey with reduceByKey to improve shuffle performance.
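
To illustrate the comment above, here is a minimal sketch in plain Python (no Spark) of why a fold over a grouped Iterable can be rewritten as a per-element combine, and why that may not shrink the shuffle. It assumes, purely for illustration, that the fold builds a value-to-count map per feature; the sample data and the names `fold_grouped`, `combine_incrementally`, and `num_partitions` are hypothetical and not taken from the Spark source.

```python
from collections import Counter, defaultdict
from functools import reduce

# Hypothetical (feature_index, feature_value) pairs standing in for RDD records.
samples = [(0, 1.0), (1, 0.5), (0, 1.0), (0, 2.0), (1, 0.5), (1, 0.7)]


def fold_grouped(pairs):
    """groupByKey style: ship every record, then fold over each full group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    counts = {
        key: reduce(lambda acc, v: acc + Counter({v: 1}), values, Counter())
        for key, values in groups.items()
    }
    shuffled = len(pairs)  # one shuffled record per input pair
    return counts, shuffled


def combine_incrementally(pairs, num_partitions=2):
    """reduceByKey style: pre-aggregate per partition, then merge the partials."""
    partitions = [pairs[i::num_partitions] for i in range(num_partitions)]
    partials = []
    for part in partitions:
        local = defaultdict(Counter)
        for key, value in part:
            local[key][value] += 1  # handle one element at a time
        partials.append(local)

    merged = defaultdict(Counter)
    shuffled = 0
    for local in partials:
        for key, value_counts in local.items():
            merged[key].update(value_counts)  # Counter.update adds counts
            shuffled += len(value_counts)     # one entry per distinct value
    return dict(merged), shuffled


grouped_counts, grouped_shuffled = fold_grouped(samples)
combined_counts, combined_shuffled = combine_incrementally(samples)
print(grouped_counts == combined_counts)
print(grouped_shuffled, combined_shuffled)
```

Both functions produce the same per-feature counts. The combined version only ships fewer entries when a partition holds repeated values for the same feature; with mostly distinct continuous values the partial maps carry roughly one entry per record, which is consistent with the doubt raised in the comment about shuffle size.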