[ https://issues.apache.org/jira/browse/SPARK-21624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-21624:
------------------------------------

    Assignee: Apache Spark

> Optimize communication cost of RF/GBT/DT
> ----------------------------------------
>
>                 Key: SPARK-21624
>                 URL: https://issues.apache.org/jira/browse/SPARK-21624
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 2.3.0
>            Reporter: Peng Meng
>            Assignee: Apache Spark
>
> {quote}The implementation of RF is bound by either the cost of statistics
> computation on workers or by communicating the sufficient statistics.{quote}
> The statistics are stored in allStats:
> {code:java}
>   /**
>    * Flat array of elements.
>    * Index for start of stats for a (feature, bin) is:
>    *   index = featureOffsets(featureIndex) + binIndex * statsSize
>    */
>   private var allStats: Array[Double] = new Array[Double](allStatsSize)
> {code}
> allStats may be very large, and it can be very sparse, especially
> on the nodes near the leaves of the tree.
> I have changed allStats from Array to SparseVector; in my tests this
> reduces communication by about 50%.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
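
For readers following the issue, the indexing scheme in the quoted comment and the dense-versus-sparse trade-off can be sketched in plain Scala. This is only an illustration under stated assumptions, not the patch proposed in SPARK-21624: the object name SparseStatsSketch, the helper names statsIndex, toSparse and toDense, the way featureOffsets is built, and the toy sizes are all hypothetical, and a simple (indices, values) pair stands in for Spark's SparseVector.

```scala
// Hypothetical sketch; not the change proposed in SPARK-21624.
object SparseStatsSketch {

  // Start index of the stats for a (feature, bin), following the formula
  // in the comment quoted in the issue.
  def statsIndex(featureOffsets: Array[Int], featureIndex: Int,
                 binIndex: Int, statsSize: Int): Int =
    featureOffsets(featureIndex) + binIndex * statsSize

  // Keep only the non-zero entries as parallel (indices, values) arrays.
  def toSparse(dense: Array[Double]): (Array[Int], Array[Double]) = {
    val indices = dense.indices.filter(i => dense(i) != 0.0).toArray
    (indices, indices.map(i => dense(i)))
  }

  // Rebuild the dense array from its sparse form on the receiving side.
  def toDense(size: Int, indices: Array[Int],
              values: Array[Double]): Array[Double] = {
    val dense = new Array[Double](size)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }

  def main(args: Array[String]): Unit = {
    val statsSize = 3          // assumed: three statistics per bin
    val numBins = Array(4, 2)  // assumed: bins per feature
    // Assumed layout: featureOffsets(i) is the start of feature i's block,
    // and the last entry is the total array size.
    val featureOffsets = numBins.scanLeft(0)((acc, n) => acc + n * statsSize)
    val allStats = new Array[Double](featureOffsets.last)

    // A node near the leaves sees few samples, so few bins are touched.
    val start = statsIndex(featureOffsets, featureIndex = 1, binIndex = 1, statsSize)
    allStats(start) = 2.0
    allStats(start + 1) = 1.0

    val (indices, values) = toSparse(allStats)
    println(s"dense entries: ${allStats.length}, non-zero entries: ${values.length}")
    assert(toDense(allStats.length, indices, values).sameElements(allStats))
  }
}
```

In this encoding each non-zero entry costs an Int index in addition to its Double value, so a sparse form only shrinks the payload when most entries are zero. That matches the issue's remark that sparsity is highest on nodes near the leaves.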