[ https://issues.apache.org/jira/browse/SPARK-14606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley updated SPARK-14606:
--------------------------------------
    Affects Version/s:     (was: 1.6.1)
                           (was: 1.5.2)
                           (was: 1.6.0)

> Different maxBins value for categorical and continuous features in RandomForest implementation.
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-14606
>                 URL: https://issues.apache.org/jira/browse/SPARK-14606
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>            Reporter: Rahul Tanwani
>            Priority: Minor
>
> Currently the RandomForest algorithm takes a single maxBins value to decide
> the number of splits to consider. This can make training very slow when a
> single categorical column has a large number of unique values. That one
> column affects all the numeric (continuous) columns, even though they do not
> need such a high number of splits.
> Encoding the categorical column into features makes the data very wide, which
> requires increasing maxMemoryInMB and puts more pressure on the GC as well.
> Keeping separate maxBins values for categorical and continuous features
> should be useful in this regard.
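
To make the cost described in the issue concrete, here is a minimal back-of-the-envelope sketch in plain Python. It is not Spark code: the function names are hypothetical, and it assumes a simplified model in which a shared maxBins must be at least the largest categorical arity and each continuous feature gets at most maxBins - 1 candidate thresholds (Spark additionally caps bins by the number of training rows, which is ignored here).

```python
def required_max_bins(categorical_arities, requested_max_bins):
    """Smallest shared maxBins that still covers every categorical feature.

    Assumption: a single maxBins must be >= the number of categories of
    any categorical feature, so the widest column dictates the value.
    """
    return max([requested_max_bins, *categorical_arities])


def continuous_split_budget(num_continuous, max_bins):
    """Upper bound on candidate thresholds across all continuous features.

    Assumption: each continuous feature is binned into at most max_bins
    bins, which yields at most max_bins - 1 candidate split points.
    """
    return num_continuous * (max_bins - 1)


# Hypothetical dataset: 50 continuous columns, one categorical column
# with 1000 distinct values, and a user who only wanted 32 bins.
shared = required_max_bins([1000], 32)
with_shared_setting = continuous_split_budget(50, shared)
with_separate_setting = continuous_split_budget(50, 32)

print(shared, with_shared_setting, with_separate_setting)
```

Under these assumptions the single high-cardinality column raises the split budget for the continuous columns from 1550 to 49950 candidate thresholds, which is the overhead a separate maxBins for continuous features would avoid.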