[ https://issues.apache.org/jira/browse/SPARK-13868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195686#comment-15195686 ]
Nan Zhu commented on SPARK-13868:
---------------------------------

FYI, we released a solution to integrate XGBoost with Spark directly: http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html

> Random forest accuracy exploration
> ----------------------------------
>
>                 Key: SPARK-13868
>                 URL: https://issues.apache.org/jira/browse/SPARK-13868
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Joseph K. Bradley
>
> This is a JIRA for exploring accuracy improvements for Random Forests.
> h2. Background
> Initial exploration was based on reports of poor accuracy from
> [http://datascience.la/benchmarking-random-forest-implementations/]
> Essentially, Spark 1.2 showed poor performance relative to other libraries
> for training set sizes of 1M and 10M.
> h3. Initial improvements
> The biggest issue was that the metric being used was AUC and Spark 1.2 was
> using hard predictions, not class probabilities. This was fixed in
> [SPARK-9528], which brought Spark up to performance parity with
> scikit-learn, Vowpal Wabbit, and R for the training set size of 1M.
> h3. Remaining issues
> For training set size 10M, Spark does not yet match the AUC of the other 2
> libraries benchmarked (H2O and xgboost).
> Note that, on 1M instances, these 2 libraries also show better results than
> scikit-learn, VW, and R. I'm not too familiar with the H2O implementation
> and how it differs, but xgboost is a very different algorithm, so it's not
> surprising that it behaves differently.
> h2. My explorations
> I've run Spark on the test set of 10M instances. (Note that the benchmark
> linked above used somewhat different settings for the different algorithms,
> but those settings are actually not that important for this problem. These
> settings included gini vs. entropy impurity and limits on splitting nodes.)
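A side note on the [SPARK-9528] fix described above: AUC measures ranking quality, so computing it from thresholded 0/1 predictions discards the ordering information that class probabilities carry. A minimal sketch in plain Python (no Spark dependency; the labels, probabilities, and the 0.5 threshold are made up for illustration, not taken from the benchmark):

```python
# Sketch: AUC from class probabilities vs. AUC from hard 0/1 predictions.
# Thresholding collapses the ranking, which can only lower (or tie) AUC.

def auc(labels, scores):
    """ROC AUC as the probability that a random positive example
    is scored above a random negative one (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]          # hypothetical P(class=1)
hard = [1 if p >= 0.5 else 0 for p in probs]    # thresholded predictions

print(auc(labels, probs))  # ranking information preserved
print(auc(labels, hard))   # ranking collapsed to two values, AUC drops
```

The same effect explains the Spark 1.2 benchmark numbers: the model's ranking was fine, but evaluating AUC on hard predictions made it look worse than libraries that exposed probabilities.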
> I've tried adjusting:
> * maxDepth: Past depth 20, going deeper does not seem to matter.
> * maxBins: I've gone up to 500, but this too does not seem to matter.
> However, this is a hard thing to verify since slight differences in
> discretization could become significant in a large tree.
> h2. Current questions
> * H2O: It would be good to understand how this implementation differs from
> standard RF implementations (in R, VW, scikit-learn, and Spark).
> * xgboost: There's a JIRA for it: [SPARK-8547]. It would be great to see the
> Spark package linked from that JIRA tested vs. MLlib on the benchmark data
> (or other data). From what I've heard/read, xgboost is sometimes better and
> sometimes worse in accuracy (but of course faster, with more localized
> training).
> * Based on the above explorations, are there changes we should make to Spark
> RFs?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)