[
https://issues.apache.org/jira/browse/SPARK-13868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195686#comment-15195686
]
Nan Zhu commented on SPARK-13868:
---------------------------------
FYI, we released a solution to integrate XGBoost with Spark directly:
http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html
> Random forest accuracy exploration
> ----------------------------------
>
> Key: SPARK-13868
> URL: https://issues.apache.org/jira/browse/SPARK-13868
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley
>
> This is a JIRA for exploring accuracy improvements for Random Forests.
> h2. Background
> Initial exploration was based on reports of poor accuracy from
> [http://datascience.la/benchmarking-random-forest-implementations/]
> Essentially, Spark 1.2 showed poor performance relative to other libraries
> for training set sizes of 1M and 10M.
> h3. Initial improvements
> The biggest issue was that the metric being used was AUC and Spark 1.2 was
> using hard predictions, not class probabilities. This was fixed in
> [SPARK-9528], and that brought Spark up to performance parity with
> scikit-learn, Vowpal Wabbit, and R for the training set size of 1M.
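As a minimal, self-contained illustration of the SPARK-9528 fix described above — AUC computed from hard 0/1 predictions understates ranking quality compared to AUC computed from class probabilities — here is a hedged sketch. All numbers are made up for demonstration; this is not Spark code.

```python
# Sketch: rank-based AUC from probabilities vs. from thresholded predictions.
# Data and threshold are hypothetical, chosen only to show the effect.

def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as 0.5 (the standard rank-based definition)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
probs  = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]        # hypothetical class-1 probabilities
hard   = [1 if p >= 0.5 else 0 for p in probs]  # hard predictions at threshold 0.5

print(auc(labels, probs))  # ranking information preserved
print(auc(labels, hard))   # ties at 0 and 1 collapse the ranking, lowering AUC
```

Thresholding collapses all positives-above-0.5 and negatives-below-0.5 into ties, which is why the hard-prediction AUC comes out lower on the same model outputs.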
> h3. Remaining issues
> For training set size 10M, Spark does not yet match the AUC of the other 2
> libraries benchmarked (H2O and xgboost).
> Note that, on 1M instances, these 2 libraries also show better results than
> scikit-learn, VW, and R. I'm not too familiar with the H2O implementation
> and how it differs, but xgboost is a very different algorithm, so it's not
> surprising it has different behavior.
> h2. My explorations
> I've run Spark on the test set of 10M instances. (Note that the benchmark
> linked above used somewhat different settings for the different algorithms,
> e.g., gini vs. entropy impurity and limits on splitting nodes, but those
> settings are actually not that important for this problem.)
> I've tried adjusting:
> * maxDepth: Past depth 20, going deeper does not seem to matter
> * maxBins: I've gone up to 500, but this too does not seem to matter.
> However, this is a hard thing to verify since slight differences in
> discretization could become significant in a large tree.
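To make the maxBins point above concrete, here is a purely hypothetical, stdlib-only sketch of quantile-style discretization (not Spark's actual implementation): the bin count caps how many candidate split thresholds a continuous feature gets, which is why raising it can shift thresholds slightly, and why such shifts are hard to trace through a deep tree.

```python
# Hypothetical sketch of maxBins-style binning: continuous feature values are
# reduced to at most max_bins - 1 candidate split thresholds at approximate
# quantiles. Function name and details are illustrative assumptions.

def candidate_thresholds(values, max_bins):
    """Return up to max_bins - 1 distinct thresholds at approximate quantiles."""
    s = sorted(values)
    n = len(s)
    thresholds = []
    for i in range(1, max_bins):
        idx = i * n // max_bins  # approximate i-th quantile position
        if idx < n:
            t = s[idx]
            if not thresholds or t != thresholds[-1]:  # deduplicate
                thresholds.append(t)
    return thresholds

vals = [0.1, 0.2, 0.25, 0.3, 0.5, 0.55, 0.7, 0.9]
print(candidate_thresholds(vals, 4))   # coarse: only a few candidate splits
print(candidate_thresholds(vals, 32))  # fine: roughly one threshold per value
```

With few bins the tree can only split at a handful of thresholds; with many bins nearly every observed value becomes a candidate, so further increases (e.g., past 500) stop changing the available splits much.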
> h2. Current questions
> * H2O: It would be good to understand how this implementation differs from
> standard RF implementations (in R, VW, scikit-learn, and Spark).
> * xgboost: There's a JIRA for it: [SPARK-8547]. It would be great to see the
> Spark package linked from that JIRA tested vs. MLlib on the benchmark data
> (or other data). From what I've heard/read, xgboost is sometimes better,
> sometimes worse in accuracy (but of course faster with more localized
> training).
> * Based on the above explorations, are there changes we should make to Spark
> RFs?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)