[ https://issues.apache.org/jira/browse/SPARK-13868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195686#comment-15195686 ]

Nan Zhu commented on SPARK-13868:
---------------------------------

FYI, we have released a solution that integrates XGBoost with Spark directly:

http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html
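In case it helps comparison here, a minimal sketch of what the integration looks like from Spark. The names follow the XGBoost4J-Spark package's Spark ML-style API, which may differ from the version described in the post; trainDF/testDF are placeholder DataFrames with "features" (vector) and "label" (double) columns, not anything from the benchmark:

{code:scala}
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// trainDF/testDF: assumed DataFrames with a "features" vector column
// and a double "label" column.
val xgb = new XGBoostClassifier(Map(
    "objective" -> "binary:logistic",
    "num_round" -> 100,
    "num_workers" -> 8))
  .setFeaturesCol("features")
  .setLabelCol("label")

val model = xgb.fit(trainDF)
val scored = model.transform(testDF)
{code}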

> Random forest accuracy exploration
> ----------------------------------
>
>                 Key: SPARK-13868
>                 URL: https://issues.apache.org/jira/browse/SPARK-13868
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Joseph K. Bradley
>
> This is a JIRA for exploring accuracy improvements for Random Forests.
> h2. Background
> Initial exploration was based on reports of poor accuracy from 
> [http://datascience.la/benchmarking-random-forest-implementations/]
> Essentially, Spark 1.2 showed poor performance relative to other libraries 
> for training set sizes of 1M and 10M.
> h3.  Initial improvements
> The biggest issue was that the benchmark metric was AUC, but Spark 1.2 was 
> producing hard 0/1 predictions rather than class probabilities, so examples 
> could not be ranked and AUC suffered.  This was fixed in [SPARK-9528], which 
> brought Spark up to accuracy parity with scikit-learn, Vowpal Wabbit, and R 
> for the 1M training set.
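> As an illustration of the fix, here is a minimal sketch against the spark.ml 
> API (the train/test DataFrames and tree settings are assumptions, not the 
> benchmark harness):
> {code:scala}
> import org.apache.spark.ml.classification.RandomForestClassifier
> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
>
> val model = new RandomForestClassifier()
>   .setNumTrees(500)
>   .fit(train)                         // train/test: assumed DataFrames
> val scored = model.transform(test)
>
> // AUC computed from class probabilities (post-SPARK-9528 behavior)...
> val aucSoft = new BinaryClassificationEvaluator()
>   .setMetricName("areaUnderROC")
>   .setRawPredictionCol("probability")
>   .evaluate(scored)
>
> // ...vs. AUC from hard 0/1 predictions, which destroys the ranking
> val aucHard = new BinaryClassificationEvaluator()
>   .setMetricName("areaUnderROC")
>   .setRawPredictionCol("prediction")
>   .evaluate(scored)
> {code}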
> h3.  Remaining issues
> For training set size 10M, Spark does not yet match the AUC of the other 2 
> libraries benchmarked (H2O and xgboost).
> Note that, on 1M instances, these 2 libraries also show better results than 
> scikit-learn, VW, and R.  I'm not too familiar with the H2O implementation 
> and how it differs, but xgboost is a very different algorithm, so it's not 
> surprising it has different behavior.
> h2. My explorations
> I've run Spark on the test set of 10M instances.  (Note that the benchmark 
> linked above used somewhat different settings for the different algorithms, 
> including gini vs. entropy impurity and limits on splitting nodes, but those 
> settings turn out not to be very important for this problem.)
> I've tried adjusting:
> * maxDepth: Past depth 20, going deeper does not seem to matter
> * maxBins: I've gone up to 500, but this too does not seem to matter.  
> However, this is a hard thing to verify since slight differences in 
> discretization could become significant in a large tree.
> h2. Current questions
> * H2O: It would be good to understand how this implementation differs from 
> standard RF implementations (in R, VW, scikit-learn, and Spark).
> * xgboost: There's a JIRA for it: [SPARK-8547].  It would be great to see the 
> Spark package linked from that JIRA tested vs. MLlib on the benchmark data 
> (or other data).  From what I've heard/read, xgboost is sometimes better, 
> sometimes worse in accuracy (but of course faster with more localized 
> training).
> * Based on the above explorations, are there changes we should make to Spark 
> RFs?



