[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

jkbradley Fri, 17 Oct 2014 17:01:27 -0700

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2780#issuecomment-59590801
  
    @chouqin  Sorry for the slow response!
    
    About the RandomForestSuite failure: The change to fix the failure 
(maxBins) is OK with me.  It is a somewhat brittle test.  Good point about the 
first threshold being wasted.
    
    About the histogram methodâs speed: I would guess that the extra 
computation will not be that bad.  Even if maxBins grows, I would expect the 
runtime of the whole algorithm to slow down as well, and the number of samples 
is capped at 10000.  I will run some tests though to make sure.
    
    About the histogram methodâs references: The PLANET paper uses 
âequidepthâ histograms, citing the paper below.  Looking at that paper, 
âequidepthâ means the same method which @manishamde implemented previously. 
 I will look into this a little more to see if I find a match for the method 
you implemented.
    * PLANET paper: âPLANET: Massively Parallel Learning of Tree Ensembles 
with MapReduceâ
    * Paper they cite for histograms: G. S. Manku, S. Rajagopalan, and B. G. 
Lindsay. Random sampling techniques for space efficient online computation of 
order statistics of large datasets. In International Conference on ACM Special 
Interest Group on Management of Data (SIGMOD), pages 251â262, 1999.
    
    Iâll make a pass now and add comments.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

Reply via email to