Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2780#issuecomment-59590801
  
    @chouqin  Sorry for the slow response!
    
    About the RandomForestSuite failure: The change to fix the failure 
(maxBins) is OK with me.  It is a somewhat brittle test.  Good point about the 
first threshold being wasted.
    
    About the histogram method’s speed: I would guess that the extra 
computation will not be that bad.  Even if maxBins grows, I would expect the 
runtime of the whole algorithm to slow down as well, and the number of samples 
is capped at 10000.  I will run some tests though to make sure.
    
    About the histogram method’s references: The PLANET paper uses 
“equidepth” histograms, citing the paper below.  Looking at that paper, 
“equidepth” means the same method which @manishamde implemented previously. 
 I will look into this a little more to see if I find a match for the method 
you implemented.
    * PLANET paper: “PLANET: Massively Parallel Learning of Tree Ensembles 
with MapReduce”
    * Paper they cite for histograms: G. S. Manku, S. Rajagopalan, and B. G. 
Lindsay. Random sampling techniques for space efficient online computation of 
order statistics of large datasets. In International Conference on ACM Special 
Interest Group on Management of Data (SIGMOD), pages 251–262, 1999.
    
    I’ll make a pass now and add comments.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to