Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2780#issuecomment-59590801 @chouqin Sorry for the slow response! About the RandomForestSuite failure: The change to fix the failure (maxBins) is OK with me. It is a somewhat brittle test. Good point about the first threshold being wasted. About the histogram methodâs speed: I would guess that the extra computation will not be that bad. Even if maxBins grows, I would expect the runtime of the whole algorithm to slow down as well, and the number of samples is capped at 10000. I will run some tests though to make sure. About the histogram methodâs references: The PLANET paper uses âequidepthâ histograms, citing the paper below. Looking at that paper, âequidepthâ means the same method which @manishamde implemented previously. I will look into this a little more to see if I find a match for the method you implemented. * PLANET paper: âPLANET: Massively Parallel Learning of Tree Ensembles with MapReduceâ * Paper they cite for histograms: G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In International Conference on ACM Special Interest Group on Management of Data (SIGMOD), pages 251â262, 1999. Iâll make a pass now and add comments.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org