Yes I looked at the impl here, and I think it is aging, since I'm not sure Deneche had time to put in many bells or whistles at the start, and not sure it's been touched much since.
My limited experience is that it generally does less clever stuff than R, which in turn is less clever than sklearn et al. hence the gap in results. There are lots of ways you can do better than the original Breiman paper, which is what R sticks to mostly. Weirdly I was just having a long conversation about this exact topic today, since I'm working on an RDF implementation on Hadoop. (I think it might be worth a new implementation after this much time, if one were looking to revamp RDF on Hadoop and inject some new tricks. It needs some different design choices.) Anyway, the question was indeed which splits of an N-valued categorical (nominal) variable to consider? because considering all 2^N-2 of them is not scalable, especially since I don't want any limit on N. There are easy, fast ways to figure out what splits to consider for every other combination of categorical/numeric feature F predicting categorical/numeric target T, but I couldn't find any magic for one: categorical F predicting categorical T. I ended up making up a heuristic that is at least linear in N, and I wonder if anyone is a) interested in talking about this at all or b) has the magic answer here. So -- sort the values of the F by the entropy of T considered over examples for just that value of F. Then consider splits based on prefixes of that list. So if F = [a, b, c, d] and in order by entropy of T they are [b, c, a, d] then consider rules like F in {b}, F in {b,c}, F in {b,c,a}. This isn't a great heuristic but seems to work well in practice. I suppose it's this and a lot of other little tricks like that that could improve this or any other implementation -- RDF makes speed and accuracy pretty trade-off-able, so anything that makes things faster can make it instead more accurate or vice versa. Definitely an interesting topic I'd be interested to cover with anyone building RDFs now. On Fri, Oct 18, 2013 at 7:26 PM, DeBarr, Dave <deb...@mitre.org> wrote: > Another difference... > > R's randomForest package (which RRF is based on) evaluates subsets of values > when partitioning nominal values. [This is why it complains if there are > more than 32 distinct values for a nominal variable.] > > For example, if our nominal variable has values { A, B, C, D }, the package > will consider "in { A, C }" versus "not in { A, C }" as a partition candidate. > > -----Original Message----- > From: Ted Dunning [mailto:ted.dunn...@gmail.com] > Sent: Friday, October 18, 2013 10:42 AM > To: user@mahout.apache.org > Subject: Re: Mahout 0.8 Random Forest Accuracy > > On Fri, Oct 18, 2013 at 7:48 AM, Tim Peut <t...@timpeut.com> wrote: > >> Has anyone found that Mahout's random forest doesn't perform as well as >> other implementations? If not, is there any reason why it wouldn't perform >> as well? >> > > This is disappointing, but not entirely surprising. There has been > considerably less effort applied to Mahouts random forest package than the > comparable R packages. > > Note, particularly that the Mahout implementation is not regularized. That > could well be a big difference.