Re: Mahout 0.8 Random Forest Accuracy
Tim,

Yes, RFs are ensemble learners, but that doesn't mean you couldn't wrap them up with other classifiers to form a higher-level ensemble.

On Sat, Oct 19, 2013 at 6:48 AM, Tim Peut wrote:

> Thanks for the info and suggestions everyone.
>
> On 19 October 2013 01:00, Ted Dunning wrote:
>
> > On Fri, Oct 18, 2013 at 3:50 PM, j.barrett Strausser <
> > j.barrett.straus...@gmail.com> wrote:
> >
> > > How difficult would it be to wrap the RF classifier into an ensemble
> > > learner?
> >
> > It is callable. Should be relatively easy.
>
> I'm still becoming familiar with machine learning terminology so please
> forgive my ignorance. I thought that random forests are, by nature,
> ensemble learners? What exactly do you mean by this?
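One way to read "wrap them up with other classifiers" is stacking: the random forest (itself an ensemble of trees) becomes one base learner inside a second-level ensemble. A minimal sketch using scikit-learn rather than Mahout, purely for illustration (the dataset and all parameters here are made up):

```python
# Sketch: a random forest is already an ensemble, but it can still be
# wrapped as one base estimator inside a higher-level (stacked) ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The RF is just one base estimator; a logistic regression combines its
# out-of-fold predictions with those of another model.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```

The same shape applies to Mahout's RF in principle: since the trained forest is callable, its predictions can be fed as a feature into any second-level model.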
Re: Mahout 0.8 Random Forest Accuracy
Thanks for the info and suggestions everyone.

On 19 October 2013 01:00, Ted Dunning wrote:

> On Fri, Oct 18, 2013 at 3:50 PM, j.barrett Strausser <
> j.barrett.straus...@gmail.com> wrote:
>
> > How difficult would it be to wrap the RF classifier into an ensemble
> > learner?
>
> It is callable. Should be relatively easy.

I'm still becoming familiar with machine learning terminology so please forgive my ignorance. I thought that random forests are, by nature, ensemble learners? What exactly do you mean by this?
Re: Mahout 0.8 Random Forest Accuracy
Yes. I looked at the implementation here, and I think it is showing its age: I'm not sure Deneche had time to put in many bells and whistles at the start, and I'm not sure it's been touched much since. My limited experience is that it generally does less clever stuff than R, which in turn is less clever than sklearn et al., hence the gap in results. There are lots of ways you can do better than the original Breiman paper, which is what R mostly sticks to.

Oddly enough, I was just having a long conversation about this exact topic today, since I'm working on an RDF implementation on Hadoop. (I think it might be worth a new implementation after this much time, if one were looking to revamp RDF on Hadoop and inject some new tricks. It needs some different design choices.)

Anyway, the question was: which splits of an N-valued categorical (nominal) variable should you consider? Considering all 2^N - 2 of them is not scalable, especially since I don't want any limit on N. There are easy, fast ways to figure out which splits to consider for every other combination of categorical/numeric feature F predicting categorical/numeric target T, but I couldn't find any magic for one case: categorical F predicting categorical T.

I ended up making up a heuristic that is at least linear in N, and I wonder if anyone is a) interested in talking about this at all or b) has the magic answer here. So: sort the values of F by the entropy of T, computed over the examples with just that value of F. Then consider splits based on prefixes of that list. If F = [a, b, c, d] and in order by entropy of T they are [b, c, a, d], then consider rules like F in {b}, F in {b, c}, F in {b, c, a}. This isn't a great heuristic, but it seems to work well in practice.

I suppose it's this and a lot of other little tricks like that that could improve this or any other implementation. RDF makes speed and accuracy pretty trade-off-able, so anything that makes things faster can instead make it more accurate, or vice versa.
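The prefix heuristic described above can be sketched in a few lines (this is one reading of the description, not code from any implementation; the toy data is made up):

```python
# Heuristic for splits on a categorical feature F predicting a
# categorical target T: sort F's values by the entropy of T restricted
# to each value, then consider only splits on prefixes of that order,
# giving O(N) candidates instead of 2^N - 2.
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def candidate_splits(rows):
    """rows: list of (f_value, t_value) pairs. Returns candidate subsets of F."""
    by_value = defaultdict(list)
    for f, t in rows:
        by_value[f].append(t)
    # Order F's values by entropy of T among examples with that value.
    ordered = sorted(by_value, key=lambda v: entropy(by_value[v]))
    # Candidate rules are "F in prefix" for each non-trivial prefix.
    return [set(ordered[:k]) for k in range(1, len(ordered))]

rows = [("a", 1), ("a", 0), ("b", 1), ("b", 1), ("c", 0), ("c", 1), ("d", 0)]
for split in candidate_splits(rows):
    print(sorted(split))
```

Here "b" and "d" are pure (entropy 0) while "a" and "c" are maximally mixed, so the candidate rules come out as F in {b}, F in {b, d}, F in {b, d, a}: three candidates instead of the 2^4 - 2 = 14 possible subsets.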
Definitely an interesting topic I'd be interested to cover with anyone building RDFs now.

On Fri, Oct 18, 2013 at 7:26 PM, DeBarr, Dave wrote:

> Another difference...
>
> R's randomForest package (which RRF is based on) evaluates subsets of values
> when partitioning nominal values. [This is why it complains if there are
> more than 32 distinct values for a nominal variable.]
>
> For example, if our nominal variable has values { A, B, C, D }, the package
> will consider "in { A, C }" versus "not in { A, C }" as a partition candidate.
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> Sent: Friday, October 18, 2013 10:42 AM
> To: user@mahout.apache.org
> Subject: Re: Mahout 0.8 Random Forest Accuracy
>
> On Fri, Oct 18, 2013 at 7:48 AM, Tim Peut wrote:
>
> > Has anyone found that Mahout's random forest doesn't perform as well as
> > other implementations? If not, is there any reason why it wouldn't perform
> > as well?
>
> This is disappointing, but not entirely surprising. There has been
> considerably less effort applied to Mahout's random forest package than the
> comparable R packages.
>
> Note, particularly, that the Mahout implementation is not regularized. That
> could well be a big difference.
RE: Mahout 0.8 Random Forest Accuracy
Another difference...

R's randomForest package (which RRF is based on) evaluates subsets of values when partitioning nominal values. [This is why it complains if there are more than 32 distinct values for a nominal variable.]

For example, if our nominal variable has values { A, B, C, D }, the package will consider "in { A, C }" versus "not in { A, C }" as a partition candidate.

-----Original Message-----
From: Ted Dunning [mailto:ted.dunn...@gmail.com]
Sent: Friday, October 18, 2013 10:42 AM
To: user@mahout.apache.org
Subject: Re: Mahout 0.8 Random Forest Accuracy

On Fri, Oct 18, 2013 at 7:48 AM, Tim Peut wrote:

> Has anyone found that Mahout's random forest doesn't perform as well as
> other implementations? If not, is there any reason why it wouldn't perform
> as well?

This is disappointing, but not entirely surprising. There has been considerably less effort applied to Mahout's random forest package than the comparable R packages.

Note, particularly, that the Mahout implementation is not regularized. That could well be a big difference.
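To make the combinatorics above concrete: an N-valued nominal variable has 2^(N-1) - 1 distinct two-way partitions, which is why exhaustive evaluation stops scaling for large N (and why a 32-value cap lets each subset fit in a 32-bit mask). A small enumeration for { A, B, C, D }:

```python
# Enumerate all distinct "in S" vs "not in S" partitions of a nominal
# variable's values. Pinning the first value to the left side counts
# each unordered partition exactly once.
values = ["A", "B", "C", "D"]
n = len(values)

partitions = []
for mask in range((1 << (n - 1)) - 1):  # assign the other n-1 values by bits
    left = {values[0]} | {values[i + 1] for i in range(n - 1) if (mask >> i) & 1}
    right = set(values) - left
    partitions.append((left, right))

print(len(partitions))  # prints 7, i.e. 2^(4-1) - 1
for left, right in partitions:
    print(sorted(left), "vs", sorted(right))
```

The "in { A, C }" vs "not in { A, C }" candidate from the message appears as one of the seven partitions; at N = 32 the count is already over two billion.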
Re: Mahout 0.8 Random Forest Accuracy
On Fri, Oct 18, 2013 at 3:50 PM, j.barrett Strausser < j.barrett.straus...@gmail.com> wrote:

> How difficult would it be to wrap the RF classifier into an ensemble
> learner?

It is callable. Should be relatively easy.
Re: Mahout 0.8 Random Forest Accuracy
Just a theoretical note: accuracy isn't the best metric for a classifier. Mahout's accuracy could well be less than the comparable R result, yet it could still be the better classifier, at least according to the F1 metric.

How difficult would it be to wrap the RF classifier into an ensemble learner?

-barrett

On Fri, Oct 18, 2013 at 10:42 AM, Ted Dunning wrote:

> On Fri, Oct 18, 2013 at 7:48 AM, Tim Peut wrote:
>
> > Has anyone found that Mahout's random forest doesn't perform as well as
> > other implementations? If not, is there any reason why it wouldn't
> > perform as well?
>
> This is disappointing, but not entirely surprising. There has been
> considerably less effort applied to Mahout's random forest package than the
> comparable R packages.
>
> Note, particularly, that the Mahout implementation is not regularized. That
> could well be a big difference.

--
https://github.com/bearrito
@deepbearrito
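The accuracy-vs-F1 point above is easy to demonstrate on imbalanced data: the model with higher accuracy can be the worse classifier by F1. A toy illustration with made-up predictions:

```python
# On a 90/10 class split, a model that always predicts the majority
# class scores 0.9 accuracy but 0.0 F1; a model that actually finds
# positives can have lower accuracy yet far higher F1.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [0] * 90 + [1] * 10            # 90 negatives, 10 positives

pred_a = [0] * 100                       # always predicts the majority class
pred_b = [1] * 15 + [0] * 75 + [1] * 8 + [0] * 2  # 8 hits, 15 false alarms

print(accuracy(y_true, pred_a), f1(y_true, pred_a))  # higher accuracy, zero F1
print(accuracy(y_true, pred_b), f1(y_true, pred_b))  # lower accuracy, higher F1
```

Model A wins on accuracy (0.9 vs 0.83) while being useless as a detector; F1 reverses the ranking, which is the sense in which a lower-accuracy model can still be the better classifier.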
Re: Mahout 0.8 Random Forest Accuracy
On Fri, Oct 18, 2013 at 7:48 AM, Tim Peut wrote:

> Has anyone found that Mahout's random forest doesn't perform as well as
> other implementations? If not, is there any reason why it wouldn't perform
> as well?

This is disappointing, but not entirely surprising. There has been considerably less effort applied to Mahout's random forest package than the comparable R packages.

Note, particularly, that the Mahout implementation is not regularized. That could well be a big difference.
Mahout 0.8 Random Forest Accuracy
Hi all,

I'm using the random forest implementation in Mahout 0.8 to perform classification (org.apache.mahout.classifier.df.mapreduce.BuildForest and org.apache.mahout.classifier.df.mapreduce.TestForest). I've run the classifier multiple times with different parameters and different data splits, and consistently get accuracy of ~0.9. I've previously used R's RRF package with the exact same data and I consistently get accuracy of ~0.95, which is a fair bit higher than the Mahout results. I've been unable to figure out why the classifiers perform differently with the same data and the same parameters.

Has anyone found that Mahout's random forest doesn't perform as well as other implementations? If not, is there any reason why it wouldn't perform as well?

Cheers,
Tim
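For anyone reproducing this, the usual invocation of those classes follows the Mahout partial-implementation workflow. A sketch (jar names, file paths, and the column-descriptor string are placeholders; flag spellings are recalled from the 0.8 docs, so verify against your build's --help output):

```shell
# 1. Generate a dataset descriptor; the -d string declares column types
#    (N numeric, C categorical, L label) and must match your data.
hadoop jar mahout-core-0.8-job.jar \
  org.apache.mahout.classifier.df.tools.Describe \
  -p train.csv -f train.info -d N N N C L

# 2. Build a forest of 100 trees, selecting 5 random features per node
#    (-sl), with the partial mapreduce implementation (-p).
hadoop jar mahout-examples-0.8-job.jar \
  org.apache.mahout.classifier.df.mapreduce.BuildForest \
  -d train.csv -ds train.info -sl 5 -p -t 100 -o forest

# 3. Evaluate on held-out data; -a prints the confusion matrix and
#    accuracy, -mr runs the test as a mapreduce job.
hadoop jar mahout-examples-0.8-job.jar \
  org.apache.mahout.classifier.df.mapreduce.TestForest \
  -i test.csv -ds train.info -m forest -a -mr -o predictions
```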