Re: Mahout 0.8 Random Forest Accuracy

2013-10-18 Thread Ted Dunning
On Fri, Oct 18, 2013 at 7:48 AM, Tim Peut t...@timpeut.com wrote:

 Has anyone found that Mahout's random forest doesn't perform as well as
 other implementations? If not, is there any reason why it wouldn't perform
 as well?


This is disappointing, but not entirely surprising.  There has been
considerably less effort applied to Mahout's random forest package than to the
comparable R packages.

Note, in particular, that the Mahout implementation is not regularized.  That
could well be a big difference.


Re: Mahout 0.8 Random Forest Accuracy

2013-10-18 Thread Ted Dunning
On Fri, Oct 18, 2013 at 3:50 PM, j.barrett Strausser 
j.barrett.straus...@gmail.com wrote:

 How difficult would it be to wrap the RF classifier into an ensemble
 learner?


It is callable.  Should be relatively easy.
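
For what it's worth, a purely illustrative sketch of what such a wrapper could
look like: majority voting over members that each expose a classify() call. The
Classifier interface below is hypothetical, not Mahout's actual API; in practice
each member would delegate to a trained forest (e.g. a Mahout DecisionForest).

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface standing in for "a callable classifier"; not Mahout's API.
interface Classifier {
    int classify(double[] features);   // returns a predicted label id
}

class VotingEnsemble implements Classifier {
    private final List<Classifier> members;

    VotingEnsemble(List<Classifier> members) {
        this.members = members;
    }

    @Override
    public int classify(double[] features) {
        // Tally each member's vote and return the most frequent label.
        Map<Integer, Integer> votes = new HashMap<>();
        for (Classifier member : members) {
            votes.merge(member.classify(features), 1, Integer::sum);
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        // Two stub members standing in for trained forests; the ensemble
        // returns the majority label (here, 1).
        Classifier ensemble = new VotingEnsemble(Arrays.<Classifier>asList(
            features -> 1, features -> 1, features -> 0));
        System.out.println(ensemble.classify(new double[] {0.5, 2.0}));
    }
}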


RE: Mahout 0.8 Random Forest Accuracy

2013-10-18 Thread DeBarr, Dave
Another difference...

R's randomForest package (which RRF is based on) evaluates subsets of values 
when partitioning on a nominal variable.  [This is why it complains if there are 
more than 32 distinct values for a nominal variable.]

For example, if our nominal variable has values { A, B, C, D }, the package 
will consider "in { A, C }" versus "not in { A, C }" as a partition candidate.
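
For concreteness, a small sketch (my own illustration, not R's code) of
enumerating those subset candidates with a bitmask, one bit per value, which is
consistent with the 32-level limit mentioned above and with the 2^N - 2 blow-up
in candidate subsets:

import java.util.ArrayList;
import java.util.List;

public class NominalSplits {

    // Returns each candidate "left" subset; the split is "in subset" vs. "not in subset".
    static List<List<String>> candidateSubsets(String[] values) {
        int n = values.length;
        List<List<String>> candidates = new ArrayList<>();
        // Pin the first value to the left side so each partition is enumerated once
        // ({A, C} vs. {B, D} is the same split as {B, D} vs. {A, C}).
        for (int mask = 0; mask < (1 << (n - 1)) - 1; mask++) {
            List<String> left = new ArrayList<>();
            left.add(values[0]);
            for (int i = 1; i < n; i++) {
                if ((mask & (1 << (i - 1))) != 0) {
                    left.add(values[i]);
                }
            }
            candidates.add(left);
        }
        return candidates;
    }

    public static void main(String[] args) {
        // For { A, B, C, D } this prints 7 candidates, among them [A, C] vs. its complement.
        for (List<String> left : candidateSubsets(new String[] {"A", "B", "C", "D"})) {
            System.out.println(left + " vs. complement");
        }
    }
}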

-Original Message-
From: Ted Dunning [mailto:ted.dunn...@gmail.com] 
Sent: Friday, October 18, 2013 10:42 AM
To: user@mahout.apache.org
Subject: Re: Mahout 0.8 Random Forest Accuracy

On Fri, Oct 18, 2013 at 7:48 AM, Tim Peut t...@timpeut.com wrote:

 Has anyone found that Mahout's random forest doesn't perform as well as
 other implementations? If not, is there any reason why it wouldn't perform
 as well?


This is disappointing, but not entirely surprising.  There has been
considerably less effort applied to Mahout's random forest package than to the
comparable R packages.

Note, in particular, that the Mahout implementation is not regularized.  That
could well be a big difference.


Re: Mahout 0.8 Random Forest Accuracy

2013-10-18 Thread Sean Owen
Yes, I looked at the implementation here, and I think it is showing its age:
I'm not sure Deneche had time to put in many bells and whistles at the start,
and I'm not sure it's been touched much since.

My limited experience is that it generally does less clever stuff than R,
which in turn is less clever than sklearn et al., hence the gap in results.
There are lots of ways to do better than the original Breiman paper, which is
what R mostly sticks to.

Weirdly, I was just having a long conversation about this exact topic today,
since I'm working on a random decision forest (RDF) implementation on Hadoop.
(I think it might be worth a new implementation after this much time, if one
were looking to revamp RDF on Hadoop and inject some new tricks; it needs some
different design choices.)


Anyway, the question was indeed which splits of an N-valued categorical
(nominal) variable to consider, because considering all 2^N - 2 of them is not
scalable, especially since I don't want any limit on N.

There are easy, fast ways to figure out which splits to consider for every
other combination of categorical/numeric feature F predicting
categorical/numeric target T, but I couldn't find any magic for one case:
categorical F predicting categorical T.

I ended up making up a heuristic that is at least linear in N, and I wonder
whether anyone (a) is interested in talking about this at all, or (b) has the
magic answer here.

So: sort the values of F by the entropy of T computed over just the examples
taking that value of F, then consider splits based on prefixes of that list.
If F's values are [a, b, c, d] and, ordered by entropy of T, they are
[b, c, a, d], then consider rules like F in {b}, F in {b, c}, F in {b, c, a}.

This isn't a great heuristic but seems to work well in practice.
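
For concreteness, here is a rough sketch of that heuristic (my own illustration
with made-up names and data layout, not Sean's or Mahout's code): group the
target labels by category of F, sort the categories by entropy of T, and emit
the prefix subsets as the candidate splits.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PrefixSplitHeuristic {

    // Shannon entropy (in bits) of a list of target labels.
    static double entropy(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        double h = 0.0;
        double n = labels.size();
        for (int c : counts.values()) {
            double p = c / n;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Candidate "left" subsets: prefixes of F's categories ordered by the entropy
    // of T restricted to that category. Linear in the number of categories (plus
    // the sort), instead of the exhaustive 2^N - 2 subsets.
    static List<Set<String>> candidateSplits(List<String> f, List<String> t) {
        // Group the target labels by the category of F they occur with.
        Map<String, List<String>> byCategory = new HashMap<>();
        for (int i = 0; i < f.size(); i++) {
            byCategory.computeIfAbsent(f.get(i), k -> new ArrayList<>()).add(t.get(i));
        }
        // Sort categories by per-category entropy of T.
        List<String> ordered = new ArrayList<>(byCategory.keySet());
        ordered.sort(Comparator.comparingDouble((String c) -> entropy(byCategory.get(c))));
        // Emit the prefixes {first}, {first, second}, ... (all but the full set).
        List<Set<String>> splits = new ArrayList<>();
        Set<String> prefix = new LinkedHashSet<>();
        for (int k = 0; k < ordered.size() - 1; k++) {
            prefix.add(ordered.get(k));
            splits.add(new LinkedHashSet<>(prefix));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Toy data: T is pure for categories b and d, mixed for a and c, so the
        // prefix subsets start with b and d (in some order) before adding a or c.
        List<String> f = Arrays.asList("a", "a", "b", "b", "c", "c", "d", "d");
        List<String> t = Arrays.asList("x", "y", "x", "x", "x", "y", "y", "y");
        System.out.println(candidateSplits(f, t));
    }
}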


I suppose it's this and a lot of other little tricks like it that could
improve this or any other implementation. RDF makes speed and accuracy fairly
trade-off-able, so anything that makes things faster can instead be spent on
accuracy, or vice versa.

Definitely an interesting topic, and one I'd be interested in discussing with
anyone building RDFs now.


On Fri, Oct 18, 2013 at 7:26 PM, DeBarr, Dave deb...@mitre.org wrote:
 Another difference...

 R's randomForest package (which RRF is based on) evaluates subsets of values 
 when partitioning nominal values.  [This is why it complains if there are 
 more than 32 distinct values for a nominal variable.]

 For example, if our nominal variable has values { A, B, C, D }, the package 
 will consider in { A, C } versus not in { A, C } as a partition candidate.



Re: Mahout 0.8 Random Forest Accuracy

2013-10-18 Thread Tim Peut
Thanks for the info and suggestions everyone.

On 19 October 2013 01:00, Ted Dunning ted.dunn...@gmail.com wrote:

On Fri, Oct 18, 2013 at 3:50 PM, j.barrett Strausser 
j.barrett.straus...@gmail.com wrote:

 How difficult would it be to wrap the RF classifier into an ensemble
 learner?

It is callable.  Should be relatively easy.

I'm still becoming familiar with machine learning terminology, so please
forgive my ignorance. I thought that random forests are, by nature, ensemble
learners? What exactly do you mean by this?