Re: Mahout 0.8 Random Forest Accuracy
Tim,

Yes, RFs are ensemble learners, but that doesn't mean you couldn't wrap them up with other classifiers to form a higher-level ensemble.

On Sat, Oct 19, 2013 at 6:48 AM, Tim Peut wrote:

> Thanks for the info and suggestions everyone.
>
> On 19 October 2013 01:00, Ted Dunning wrote:
>
> > On Fri, Oct 18, 2013 at 3:50 PM, j.barrett Strausser <
> > j.barrett.straus...@gmail.com> wrote:
> >
> > > How difficult would it be to wrap the RF classifier into an ensemble
> > > learner?
> >
> > It is callable. Should be relatively easy.
>
> I'm still becoming familiar with machine learning terminology so please
> forgive my ignorance. I thought that random forests are, by nature,
> ensemble learners? What exactly do you mean by this?
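One way to read "wrap them up with other classifiers" is stacking: the random forest (itself an ensemble of trees) becomes one base learner inside a second-level ensemble. A minimal sketch using scikit-learn rather than Mahout, purely for illustration (the dataset and all parameters here are made up):

```python
# Sketch: a random forest is already an ensemble, but it can still be
# wrapped as one base estimator inside a higher-level (stacked) ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The RF is just one base estimator; a logistic regression combines its
# out-of-fold predictions with those of another model.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```

The same shape applies to Mahout's RF in principle: since the trained forest is callable, its predictions can be fed as a feature into any second-level model.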
Re: Mahout 0.8 Random Forest Accuracy
Thanks for the info and suggestions everyone.

On 19 October 2013 01:00, Ted Dunning wrote:

> On Fri, Oct 18, 2013 at 3:50 PM, j.barrett Strausser <
> j.barrett.straus...@gmail.com> wrote:
>
> > How difficult would it be to wrap the RF classifier into an ensemble
> > learner?
>
> It is callable. Should be relatively easy.

I'm still becoming familiar with machine learning terminology so please forgive my ignorance. I thought that random forests are, by nature, ensemble learners? What exactly do you mean by this?
Re: Mahout 0.8 Random Forest Accuracy
Yes. I looked at the implementation here, and I think it is showing its age: I'm not sure Deneche had time to put in many bells and whistles at the start, and I'm not sure it's been touched much since. My limited experience is that it generally does less clever stuff than R, which in turn is less clever than sklearn et al., hence the gap in results. There are lots of ways you can do better than the original Breiman paper, which is what R mostly sticks to.

Oddly enough, I was just having a long conversation about this exact topic today, since I'm working on an RDF implementation on Hadoop. (I think it might be worth a new implementation after this much time, if one were looking to revamp RDF on Hadoop and inject some new tricks. It needs some different design choices.)

Anyway, the question was: which splits of an N-valued categorical (nominal) variable should you consider? Considering all 2^N - 2 of them is not scalable, especially since I don't want any limit on N. There are easy, fast ways to figure out which splits to consider for every other combination of categorical/numeric feature F predicting categorical/numeric target T, but I couldn't find any magic for one case: categorical F predicting categorical T.

I ended up making up a heuristic that is at least linear in N, and I wonder if anyone is a) interested in talking about this at all or b) has the magic answer here. So: sort the values of F by the entropy of T, computed over the examples with just that value of F. Then consider splits based on prefixes of that list. If F = [a, b, c, d] and in order by entropy of T they are [b, c, a, d], then consider rules like F in {b}, F in {b, c}, F in {b, c, a}. This isn't a great heuristic, but it seems to work well in practice.

I suppose it's this and a lot of other little tricks like that that could improve this or any other implementation. RDF makes speed and accuracy pretty trade-off-able, so anything that makes things faster can instead make it more accurate, or vice versa.
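The prefix heuristic described above can be sketched in a few lines (this is one reading of the description, not code from any implementation; the toy data is made up):

```python
# Heuristic for splits on a categorical feature F predicting a
# categorical target T: sort F's values by the entropy of T restricted
# to each value, then consider only splits on prefixes of that order,
# giving O(N) candidates instead of 2^N - 2.
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def candidate_splits(rows):
    """rows: list of (f_value, t_value) pairs. Returns candidate subsets of F."""
    by_value = defaultdict(list)
    for f, t in rows:
        by_value[f].append(t)
    # Order F's values by entropy of T among examples with that value.
    ordered = sorted(by_value, key=lambda v: entropy(by_value[v]))
    # Candidate rules are "F in prefix" for each non-trivial prefix.
    return [set(ordered[:k]) for k in range(1, len(ordered))]

rows = [("a", 1), ("a", 0), ("b", 1), ("b", 1), ("c", 0), ("c", 1), ("d", 0)]
for split in candidate_splits(rows):
    print(sorted(split))
```

Here "b" and "d" are pure (entropy 0) while "a" and "c" are maximally mixed, so the candidate rules come out as F in {b}, F in {b, d}, F in {b, d, a}: three candidates instead of the 2^4 - 2 = 14 possible subsets.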
Definitely an interesting topic I'd be interested to cover with anyone building RDFs now.

On Fri, Oct 18, 2013 at 7:26 PM, DeBarr, Dave wrote:

> Another difference...
>
> R's randomForest package (which RRF is based on) evaluates subsets of values
> when partitioning nominal values. [This is why it complains if there are
> more than 32 distinct values for a nominal variable.]
>
> For example, if our nominal variable has values { A, B, C, D }, the package
> will consider "in { A, C }" versus "not in { A, C }" as a partition candidate.
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> Sent: Friday, October 18, 2013 10:42 AM
> To: user@mahout.apache.org
> Subject: Re: Mahout 0.8 Random Forest Accuracy
>
> On Fri, Oct 18, 2013 at 7:48 AM, Tim Peut wrote:
>
> > Has anyone found that Mahout's random forest doesn't perform as well as
> > other implementations? If not, is there any reason why it wouldn't perform
> > as well?
>
> This is disappointing, but not entirely surprising. There has been
> considerably less effort applied to Mahout's random forest package than the
> comparable R packages.
>
> Note, particularly, that the Mahout implementation is not regularized. That
> could well be a big difference.
RE: Mahout 0.8 Random Forest Accuracy
Another difference...

R's randomForest package (which RRF is based on) evaluates subsets of values when partitioning nominal values. [This is why it complains if there are more than 32 distinct values for a nominal variable.]

For example, if our nominal variable has values { A, B, C, D }, the package will consider "in { A, C }" versus "not in { A, C }" as a partition candidate.

-----Original Message-----
From: Ted Dunning [mailto:ted.dunn...@gmail.com]
Sent: Friday, October 18, 2013 10:42 AM
To: user@mahout.apache.org
Subject: Re: Mahout 0.8 Random Forest Accuracy

On Fri, Oct 18, 2013 at 7:48 AM, Tim Peut wrote:

> Has anyone found that Mahout's random forest doesn't perform as well as
> other implementations? If not, is there any reason why it wouldn't perform
> as well?

This is disappointing, but not entirely surprising. There has been considerably less effort applied to Mahout's random forest package than the comparable R packages.

Note, particularly, that the Mahout implementation is not regularized. That could well be a big difference.
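To make the combinatorics above concrete: an N-valued nominal variable has 2^(N-1) - 1 distinct two-way partitions, which is why exhaustive evaluation stops scaling for large N (and why a 32-value cap lets each subset fit in a 32-bit mask). A small enumeration for { A, B, C, D }:

```python
# Enumerate all distinct "in S" vs "not in S" partitions of a nominal
# variable's values. Pinning the first value to the left side counts
# each unordered partition exactly once.
values = ["A", "B", "C", "D"]
n = len(values)

partitions = []
for mask in range((1 << (n - 1)) - 1):  # assign the other n-1 values by bits
    left = {values[0]} | {values[i + 1] for i in range(n - 1) if (mask >> i) & 1}
    right = set(values) - left
    partitions.append((left, right))

print(len(partitions))  # prints 7, i.e. 2^(4-1) - 1
for left, right in partitions:
    print(sorted(left), "vs", sorted(right))
```

The "in { A, C }" vs "not in { A, C }" candidate from the message appears as one of the seven partitions; at N = 32 the count is already over two billion.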
Re: Mahout 0.8 Random Forest Accuracy
On Fri, Oct 18, 2013 at 3:50 PM, j.barrett Strausser < j.barrett.straus...@gmail.com> wrote:

> How difficult would it be to wrap the RF classifier into an ensemble
> learner?

It is callable. Should be relatively easy.
Re: Mahout 0.8 Random Forest Accuracy
Just a theoretical note: accuracy isn't the best metric for a classifier. Mahout's accuracy could well be less than the comparable R result, yet it could still be the better classifier, at least according to the F1 metric.

How difficult would it be to wrap the RF classifier into an ensemble learner?

-barrett

On Fri, Oct 18, 2013 at 10:42 AM, Ted Dunning wrote:

> On Fri, Oct 18, 2013 at 7:48 AM, Tim Peut wrote:
>
> > Has anyone found that Mahout's random forest doesn't perform as well as
> > other implementations? If not, is there any reason why it wouldn't
> > perform as well?
>
> This is disappointing, but not entirely surprising. There has been
> considerably less effort applied to Mahout's random forest package than the
> comparable R packages.
>
> Note, particularly, that the Mahout implementation is not regularized. That
> could well be a big difference.

--
https://github.com/bearrito
@deepbearrito
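The accuracy-vs-F1 point above is easy to demonstrate on imbalanced data: the model with higher accuracy can be the worse classifier by F1. A toy illustration with made-up predictions:

```python
# On a 90/10 class split, a model that always predicts the majority
# class scores 0.9 accuracy but 0.0 F1; a model that actually finds
# positives can have lower accuracy yet far higher F1.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [0] * 90 + [1] * 10            # 90 negatives, 10 positives

pred_a = [0] * 100                       # always predicts the majority class
pred_b = [1] * 15 + [0] * 75 + [1] * 8 + [0] * 2  # 8 hits, 15 false alarms

print(accuracy(y_true, pred_a), f1(y_true, pred_a))  # higher accuracy, zero F1
print(accuracy(y_true, pred_b), f1(y_true, pred_b))  # lower accuracy, higher F1
```

Model A wins on accuracy (0.9 vs 0.83) while being useless as a detector; F1 reverses the ranking, which is the sense in which a lower-accuracy model can still be the better classifier.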
Re: Mahout 0.8 Random Forest Accuracy
On Fri, Oct 18, 2013 at 7:48 AM, Tim Peut wrote:

> Has anyone found that Mahout's random forest doesn't perform as well as
> other implementations? If not, is there any reason why it wouldn't perform
> as well?

This is disappointing, but not entirely surprising. There has been considerably less effort applied to Mahout's random forest package than the comparable R packages.

Note, particularly, that the Mahout implementation is not regularized. That could well be a big difference.
Mahout 0.8 Random Forest Accuracy
Hi all,

I'm using the random forest implementation in Mahout 0.8 to perform classification (org.apache.mahout.classifier.df.mapreduce.BuildForest and org.apache.mahout.classifier.df.mapreduce.TestForest). I've run the classifier multiple times with different parameters and different data splits, and consistently get accuracy of ~0.9. I've previously used R's RRF package with the exact same data and I consistently get accuracy of ~0.95, which is a fair bit higher than the Mahout results. I've been unable to figure out why the classifiers perform differently with the same data and the same parameters.

Has anyone found that Mahout's random forest doesn't perform as well as other implementations? If not, is there any reason why it wouldn't perform as well?

Cheers,
Tim
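For anyone reproducing this, the usual invocation of those classes follows the Mahout partial-implementation workflow. A sketch (jar names, file paths, and the column-descriptor string are placeholders; flag spellings are recalled from the 0.8 docs, so verify against your build's --help output):

```shell
# 1. Generate a dataset descriptor; the -d string declares column types
#    (N numeric, C categorical, L label) and must match your data.
hadoop jar mahout-core-0.8-job.jar \
  org.apache.mahout.classifier.df.tools.Describe \
  -p train.csv -f train.info -d N N N C L

# 2. Build a forest of 100 trees, selecting 5 random features per node
#    (-sl), with the partial mapreduce implementation (-p).
hadoop jar mahout-examples-0.8-job.jar \
  org.apache.mahout.classifier.df.mapreduce.BuildForest \
  -d train.csv -ds train.info -sl 5 -p -t 100 -o forest

# 3. Evaluate on held-out data; -a prints the confusion matrix and
#    accuracy, -mr runs the test as a mapreduce job.
hadoop jar mahout-examples-0.8-job.jar \
  org.apache.mahout.classifier.df.mapreduce.TestForest \
  -i test.csv -ds train.info -m forest -a -mr -o predictions
```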