I have two answers for you. The first is that for any given application, the odds that the data will not fit on a single machine are small, especially if you have an out-of-core tree builder. Really, really big datasets are increasingly common, but they are still a small minority of all datasets.
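To make the out-of-core idea concrete, here is a rough Java sketch of what I have in mind. It is hypothetical, not Mahout's actual API; the names InstanceSource and StreamingTreeBuilder are invented for illustration. The point is only that a builder which grows a tree level by level, making one sequential pass over the data per level, needs to keep just per-node split statistics in memory rather than the whole dataset.

import java.io.IOException;

// Hypothetical data source that can be scanned sequentially any number of
// times (e.g. backed by a local file or HDFS), so the dataset never has to
// fit in memory at once.
interface InstanceSource {
  Iterable<double[]> openPass() throws IOException;
}

// Hypothetical out-of-core builder: one full pass over the data per tree
// level, accumulating only per-node statistics in memory.
class StreamingTreeBuilder {
  Object build(InstanceSource source, int maxDepth) throws IOException {
    for (int depth = 0; depth < maxDepth; depth++) {
      for (double[] instance : source.openPass()) {
        // accumulate split statistics for the current frontier nodes here
      }
      // choose the best split for each frontier node, then start the next pass
    }
    return null; // placeholder for the finished tree structure
  }
}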
The second answer is that the odds that SOME Mahout application will be too large for a single node are quite high. These aren't contradictory; they just describe the long-tail nature of problem sizes.

One question I have about your plan is whether your step (1) involves building trees or forests only from data held in memory, or whether it can be adapted to stream through the data (possibly several times). If a streaming implementation is viable, then performance may well still be quite good for small datasets due to buffering. If streaming works, then a single node will be able to handle very large datasets but will just be kind of slow. As you point out, that can be remedied trivially.

Another way to put this is that the key question is how single-node computation scales with input size. If the scaling is relatively linear with data size, then your approach (3) will work no matter the data size. If scaling shows an evil memory-size effect, then your approach (2) would be required for large datasets.

On Sat, Mar 28, 2009 at 8:14 AM, deneche abdelhakim <a_dene...@yahoo.fr> wrote:

> My question is: when Mahout.RF will be used in a real application, what
> are the odds that the dataset will be so large that it can't fit on every
> machine of the cluster?
>
> The answer to this question should help me decide which implementation I'll
> choose.

--
Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
408-773-0110 ext. 738
858-414-0013 (m)
408-773-0220 (fax)