I have two answers for you. The first is that for any given application, the odds that the data will not fit on a single machine are small, especially if you have an out-of-core tree builder. Really, really big datasets are increasingly common, but they are still a small minority of all datasets.
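To make the out-of-core idea concrete, here is a rough Java sketch of what I have in mind. It is hypothetical, not Mahout's actual API; the names InstanceSource and StreamingTreeBuilder are invented for illustration. The point is only that a builder which grows a tree level by level, making one sequential pass over the data per level, needs to keep just per-node split statistics in memory rather than the whole dataset.

import java.io.IOException;

// Hypothetical data source that can be scanned sequentially any number of
// times (e.g. backed by a local file or HDFS), so the dataset never has to
// fit in memory at once.
interface InstanceSource {
  Iterable<double[]> openPass() throws IOException;
}

// Hypothetical out-of-core builder: one full pass over the data per tree
// level, accumulating only per-node statistics in memory.
class StreamingTreeBuilder {
  Object build(InstanceSource source, int maxDepth) throws IOException {
    for (int depth = 0; depth < maxDepth; depth++) {
      for (double[] instance : source.openPass()) {
        // accumulate split statistics for the current frontier nodes here
      }
      // choose the best split for each frontier node, then start the next pass
    }
    return null; // placeholder for the finished tree structure
  }
}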
The second answer is that the odds that SOME Mahout application will be too large for a single node are quite high. These aren't contradictory; they just describe the long-tail nature of problem sizes.

One question I have about your plan is whether your step (1) involves building trees or forests only from data held in memory, or whether it can be adapted to stream through the data (possibly several times). If a streaming implementation is viable, then performance may well still be quite good for small datasets due to buffering. If streaming works, then a single node will be able to handle very large datasets but will just be kind of slow. As you point out, that can be remedied trivially.

Another way to put this is that the key question is how single-node computation scales with input size. If the scaling is relatively linear with data size, then your approach (3) will work no matter the data size. If scaling shows an evil memory-size effect, then your approach (2) would be required for large datasets.

On Sat, Mar 28, 2009 at 8:14 AM, deneche abdelhakim <a_dene...@yahoo.fr> wrote:

> My question is: when Mahout.RF will be used in a real application, what
> are the odds that the dataset will be so large that it can't fit on every
> machine of the cluster?
>
> The answer to this question should help me decide which implementation I'll
> choose.

--
Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
408-773-0110 ext. 738
858-414-0013 (m)
408-773-0220 (fax)