OK... I'll have to update my working plan. I could leave the choice between the two implementations (A or B) until after I finish the reference implementation, or I could go straight to implementation B, the only one that can handle extra-large datasets.
Implementation A: the dataset is available on all cluster nodes; each node builds some of the trees of the forest.

Implementation B: each node holds only a subset of the dataset; each node builds a part of every tree of the forest.

What do you suggest? And just to make sure: by "non-parallel implementation" you mean the reference implementation, right?

--- On Mon, 3/30/09, Ted Dunning <ted.dunn...@gmail.com> wrote:

> From: Ted Dunning <ted.dunn...@gmail.com>
> Subject: Re: [gsoc] random forests
> To: mahout-dev@lucene.apache.org
> Date: Monday, March 30, 2009, 6:49 PM
>
> Indeed. And those datasets exist.
>
> It is also plausible that this full data scan approach will fail when you
> want the forest building to take less time.
>
> It is also plausible that a full data scan approach fails to improve enough
> on a non-parallel implementation. This would happen if a significantly
> large fraction of the entire forest could be built on a single node. That
> would happen if the CPU requirements for forest building are overshadowed by
> the I/O cost of scanning the data set. This would imply that there is a
> small limit to the amount of parallelism that would help.
>
> You will know much more about this after you finish the non-parallel
> implementation than either of us knows now.
>
> On Mon, Mar 30, 2009 at 7:24 AM, deneche abdelhakim <a_dene...@yahoo.fr> wrote:
>
> > There is still one case that this approach, even out-of-core, cannot
> > handle: very large datasets that cannot fit in the node hard-drive, and thus
> > must be distributed across the cluster.
>
> --
> Ted Dunning, CTO
> DeepDyve
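P.S. For concreteness, here is a minimal sketch (hypothetical names, not Mahout code) of how the two layouts divide the work across a cluster: in A each node owns a share of the *trees* over the full dataset, in B each node owns a share of the *rows* and contributes to every tree.

```java
// Hypothetical sketch contrasting the two partitioning schemes.
// All names here are illustrative; this is not Mahout's actual API.
public class ForestPartitioning {

    // Implementation A: the full dataset sits on every node, and node i
    // grows its own share of the numTrees trees of the forest.
    static int treesBuiltByNode(int nodeId, int numNodes, int numTrees) {
        int base = numTrees / numNodes;
        int extra = nodeId < numTrees % numNodes ? 1 : 0;
        return base + extra;
    }

    // Implementation B: node i holds only rows [start, end) of the dataset
    // and computes partial statistics for every tree of the forest.
    static int[] rowsHeldByNode(int nodeId, int numNodes, int numRows) {
        int start = (int) ((long) nodeId * numRows / numNodes);
        int end = (int) ((long) (nodeId + 1) * numRows / numNodes);
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        int numNodes = 4, numTrees = 10, numRows = 1_000_003;

        // A: the per-node tree counts sum to the forest size.
        int totalTrees = 0;
        for (int n = 0; n < numNodes; n++) {
            totalTrees += treesBuiltByNode(n, numNodes, numTrees);
        }
        System.out.println("A: total trees = " + totalTrees);   // 10

        // B: the per-node row ranges tile the dataset without overlap.
        int rowsCovered = 0;
        for (int n = 0; n < numNodes; n++) {
            int[] r = rowsHeldByNode(n, numNodes, numRows);
            rowsCovered += r[1] - r[0];
        }
        System.out.println("B: rows covered = " + rowsCovered); // 1000003
    }
}
```

The sketch also hints at the trade-off Ted describes: in A, parallelism is capped by the number of trees and every node pays the full I/O cost of scanning the data; in B, the scan itself is split, which is the only layout that works when the dataset cannot fit on one node.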