Ok... I'll have to update my working plan. I could postpone the choice between
the two proper implementations (A or B) until after I finish the reference
implementation, or I could go straight to implementation B, the only one that
can handle extra-large datasets.

Implementation A: the dataset is available on all cluster nodes; each node
builds some of the forest's trees.
Implementation B: each node holds only a subset of the dataset; each node
builds a part of every tree of the forest.

What do you suggest?

And just to make sure: by 'non-parallel implementation' you mean the reference
implementation, right?

--- On Mon, 3/30/09, Ted Dunning <ted.dunn...@gmail.com> wrote:

> From: Ted Dunning <ted.dunn...@gmail.com>
> Subject: Re: [gsoc] random forests
> To: mahout-dev@lucene.apache.org
> Date: Monday, March 30, 2009, 6:49 PM
> Indeed.  And those datasets
> exist.
> 
> It is also plausible that this full data scan approach will
> fail when you
> want the forest building to take less time.
> 
> It is also plausible that a full data scan approach fails
> to improve enough
> on a non-parallel implementation.  This would happen
> if a significantly
> large fraction of the entire forest could be built on a
> single node.  That
> would happen if the CPU requirements for forest building
> are overshadowed by
> the I/O cost of scanning the data set.  This would
> imply that there is a
> small limit to the amount of parallelism that would help.
> 
> You will know much more about this after you finish the
> non-parallel
> implementation than either of us knows now.
> 
> On Mon, Mar 30, 2009 at 7:24 AM, deneche abdelhakim <a_dene...@yahoo.fr> wrote:
> 
> > There is still one case that this approach, even
> out-of-core, cannot
> > handle: very large datasets that cannot fit in the
> node hard-drive, and thus
> > must be distributed across the cluster.
> >
> 
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve
> 


