That sounds like a horrid amount of work to do something simple. Is there a Hadoop implementation of a master-workers problem you can point me to?

On Mar 7, 2013 9:57 PM, "Ted Dunning" <[email protected]> wrote:
> On Thu, Mar 7, 2013 at 6:25 AM, Andy Twigg <[email protected]> wrote:
>
> > ... Right now what we have is a
> > single-machine procedure for scanning through some data, building a
> > set of histograms, combining histograms and then expanding the tree.
> > The next step is to decide the best way to distribute this. I'm not an
> > expert here, so any advice or help here is welcome.
>
> That sounds good so far.
>
> > I think the easiest approach would be to use the mappers to construct
> > the set of histograms, and then send all histograms for a given leaf
> > to a reducer, which decides how to expand that leaf. The code I have
> > can almost be ported as-is to a mapper and reducer in this way.
> > Would using the distributed cache to send the updated tree be wise, or
> > is there a better way?
>
> Distributed cache is a very limited thing. You can only put things in at
> program launch and they must remain constant throughout the program's run.
>
> The problem here is that iterated map-reduce is pretty heinously
> inefficient.
>
> The best candidate approaches for avoiding that are to use a BSP sort of
> model (see the Pregel paper at
> http://kowshik.github.com/JPregel/pregel_paper.pdf ) or use an
> unsynchronized model update cycle the way that Vowpal Wabbit does with
> all-reduce or the way that Google's deep learning system does.
>
> Running these approaches on Hadoop without Yarn or Mesos requires a slight
> perversion of the map-reduce paradigm, but is quite doable.
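[Editor's note: for concreteness, here is a rough sketch, in Hadoop's Java MapReduce API, of the mapper/reducer structure Andy describes. It is not the existing Mahout code: mappers accumulate per-(leaf, feature, bin) counts and key them by leaf id, and a reducer merges the partial histograms for one leaf and decides how to expand it. The tree lookup (leafFor), the binning (binFor), and the split choice (chooseSplit) are placeholders, and the job driver is omitted. A job like this would have to be launched once per tree level, which is exactly the iterated map-reduce cost Ted points out.]

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class HistogramTreeStep {

  /** Builds partial (leaf, feature, bin) histograms over this mapper's input split. */
  public static class HistogramMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, Long> counts = new HashMap<>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
      String[] fields = value.toString().split(",");   // one training record per line
      int leaf = leafFor(fields);                       // route the record through the current tree
      for (int f = 0; f < fields.length - 1; f++) {     // last field assumed to be the label
        int bin = binFor(f, fields[f]);                 // discretize the feature value
        String k = leaf + "\t" + f + ":" + bin;         // (leaf, feature:bin)
        counts.merge(k, 1L, Long::sum);                 // aggregate locally, combiner-style
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      // Flush the locally aggregated histograms, keyed by leaf id so that all
      // histograms for one leaf meet at the same reducer.
      for (Map.Entry<String, Long> e : counts.entrySet()) {
        String[] parts = e.getKey().split("\t");
        context.write(new Text(parts[0]), new Text(parts[1] + "=" + e.getValue()));
      }
    }

    private int leafFor(String[] record) { return 0; }  // placeholder: walk the current tree
    private int binFor(int feature, String value) {     // placeholder binning scheme
      return Math.abs(value.hashCode()) % 32;
    }
  }

  /** Merges all partial histograms for one leaf and decides how to expand it. */
  public static class ExpandLeafReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text leaf, Iterable<Text> partials, Context context)
        throws IOException, InterruptedException {
      Map<String, Long> merged = new HashMap<>();
      for (Text p : partials) {
        String[] kv = p.toString().split("=");
        merged.merge(kv[0], Long.parseLong(kv[1]), Long::sum);  // combine histograms bin-wise
      }
      context.write(leaf, new Text(chooseSplit(merged)));       // emit the chosen split for this leaf
    }

    private String chooseSplit(Map<String, Long> histograms) {
      return "split:" + histograms.size();                      // placeholder for the real split criterion
    }
  }
}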
