On 03/07/2013 04:56 PM, Ted Dunning wrote:
On Thu, Mar 7, 2013 at 6:25 AM, Andy Twigg <[email protected]> wrote:
... Right now what we have is a
single-machine procedure for scanning through some data, building a
set of histograms, combining histograms and then expanding the tree.
The next step is to decide the best way to distribute this. I'm not an
expert here, so any advice or help here is welcome.
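The scan/build/combine steps described above can be sketched in a few lines. This is a toy illustration, not the actual code being discussed: fixed-width bins and the function names are my own assumptions. The key property it shows is that histograms over the same bins combine by simple addition, which is what makes them easy to merge across machines.

```python
from collections import Counter

def build_histogram(values, num_bins=10, lo=0.0, hi=1.0):
    """Scan values and bin them into a fixed-width histogram (bin -> count)."""
    width = (hi - lo) / num_bins
    h = Counter()
    for v in values:
        b = min(int((v - lo) / width), num_bins - 1)  # clamp top edge into last bin
        h[b] += 1
    return h

def combine_histograms(h1, h2):
    """Histograms over the same bin layout combine by addition, in any order."""
    return h1 + h2
```

Because combination is just addition, partial histograms from different data shards can be merged in any grouping, which is exactly what a distributed version needs.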
That sounds good so far.
I think the easiest approach would be to use the mappers to construct
the set of histograms, and then send all histograms for a given leaf
to a reducer, which decides how to expand that leaf. The code I have
can almost be ported as-is to a mapper and reducer in this way.
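A minimal in-process sketch of that map/reduce split, with the key layout and helper names being my own illustration (in the real job, `route` would walk the current tree to find a record's leaf, and the reducer would run a proper split criterion rather than just picking the most populated bin):

```python
from collections import Counter

NUM_BINS = 10

def mapper(records, route):
    """Map step: for each record, emit partial histogram counts keyed by
    (leaf_id, feature, bin). `route` maps a feature vector to its leaf."""
    out = Counter()
    for features in records:
        leaf = route(features)
        for f, v in enumerate(features):
            b = min(int(v * NUM_BINS), NUM_BINS - 1)  # assume v in [0, 1)
            out[(leaf, f, b)] += 1
    return out

def reducer(leaf_id, partial_histograms):
    """Reduce step: all histograms for one leaf arrive here; sum them and
    decide how to expand the leaf (toy criterion: most populated bin)."""
    total = Counter()
    for h in partial_histograms:
        for (l, f, b), c in h.items():
            if l == leaf_id:
                total[(f, b)] += c
    (feature, bin_), _count = total.most_common(1)[0]
    return feature, bin_

# Simulate two mappers feeding the reducer for leaf 0:
part1 = mapper([[0.1, 0.9], [0.15, 0.8]], route=lambda fv: 0)
part2 = mapper([[0.12, 0.85]], route=lambda fv: 0)
split = mapper and reducer(0, [part1, part2])
```

Shuffling on the leaf id is what brings every partial histogram for a given leaf to one reducer, so each leaf's expansion decision is made with the full counts.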
Would using the distributed cache to send the updated tree be wise, or
is there a better way?
Distributed cache is a very limited thing. You can only put things in at
program launch and they must remain constant throughout the program's run.
The problem here is that iterated map-reduce is pretty heinously
inefficient.
The best candidate approaches for avoiding that are to use a BSP sort of
model (see the Pregel paper at
http://kowshik.github.com/JPregel/pregel_paper.pdf ) or use an
unsynchronized model update cycle the way that Vowpal Wabbit does with
all-reduce or the way that Google's deep learning system does.
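The all-reduce pattern mentioned above avoids per-iteration job launches: every worker contributes a local update, the updates are summed, and every worker receives the sum, with no single reducer as a bottleneck. A toy in-process version, with threads standing in for machines (a real all-reduce would combine over a tree or ring topology rather than a shared lock):

```python
import threading

def allreduce_sum(num_workers, local_values):
    """Toy all-reduce: each worker contributes a vector; every worker
    gets back the element-wise sum across all workers."""
    total = [0.0] * len(local_values[0])
    lock = threading.Lock()
    barrier = threading.Barrier(num_workers)
    results = [None] * num_workers

    def worker(rank):
        with lock:                    # reduce phase: fold in this worker's vector
            for i, v in enumerate(local_values[rank]):
                total[i] += v
        barrier.wait()                # wait until every contribution is in
        results[rank] = list(total)   # broadcast phase: all workers read the sum

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each "iteration" of training is then just a local compute followed by one all-reduce, with the workers staying alive throughout, rather than a fresh map-reduce job per round.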
Running these approaches on Hadoop without Yarn or Mesos requires a slight
perversion of the map-reduce paradigm, but is quite doable.
It seems like we might be jumping through some hoops to do this in Map
Reduce. Is it possible to use Yarn or Mesos? Is Mahout targeted at
only MR V1?