On Thu, Mar 7, 2013 at 6:25 AM, Andy Twigg <andy.tw...@gmail.com> wrote:

> ... Right now what we have is a
> single-machine procedure for scanning through some data, building a
> set of histograms, combining histograms and then expanding the tree.
> The next step is to decide the best way to distribute this. I'm not an
> expert here, so any advice or help here is welcome.
>

That sounds good so far.
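
For anyone following along, the combine step on bounded-size histograms is
basically merge-then-collapse.  A rough sketch (this assumes a sorted list of
(centroid, count) bins capped at maxBins, which may or may not match the
representation in your code):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class Bins {
  public static class Bin {
    final double centroid;
    final long count;
    Bin(double centroid, long count) { this.centroid = centroid; this.count = count; }
  }

  // Concatenate the two bin lists, then repeatedly merge the two closest
  // adjacent bins (weighted by count) until we are back under the cap.
  public static List<Bin> combine(List<Bin> a, List<Bin> b, int maxBins) {
    List<Bin> all = new ArrayList<Bin>(a);
    all.addAll(b);
    Collections.sort(all, new Comparator<Bin>() {
      public int compare(Bin x, Bin y) { return Double.compare(x.centroid, y.centroid); }
    });
    while (all.size() > maxBins) {
      int best = 0;  // index of the left bin of the closest adjacent pair
      for (int i = 1; i < all.size() - 1; i++) {
        if (all.get(i + 1).centroid - all.get(i).centroid
            < all.get(best + 1).centroid - all.get(best).centroid) {
          best = i;
        }
      }
      Bin x = all.get(best), y = all.get(best + 1);
      long n = x.count + y.count;
      all.set(best, new Bin((x.centroid * x.count + y.centroid * y.count) / n, n));
      all.remove(best + 1);
    }
    return all;
  }
}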


> I think the easiest approach would be to use the mappers to construct
> the set of histograms, and then send all histograms for a given leaf
> to a reducer, which decides how to expand that leaf. The code I have
> can almost be ported as-is to a mapper and reducer in this way.
> Would using the distributed cache to send the updated tree be wise, or
> is there a better way?
>
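
That layout (mappers building per-leaf histograms, one reduce group per leaf
deciding the split) would look roughly like the sketch below.  This is just a
sketch, not Mahout code: currentLeaf(), binOf() and chooseBestSplit() are
placeholders for the pieces you already have, and the histogram is collapsed
to a map from (feature, bin, label) to a count.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ExpandLeavesJob {

  // Mapper: route each point through the current tree to a leaf, then emit
  // one histogram cell per feature, keyed by that leaf.
  public static class HistogramMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");   // label,f1,f2,...
      String label = fields[0];
      int leaf = currentLeaf(fields);                 // placeholder: walk the current tree
      for (int f = 1; f < fields.length; f++) {
        int bin = binOf(f, Double.parseDouble(fields[f]));  // placeholder: discretize
        ctx.write(new Text(Integer.toString(leaf)),
                  new Text(f + ":" + bin + ":" + label + ":1"));
      }
    }

    private int currentLeaf(String[] fields) { return 0; }
    private int binOf(int feature, double value) { return (int) value; }
  }

  // Reducer: everything for one leaf lands here; merge the cells into a
  // single histogram and decide how (or whether) to expand that leaf.
  public static class ExpandLeafReducer
      extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text leaf, Iterable<Text> cells, Context ctx)
        throws IOException, InterruptedException {
      Map<String, Long> histogram = new HashMap<String, Long>();
      for (Text cell : cells) {
        String[] p = cell.toString().split(":");      // feature:bin:label:count
        String key = p[0] + ":" + p[1] + ":" + p[2];
        Long old = histogram.get(key);
        histogram.put(key, (old == null ? 0L : old) + Long.parseLong(p[3]));
      }
      ctx.write(leaf, new Text(chooseBestSplit(histogram)));  // placeholder
    }

    private String chooseBestSplit(Map<String, Long> histogram) { return "leave-as-leaf"; }
  }
}

A combiner doing the same merge map-side would cut the shuffle down from one
record per point per feature to roughly one per histogram cell per mapper.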

Distributed cache is a very limited thing.  You can only put things in when a
job is launched, and they must remain constant for that job's entire run.
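
To make that concrete: in an iterated setup the driver ends up writing the
updated tree to HDFS and re-attaching it before submitting each round's job.
A sketch with the Hadoop 1.x API (the path and loadTree() are made up):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class IterationDriver {
  public static void runIteration(Configuration conf, int i) throws Exception {
    Job job = new Job(conf, "expand-leaves-" + i);
    // The tree written by iteration i-1 is attached at submission time and
    // is read-only for the life of this job.
    DistributedCache.addCacheFile(new URI("/trees/tree-" + (i - 1)),
                                  job.getConfiguration());
    // ... set mapper/reducer/input/output, then job.waitForCompletion(true)
  }
}

// In Mapper.setup() each task reads it back:
//   Path[] local = DistributedCache.getLocalCacheFiles(context.getConfiguration());
//   tree = loadTree(local[0]);   // loadTree() is hypothetical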

The problem here is that iterated map-reduce is pretty heinously
inefficient: every iteration pays the full job startup cost and has to
materialize its results on HDFS before the next round can read them.

The best candidate approaches for avoiding that are to use a BSP sort of
model (see the Pregel paper at
http://kowshik.github.com/JPregel/pregel_paper.pdf ) or to use an
unsynchronized model update cycle, the way that Vowpal Wabbit does with
all-reduce or the way that Google's deep learning system does.
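
The unsynchronized flavor is easier to show than to describe.  Here is a toy
version, shrunk from machines down to threads in one JVM: each worker reads
the shared weights, computes a step, and writes it back with no locking and
no barrier between workers (fakeGradient() is obviously a stand-in).

import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class UnsyncedUpdates {
  // Shared model, deliberately not synchronized.
  static final double[] weights = new double[100];

  public static void main(String[] args) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(8);
    for (int w = 0; w < 8; w++) {
      pool.submit(new Runnable() {
        public void run() {
          Random rnd = new Random();
          for (int step = 0; step < 100000; step++) {
            int j = rnd.nextInt(weights.length);
            double g = fakeGradient(j, weights[j]);  // stand-in for a real gradient
            weights[j] -= 0.01 * g;                  // racy update, by design
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.MINUTES);
  }

  static double fakeGradient(int j, double w) { return w - 1.0; }
}

Across machines the shared array becomes a parameter server or an all-reduce
among the workers, but either way the workers talk to each other directly
rather than through a shuffle.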

Running these approaches on Hadoop without Yarn or Mesos requires a slight
perversion of the map-reduce paradigm, but is quite doable.
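
The usual trick (and this is a guess at what that perversion would look like,
not a recipe) is to launch a map-only job whose mappers drain their splits
and then stay alive as ordinary long-lived workers that talk to each other,
or to a coordinator, directly:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WorkerMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  @Override
  public void run(Context ctx) throws IOException, InterruptedException {
    // The input split only exists to get N tasks scheduled; drain it and move on.
    while (ctx.nextKeyValue()) {
      // ignore
    }
    // From here on this task is just a long-lived process: open connections
    // to its peers, run BSP supersteps or all-reduce rounds, and exit when
    // the driver decides the model is done.
    runWorkerLoop(ctx.getConfiguration().get("worker.coordinator"));  // hypothetical
  }

  private void runWorkerLoop(String coordinator) {
    // placeholder for the actual BSP / all-reduce worker
  }
}

You do have to turn off speculative execution and raise the task timeout so
the framework does not kill or duplicate your workers.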
