On 03/07/2013 04:56 PM, Ted Dunning wrote:
On Thu, Mar 7, 2013 at 6:25 AM, Andy Twigg <[email protected]> wrote:
... Right now what we have is a
single-machine procedure for scanning through some data, building a
set of histograms, combining histograms and then expanding the tree.
The next step is to decide the best way to distribute this. I'm not an
expert here, so any advice or help here is welcome.
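The scan/build/combine steps described above can be sketched in a few lines. This is a toy illustration, not the actual code being discussed: fixed-width bins and the function names are my own assumptions. The key property it shows is that histograms over the same bins combine by simple addition, which is what makes them easy to merge across machines.

```python
from collections import Counter

def build_histogram(values, num_bins=10, lo=0.0, hi=1.0):
    """Scan values and bin them into a fixed-width histogram (bin -> count)."""
    width = (hi - lo) / num_bins
    h = Counter()
    for v in values:
        b = min(int((v - lo) / width), num_bins - 1)  # clamp top edge into last bin
        h[b] += 1
    return h

def combine_histograms(h1, h2):
    """Histograms over the same bin layout combine by addition, in any order."""
    return h1 + h2
```

Because combination is just addition, partial histograms from different data shards can be merged in any grouping, which is exactly what a distributed version needs.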
That sounds good so far.
I think the easiest approach would be to use the mappers to construct
the set of histograms, and then send all histograms for a given leaf
to a reducer, which decides how to expand that leaf. The code I have
can almost be ported as-is to a mapper and reducer in this way.
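A minimal in-process sketch of that map/reduce split, with the key layout and helper names being my own illustration (in the real job, `route` would walk the current tree to find a record's leaf, and the reducer would run a proper split criterion rather than just picking the most populated bin):

```python
from collections import Counter

NUM_BINS = 10

def mapper(records, route):
    """Map step: for each record, emit partial histogram counts keyed by
    (leaf_id, feature, bin). `route` maps a feature vector to its leaf."""
    out = Counter()
    for features in records:
        leaf = route(features)
        for f, v in enumerate(features):
            b = min(int(v * NUM_BINS), NUM_BINS - 1)  # assume v in [0, 1)
            out[(leaf, f, b)] += 1
    return out

def reducer(leaf_id, partial_histograms):
    """Reduce step: all histograms for one leaf arrive here; sum them and
    decide how to expand the leaf (toy criterion: most populated bin)."""
    total = Counter()
    for h in partial_histograms:
        for (l, f, b), c in h.items():
            if l == leaf_id:
                total[(f, b)] += c
    (feature, bin_), _count = total.most_common(1)[0]
    return feature, bin_

# Simulate two mappers feeding the reducer for leaf 0:
part1 = mapper([[0.1, 0.9], [0.15, 0.8]], route=lambda fv: 0)
part2 = mapper([[0.12, 0.85]], route=lambda fv: 0)
split = mapper and reducer(0, [part1, part2])
```

Shuffling on the leaf id is what brings every partial histogram for a given leaf to one reducer, so each leaf's expansion decision is made with the full counts.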
Would using the distributed cache to send the updated tree be wise, or
is there a better way?
Distributed cache is a very limited thing. You can only put things in at
program launch and they must remain constant throughout the program's run.
The problem here is that iterated map-reduce is pretty heinously
inefficient.
The best candidate approaches for avoiding that are to use a BSP sort of
model (see the Pregel paper at
http://kowshik.github.com/JPregel/pregel_paper.pdf ) or use an
unsynchronized model update cycle the way that Vowpal Wabbit does with
all-reduce or the way that Google's deep learning system does.
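The all-reduce pattern mentioned above avoids per-iteration job launches: every worker contributes a local update, the updates are summed, and every worker receives the sum, with no single reducer as a bottleneck. A toy in-process version, with threads standing in for machines (a real all-reduce would combine over a tree or ring topology rather than a shared lock):

```python
import threading

def allreduce_sum(num_workers, local_values):
    """Toy all-reduce: each worker contributes a vector; every worker
    gets back the element-wise sum across all workers."""
    total = [0.0] * len(local_values[0])
    lock = threading.Lock()
    barrier = threading.Barrier(num_workers)
    results = [None] * num_workers

    def worker(rank):
        with lock:                    # reduce phase: fold in this worker's vector
            for i, v in enumerate(local_values[rank]):
                total[i] += v
        barrier.wait()                # wait until every contribution is in
        results[rank] = list(total)   # broadcast phase: all workers read the sum

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each "iteration" of training is then just a local compute followed by one all-reduce, with the workers staying alive throughout, rather than a fresh map-reduce job per round.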
Running these approaches on Hadoop without Yarn or Mesos requires a slight
perversion of the map-reduce paradigm, but is quite doable.
It seems like we might be jumping through some hoops to do this in Map
Reduce. Is it possible to use Yarn or Mesos? Is Mahout targeted at
only MR V1?