That sounds like a horrid amount of work to do something simple. Is there a Hadoop implementation of a master-workers problem you can point me to?

On Mar 7, 2013 9:57 PM, "Ted Dunning" <[email protected]> wrote:
> On Thu, Mar 7, 2013 at 6:25 AM, Andy Twigg <[email protected]> wrote:
>
> > ... Right now what we have is a
> > single-machine procedure for scanning through some data, building a
> > set of histograms, combining histograms and then expanding the tree.
> > The next step is to decide the best way to distribute this. I'm not an
> > expert here, so any advice or help here is welcome.
>
> That sounds good so far.
>
> > I think the easiest approach would be to use the mappers to construct
> > the set of histograms, and then send all histograms for a given leaf
> > to a reducer, which decides how to expand that leaf. The code I have
> > can almost be ported as-is to a mapper and reducer in this way.
> > Would using the distributed cache to send the updated tree be wise, or
> > is there a better way?
>
> Distributed cache is a very limited thing. You can only put things in at
> program launch and they must remain constant throughout the program's run.
>
> The problem here is that iterated map-reduce is pretty heinously
> inefficient.
>
> The best candidate approaches for avoiding that are to use a BSP sort of
> model (see the Pregel paper at
> http://kowshik.github.com/JPregel/pregel_paper.pdf ) or use an
> unsynchronized model update cycle the way that Vowpal Wabbit does with
> all-reduce or the way that Google's deep learning system does.
>
> Running these approaches on Hadoop without Yarn or Mesos requires a slight
> perversion of the map-reduce paradigm, but is quite doable.
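[Editor's note: for concreteness, here is a rough sketch, in Hadoop's Java MapReduce API, of the mapper/reducer structure Andy describes. It is not the existing Mahout code: mappers accumulate per-(leaf, feature, bin) counts and key them by leaf id, and a reducer merges the partial histograms for one leaf and decides how to expand it. The tree lookup (leafFor), the binning (binFor), and the split choice (chooseSplit) are placeholders, and the job driver is omitted. A job like this would have to be launched once per tree level, which is exactly the iterated map-reduce cost Ted points out.]

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class HistogramTreeStep {

  /** Builds partial (leaf, feature, bin) histograms over this mapper's input split. */
  public static class HistogramMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, Long> counts = new HashMap<>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
      String[] fields = value.toString().split(",");   // one training record per line
      int leaf = leafFor(fields);                       // route the record through the current tree
      for (int f = 0; f < fields.length - 1; f++) {     // last field assumed to be the label
        int bin = binFor(f, fields[f]);                 // discretize the feature value
        String k = leaf + "\t" + f + ":" + bin;         // (leaf, feature:bin)
        counts.merge(k, 1L, Long::sum);                 // aggregate locally, combiner-style
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      // Flush the locally aggregated histograms, keyed by leaf id so that all
      // histograms for one leaf meet at the same reducer.
      for (Map.Entry<String, Long> e : counts.entrySet()) {
        String[] parts = e.getKey().split("\t");
        context.write(new Text(parts[0]), new Text(parts[1] + "=" + e.getValue()));
      }
    }

    private int leafFor(String[] record) { return 0; }  // placeholder: walk the current tree
    private int binFor(int feature, String value) {     // placeholder binning scheme
      return Math.abs(value.hashCode()) % 32;
    }
  }

  /** Merges all partial histograms for one leaf and decides how to expand it. */
  public static class ExpandLeafReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text leaf, Iterable<Text> partials, Context context)
        throws IOException, InterruptedException {
      Map<String, Long> merged = new HashMap<>();
      for (Text p : partials) {
        String[] kv = p.toString().split("=");
        merged.merge(kv[0], Long.parseLong(kv[1]), Long::sum);  // combine histograms bin-wise
      }
      context.write(leaf, new Text(chooseSplit(merged)));       // emit the chosen split for this leaf
    }

    private String chooseSplit(Map<String, Long> histograms) {
      return "split:" + histograms.size();                      // placeholder for the real split criterion
    }
  }
}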
