A reasonable question is: are you sure you need coordination?

A lot of problems that look like they require coordination between mappers
can actually be made to work more scalably (and much more simply!) by
decomposing them into two back-to-back MapReduce jobs, where some data is
exchanged/aggregated in the middle.
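
As a rough sketch, the driver for the two-pass approach looks something
like this (using the org.apache.hadoop.mapred API; the class name, job
names, and paths below are placeholders, and the identity map/reduce
defaults stand in for your real logic):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class TwoPassDriver {
      public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);  // the "exchange" data
        Path output = new Path(args[2]);

        // Pass 1: mappers emit partial results keyed by whatever needs
        // to be shared; reducers aggregate them into the intermediate
        // files. (Set your real mapper/reducer classes here; the
        // identity defaults are used so this sketch compiles as-is.)
        JobConf pass1 = new JobConf(TwoPassDriver.class);
        pass1.setJobName("pass1-aggregate");
        FileInputFormat.setInputPaths(pass1, input);
        FileOutputFormat.setOutputPath(pass1, intermediate);
        JobClient.runJob(pass1);  // blocks until pass 1 finishes

        // Pass 2: every mapper can read the aggregated output of pass
        // 1, so no coordination between live tasks is needed.
        JobConf pass2 = new JobConf(TwoPassDriver.class);
        pass2.setJobName("pass2-consume");
        FileInputFormat.setInputPaths(pass2, intermediate);
        FileOutputFormat.setOutputPath(pass2, output);
        JobClient.runJob(pass2);
      }
    }

The blocking runJob() call is your synchronization barrier: pass 2
doesn't start until every task of pass 1 has finished.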

Also, you can broadcast data to all the mappers ahead of time by writing it
out to files that are shipped to every node with the DistributedCache. So
if you have a relatively small amount of metadata that's known before the
job starts, you can send it out to all the mappers in a DistributedCache
file.
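
For example (mapred API again; the cache file path and class name are
made up for illustration):

    import java.io.IOException;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    // In the driver, before submitting the job:
    //   DistributedCache.addCacheFile(
    //       new java.net.URI("/meta/lookup.dat"), conf);

    public class MetadataMapper extends MapReduceBase {
      private Path metadataFile;  // node-local copy of the cache file

      @Override
      public void configure(JobConf conf) {
        try {
          Path[] cached = DistributedCache.getLocalCacheFiles(conf);
          if (cached != null && cached.length > 0) {
            metadataFile = cached[0];
            // parse the file here (e.g. with a BufferedReader) and
            // keep the result in memory for use in map()
          }
        } catch (IOException e) {
          throw new RuntimeException("couldn't read cache files", e);
        }
      }
    }

Each task then reads a copy of the file from its node's local disk, so
there's no per-record HDFS traffic and no coordination required.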

If you could describe your goals more fully, we could offer advice more
specific to your situation.

Cheers,
- Aaron


On Thu, Mar 19, 2009 at 10:03 AM, Owen O'Malley <omal...@apache.org> wrote:

>
> On Mar 18, 2009, at 10:26 AM, Stuart White wrote:
>
>  I'd like to implement some coordination between Mapper tasks running
>> on the same node.  I was thinking of using ZooKeeper to provide this
>> coordination.
>>
>
> This is a very bad idea in the general case. It can be made to work, but
> you need a dedicated cluster so that you can be sure the map tasks are all
> active simultaneously. Otherwise, you have no guarantee that all of the
> maps are ever running at the same time.
>
> In most cases, you are much better off using the standard communication
> between the maps and reduces and making multiple passes of jobs.
>
>  I think I remember hearing that MapReduce and/or HDFS use ZooKeeper
>> under-the-covers.
>>
>
> Neither of them uses it today. ZooKeeper has been discussed for HDFS high
> availability (HA), but there are no immediate plans to implement HA yet.
>
> -- Owen
>
