On Thu, Apr 5, 2012 at 7:03 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> Only advantage I was thinking of was that in some cases reducers might be
> able to take advantage of data locality and avoid multiple HTTP calls, no?
> Data is written anyway, so the last merged file could go on HDFS instead of
> local disk.
> I am new to Hadoop, so I'm just asking to understand the rationale
> behind using local disk for the final output.

So basically it's a tradeoff: you get more replicas to copy from, but you
have two more copies to write. Considering that the data is very
short-lived and doesn't need to be replicated (if the machine fails, the
maps are replayed anyway), writing two replicas that would potentially
never be read seems hurtful.
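To put a rough number on that write amplification, here is a back-of-envelope
sketch in Python. It assumes the default HDFS replication factor of 3 and a
made-up 1 GB of merged map output; both numbers are for illustration only,
not anything specific to a real cluster:

# Sketch: extra I/O from writing shuffle output to HDFS instead of local disk.
# Assumes HDFS replication factor 3 and a hypothetical 1 GB merged map output.

replication_factor = 3
map_output_gb = 1.0  # hypothetical merged output of one map task

local_disk_write_gb = map_output_gb                 # one local copy
hdfs_write_gb = map_output_gb * replication_factor  # 1 local + 2 remote copies

print("local disk write: %.1f GB" % local_disk_write_gb)  # 1.0 GB
print("HDFS write:       %.1f GB" % hdfs_write_gb)        # 3.0 GB, 2 GB over the network

So on those assumptions, every map task would write three times as much data,
two thirds of it across the network, for copies that are usually never read.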

Regarding locality, it might make sense on a small cluster, but the more
nodes you add, the smaller the chance of having a local replica for each
block of data you're looking for.
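A quick sketch of that effect, assuming uniform block placement and the
default replication factor of 3 (both simplifications for illustration):

# Sketch: chance that a given reducer node holds a local replica of a given
# map's output, assuming uniform placement and replication factor 3.

replication_factor = 3

for nodes in (5, 20, 100, 1000):
    p_local = min(1.0, replication_factor / float(nodes))
    print("%4d nodes -> ~%.0f%% chance a given block is local" % (nodes, p_local * 100))

On 5 nodes you might get locality more often than not, but at hundreds of
nodes almost every fetch would be remote anyway.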

J-D
