Doubt from the book "Definitive Guide"

Mohit Anchlia Wed, 04 Apr 2012 16:56:54 -0700

I am going through the chapter "How mapreduce works" and have some
confusion:


1) Below description of Mapper says that reducers get the output file using
HTTP call. But the description under "The Reduce Side" doesn't specifically
say if it's copied using HTTP. So first confusion, Is the output copied
from mapper -> reducer or from reducer -> mapper? And second, Is the call
http:// or hdfs://

2) My understanding was that mapper output gets written to hdfs, since I've
seen part-m-00000 files in hdfs. If mapper output is written to HDFS then
shouldn't reducers simply read it from hdfs instead of making http calls to
tasktrackers location?



----- from the book ---
Mapper
The output file’s partitions are made available to the reducers over HTTP.
The number of worker threads used to serve the file partitions is
controlled by the tasktracker.http.threads property
this setting is per tasktracker, not per map task slot. The default of 40
may need increasing for large clusters running large jobs.6.4.2.

The Reduce Side
Let’s turn now to the reduce part of the process. The map output file is
sitting on the local disk of the tasktracker that ran the map task
(note that although map outputs always get written to the local disk of the
map tasktracker, reduce outputs may not be), but now it is needed by the
tasktracker
that is about to run the reduce task for the partition. Furthermore, the
reduce task needs the map output for its particular partition from several
map tasks across the cluster.
The map tasks may finish at different times, so the reduce task starts
copying their outputs as soon as each completes. This is known as the copy
phase of the reduce task.
The reduce task has a small number of copier threads so that it can fetch
map outputs in parallel.
The default is five threads, but this number can be changed by setting the
mapred.reduce.parallel.copies property.

Doubt from the book "Definitive Guide"

Reply via email to