I am going through the chapter "How mapreduce works" and have some confusion:
1) Below description of Mapper says that reducers get the output file using HTTP call. But the description under "The Reduce Side" doesn't specifically say if it's copied using HTTP. So first confusion, Is the output copied from mapper -> reducer or from reducer -> mapper? And second, Is the call http:// or hdfs:// 2) My understanding was that mapper output gets written to hdfs, since I've seen part-m-00000 files in hdfs. If mapper output is written to HDFS then shouldn't reducers simply read it from hdfs instead of making http calls to tasktrackers location? ----- from the book --- Mapper The output file’s partitions are made available to the reducers over HTTP. The number of worker threads used to serve the file partitions is controlled by the tasktracker.http.threads property this setting is per tasktracker, not per map task slot. The default of 40 may need increasing for large clusters running large jobs.6.4.2. The Reduce Side Let’s turn now to the reduce part of the process. The map output file is sitting on the local disk of the tasktracker that ran the map task (note that although map outputs always get written to the local disk of the map tasktracker, reduce outputs may not be), but now it is needed by the tasktracker that is about to run the reduce task for the partition. Furthermore, the reduce task needs the map output for its particular partition from several map tasks across the cluster. The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes. This is known as the copy phase of the reduce task. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel. The default is five threads, but this number can be changed by setting the mapred.reduce.parallel.copies property.