DS,

What you say is true, but there are finer points:

   1. Data transfer can begin while the mappers are still working through the
   data. You would still bottleneck on the network if (a) you have enough
   nodes and spindles that the aggregate disk transfer speed exceeds the
   network capacity, and (b) the computation is so trivial that you produce
   data faster than the network can sustain.
   2. Compression can work wonders on the intermediate map output. You also
   get a better compression ratio if you have a large map output key that
   repeats across many records. (Imagine: {K,V1}, {K,V2}, {K,V3} being
   transferred, in effect, as {K, {V1, V2, V3}}.) There is a config sketch
   for this after the list.
   3. In reality, most algorithms can be designed so that the map output is
   much, much smaller than the input data (e.g., count, sum, min, max); a
   combiner sketch follows the list.
   4. If you're doing a simple transformation where 1 line of input = 1 line
   of output (e.g., the identity mapper), then you can configure it as a
   map-only job, so there is no shuffle at all (see the last sketch below).
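
To make (2) concrete, here is roughly what turning on map-output compression
looks like in the driver. The property names are the 0.20-era mapred.* ones
and gzip is just an example codec, so treat this as a sketch to check against
your version rather than a recipe:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;

    // Compress the intermediate map output before it is shuffled
    // across the network to the reducers.
    Configuration conf = new Configuration();
    conf.setBoolean("mapred.compress.map.output", true);
    conf.setClass("mapred.map.output.compression.codec",
                  GzipCodec.class, CompressionCodec.class);

Because the map output is sorted by key, a large key that repeats on every
record compresses very well, which is where the {K, {V1, V2, V3}} effect
comes from.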
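
For (3), the usual mechanism is a combiner, which pre-aggregates on the map
side so that only one partial result per key leaves each map task. A minimal
word-count-style sketch against the 0.20 mapreduce API (the class name and
paths are made up for illustration):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LogCount {

      // Emits {token, 1} for every token in the input line.
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          for (String tok : value.toString().split("\\s+")) {
            if (tok.isEmpty()) continue;
            word.set(tok);
            ctx.write(word, ONE);
          }
        }
      }

      // Sums the counts for a key; used as both combiner and reducer.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : vals) sum += v.get();
          ctx.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "log count");
        job.setJarByClass(LogCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);  // map-side pre-aggregation
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Reusing the reducer as the combiner is fine here because summing is
commutative and associative; the same goes for count, min and max.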
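
And for (4), a map-only job is just a matter of asking for zero reduce tasks
in a driver like the one above; the map output is written straight to HDFS
and there is no sort/shuffle phase at all:

    job.setNumReduceTasks(0);  // no reducers, so no shuffle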

Hope this helps,

- P
On Mon, Dec 21, 2009 at 2:14 AM, Doopah Shaf <doopha.s...@gmail.com> wrote:

> Trying to figure out how hadoop actually achieves its speed. Assuming that
> data locality is central to the efficiency of hadoop, how does the magic
> actually happen, given that data still gets moved all over the network to
> reach the reducers?
>
> For example, if I have 1gb of logs spread across 10 data nodes, and for the
> sake of argument, assume I use the identity mapper. Then 90% of data still
> needs to move across the network - how does the network not become
> saturated
> this way?
>
> What did I miss?...
> Thanks,
> D.S.
>
