To my understanding, if the data resides in HDFS, the JobTracker will use block location information to assign tasks to TaskTrackers close to the data, which reduces data movement between the data source and the Mapper. Data movement between the Mapper and the Reducer is harder to minimize (one option is to provide an application-specific partitioner).
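To illustrate the partitioner idea: a sketch below contrasts default hash partitioning with an application-specific scheme that routes all records for one host to the same reducer, so per-host aggregation needs no further shuffling. This is a plain-Java sketch, not real Hadoop code; the `hostOf` helper and the `"host:path"` key layout are assumptions made up for the example (the hashing formula mirrors Hadoop's default HashPartitioner).

```java
public class PartitionSketch {

    // Default-style hash partitioning: spreads keys evenly across
    // reducers, but ignores any application-level grouping.
    static int hashPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Application-specific partitioning: partition on the host part of
    // the key only, so every record from one host hits one reducer.
    static int hostPartition(String key, int numReduceTasks) {
        return (hostOf(key).hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical helper: assumes keys look like "host:path".
    static String hostOf(String key) {
        return key.substring(0, key.indexOf(':'));
    }

    public static void main(String[] args) {
        int reducers = 4;
        int p1 = hostPartition("web01:/var/log/a.log", reducers);
        int p2 = hostPartition("web01:/var/log/b.log", reducers);
        // Both records from web01 land in the same partition.
        System.out.println(p1 == p2);
    }
}
```

In real Hadoop code the same logic would live in a subclass of `Partitioner` wired in via the job configuration.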
Hadoop targets parallel processing with complex algorithms, where the data movement overhead is relatively insignificant. It is not for everything.

Rgds,
Ricky

-----Original Message-----
From: Doopah Shaf [mailto:doopha.s...@gmail.com]
Sent: Sunday, December 20, 2009 11:14 PM
To: common-user@hadoop.apache.org
Subject: how does hadoop work?

Trying to figure out how Hadoop actually achieves its speed. Assuming that data locality is central to Hadoop's efficiency, how does the magic actually happen, given that data still gets moved all over the network to reach the reducers?

For example, suppose I have 1 GB of logs spread across 10 data nodes and, for the sake of argument, I use the identity mapper. Then 90% of the data still needs to move across the network. How does the network not become saturated this way? What did I miss?...

Thanks,
D.S.
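The 90% figure in the question can be made concrete with a back-of-the-envelope calculation. Assuming reducers are spread evenly across the same 10 nodes and the default hash partitioner distributes map output uniformly, a record stays on its node only when its target reducer happens to run there, with probability 1/10:

```java
public class ShuffleEstimate {

    // Fraction of map output that must cross the network when reducers
    // are co-located on `nodes` machines and output is hashed uniformly:
    // a record stays local with probability 1/nodes.
    static double remoteMb(double totalMb, int nodes) {
        return totalMb * (1.0 - 1.0 / nodes);
    }

    public static void main(String[] args) {
        // 1 GB of identity-mapper output across 10 data nodes.
        double remote = remoteMb(1024.0, 10);
        System.out.printf("%.1f MB of 1024 MB crosses the network%n", remote);
    }
}
```

So roughly 90% of the shuffle does traverse the network in this scenario, which is why Hadoop's sweet spot is jobs where map-side filtering or aggregation (combiners) shrinks the data before the shuffle, rather than identity-style passthrough.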