Hi Mehal,

> I am confused over how MapReduce tasks select data blocks for processing user
> requests ?

I suggest reading chapter 6 of Tom White's Hadoop: The Definitive Guide,
titled "How MapReduce Works". It explains almost everything you need to
know in very clear language, and this or other good books like it should
help you generally.

> As data block replication replicates single data block over multiple
> datanodes, during job processing how uniquely data blocks are selected for
> processing user requests ?

The first point to clear up is that MapReduce is not hard-tied to HDFS. It
generates splits on any FS, and the splits are unique, based on your given
input path. Each split maps to exactly one task, so each task's input is
fixed at submit-time itself. Each split is defined by its path, its start
offset into the file, and the length to be processed after that offset -
which "uniquely" defines it. Replicas of a block are copies of the same
bytes, so they do not produce extra splits.

> How does it guarantees that no same block gets chosen twice or thrice for
> different mapper task.

See above - each "block" (or a "split" in MR terms) is defined by its
start offset and length. No two splits generated for a single file are the
same, since we generate them that way, so the case you're worried about
won't arise.

On Sat, Feb 9, 2013 at 6:10 AM, Mehal Patel <[email protected]> wrote:
> Hello All,
>
> I am confused over how MapReduce tasks select data blocks for processing
> user requests ?
>
> As data block replication replicates single data block over multiple
> datanodes, during job processing how uniquely
> data blocks are selected for processing user requests ? How does it
> guarantees that no same block gets chosen twice or thrice
> for different mapper task.
>
>
> Thank you
>
> -Mehal

--
Harsh J
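
P.S. In case a concrete picture helps, here is a rough sketch of the idea that a split is just (path, start offset, length). It is not Hadoop's actual code: the `Split` type, the `compute_splits` function, the file path and the sizes are all made up for illustration, and the real input formats apply extra rules that are ignored here. It only shows why splits cut from one file this way cannot repeat or overlap.

```python
from typing import List, NamedTuple


class Split(NamedTuple):
    """Hypothetical stand-in for a MapReduce input split."""
    path: str    # file the split belongs to
    start: int   # byte offset into the file
    length: int  # number of bytes to process from that offset


def compute_splits(path: str, file_length: int, split_size: int) -> List[Split]:
    """Cut a file of file_length bytes into consecutive, non-overlapping splits.

    Simplified assumption: every split is split_size bytes long except
    possibly the last one, which takes whatever remains.
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    offset = 0
    while offset < file_length:
        length = min(split_size, file_length - offset)
        splits.append(Split(path, offset, length))
        offset += length  # the next split starts where this one ends
    return splits


if __name__ == "__main__":
    # Illustrative values only: a 300-byte file cut into 128-byte splits.
    for split in compute_splits("/data/input.txt", 300, 128):
        print(split)
```

Each offset advances by the length of the previous split, so every byte of the file falls in exactly one split, and one split goes to one map task.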
