When MR assigns data splits to map tasks, does it assign a set of 
non-contiguous blocks to one map?  The reason I ask is, thinking through the 
problem, if I were the MR scheduler I would attempt to hand a map task a bunch 
of blocks that all exist on the same datanode, and then schedule the map task 
on that node.  E.g. if I have an HDFS file with 10000 blocks and I want to 
create 1000 map tasks, I'd like each map task to get 10 blocks, but the blocks 
held by any one datanode are unlikely to be contiguous within the file.
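
To make the idea concrete, here is a rough sketch of the grouping I have in 
mind.  It is plain Python, not Hadoop code: the function name, the 
block_locations map and the greedy "least-loaded replica" rule are all my own 
assumptions for illustration, and I am not claiming the MR scheduler works 
this way.

```python
from collections import defaultdict


def group_blocks_by_node(block_locations, blocks_per_task):
    """Group blocks into map tasks so each task reads from a single node.

    block_locations: dict of block id -> list of datanode names holding a
                     replica (hypothetical input, not an HDFS API).
    blocks_per_task: maximum number of blocks handed to one map task.
    Returns a list of (node, [block ids]) tuples.
    """
    per_node = defaultdict(list)
    for block_id in sorted(block_locations):
        replicas = block_locations[block_id]
        # Greedy choice: the replica holder with the fewest blocks so far,
        # ties broken by node name so the result is deterministic.
        node = min(replicas, key=lambda n: (len(per_node[n]), n))
        per_node[node].append(block_id)

    tasks = []
    for node in sorted(per_node):
        blocks = per_node[node]
        # Chunk each node's blocks into tasks of at most blocks_per_task.
        for i in range(0, len(blocks), blocks_per_task):
            tasks.append((node, blocks[i:i + blocks_per_task]))
    return tasks


if __name__ == "__main__":
    locations = {0: ["a", "b"], 1: ["b", "c"], 2: ["a", "c"], 3: ["a", "b"]}
    for node, blocks in group_blocks_by_node(locations, 2):
        print(node, blocks)
```

With the toy input above, node "a" ends up with blocks 0 and 3 in a single 
task, which is exactly the non-contiguous case I am asking about.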

This is related to a question I asked earlier: could any benefit be had by 
aligning data splits along block boundaries, so that a read does not spill 
over from one block into the next and require another datanode connection?  
The answer I got was that the extra connection overhead wasn't important.  The 
reason I bring this up again is that comments in this discussion 
(https://issues.apache.org/jira/browse/HADOOP-3315) imply that doing an extra 
seek to the beginning of the file to read a magic number on open is a 
significant overhead, and this looks like a similar issue to me.
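
For a sense of how often a split would spill into a second block, here is a 
small back-of-the-envelope sketch.  Again this is plain Python with names I 
made up; it assumes fixed-size splits cut at byte offsets and ignores record 
boundaries, so it only illustrates the arithmetic.

```python
def count_straddling_splits(file_size, block_size, split_size):
    """Count fixed-size splits that span more than one block.

    All sizes are in bytes. A split covering [start, end) straddles a block
    boundary when its first and last bytes fall in different blocks.
    """
    straddling = 0
    for start in range(0, file_size, split_size):
        end = min(start + split_size, file_size)  # exclusive end offset
        first_block = start // block_size
        last_block = (end - 1) // block_size
        if last_block > first_block:
            straddling += 1
    return straddling


if __name__ == "__main__":
    # Splits aligned to blocks never straddle; misaligned splits often do.
    print(count_straddling_splits(1000, 100, 100))
    print(count_straddling_splits(1000, 100, 150))
```

When the split size equals the block size no split needs a second datanode, 
whereas a split size that is not a multiple (or divisor) of the block size 
makes many of the splits touch two blocks.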

Thanks,
john
