There's this article on InfoQ that deals with this issue... ;-) http://www.infoq.com/articles/HadoopInputFormat
Sent from a remote device. Please excuse any typos...

Mike Segel

On Nov 12, 2011, at 7:51 AM, Harsh J <ha...@cloudera.com> wrote:

> Tharindu,
>
> InputSplit#getLocations(), i.e.,
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/InputSplit.html#getLocations()
> is used to decide the locality of a task. You need your custom InputFormat
> to prepare the right array of these split objects. The number of split
> objects == the number of map tasks, and each split's locations array gets
> used by the scheduler for local assignment.
>
> For a FileSplit this preparation is as easy as passing in the block
> locations obtained from the NameNode. For other types of splits, you
> need to fill the locations in yourself.
>
> On Sat, Nov 12, 2011 at 7:12 PM, Tharindu Mathew <mcclou...@gmail.com> wrote:
>> Hi hadoop devs,
>>
>> I'm implementing a custom InputFormat and want to understand how to make
>> use of data locality.
>>
>> AFAIU, only FileInputFormat makes use of data locality, since the JobTracker
>> assigns tasks for locality based on the block locations recorded in the
>> FileSplit.
>>
>> So the JobTracker code is partly responsible for this, and providing data
>> locality for a custom InputFormat would mean either extending
>> FileInputFormat or modifying the JobTracker code (if that even makes sense).
>>
>> Is my understanding correct?
>>
>> --
>> Regards,
>>
>> Tharindu
>>
>> blog: http://mackiemathew.com/
>
> --
> Harsh J
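To make Harsh's point concrete, here is a minimal sketch of a custom split that reports its own locations. The class and host names (KeyRangeSplit, node1.example.com, etc.) are hypothetical, and a local stand-in for InputSplit is used so the sketch compiles without Hadoop on the classpath; a real implementation would instead extend org.apache.hadoop.mapreduce.InputSplit and implement Writable so the framework can serialize it.

```java
import java.util.Arrays;
import java.util.List;

// Local stand-in mirroring the relevant methods of
// org.apache.hadoop.mapreduce.InputSplit, so this sketch is self-contained.
abstract class InputSplit {
    public abstract long getLength();
    public abstract String[] getLocations(); // hosts holding this split's data
}

// Hypothetical custom split over a key range in some external store.
// getLocations() returns the hosts where the range's data lives; the
// scheduler compares these names against task tracker hosts for locality.
class KeyRangeSplit extends InputSplit {
    private final String startKey;
    private final String endKey;
    private final String[] hosts; // e.g. looked up from your storage system

    KeyRangeSplit(String startKey, String endKey, String[] hosts) {
        this.startKey = startKey;
        this.endKey = endKey;
        this.hosts = hosts;
    }

    @Override
    public long getLength() {
        return 0; // a real split would report its size in bytes
    }

    @Override
    public String[] getLocations() {
        return hosts; // this array is what drives data-local assignment
    }
}

class LocalitySketch {
    // What a custom InputFormat#getSplits() would produce: one split per
    // map task, each naming the hosts that hold its portion of the data.
    static List<InputSplit> getSplits() {
        return Arrays.asList(
            new KeyRangeSplit("a", "m", new String[] {"node1.example.com"}),
            new KeyRangeSplit("m", "z", new String[] {"node2.example.com"}));
    }

    public static void main(String[] args) {
        for (InputSplit split : getSplits()) {
            System.out.println(Arrays.toString(split.getLocations()));
        }
    }
}
```

No JobTracker changes are needed: the scheduler already consults getLocations() on every split, so a custom InputFormat only has to fill the array in correctly.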