Data locality for a custom input format
Hi hadoop devs,

I'm implementing a custom input format and want to understand how to make use of data locality. AFAIU, only FileInputFormat makes use of data locality, since the JobTracker decides locality based on the block locations defined in the file input split. So the JobTracker code is partly responsible for this. Providing data locality for a custom input format would therefore mean either extending FileInputFormat or modifying the JobTracker code (if that even makes sense). Is my understanding correct?

--
Regards,
Tharindu
blog: http://mackiemathew.com/
Re: Data locality for a custom input format
Tharindu,

InputSplit#getLocations(), i.e. http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/InputSplit.html#getLocations(), is what decides the locality of a task. Your custom InputFormat needs to prepare the right array of InputSplit objects. The number of splits equals the number of map tasks, and the scheduler uses each split's locations array to assign the task to a local node. For a FileSplit, this is as easy as passing along the block locations obtained from the NameNode. For other kinds of splits, you need to fill in the locations yourself.

On Sat, Nov 12, 2011 at 7:12 PM, Tharindu Mathew mcclou...@gmail.com wrote:
> [...]

--
Harsh J
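To make the point about filling in locations yourself concrete, here is a minimal sketch in plain Java. It is not Hadoop code: ShardSplit, buildSplits and the host names are hypothetical, and the only things borrowed from Hadoop are the method names getLength() and getLocations(), which match the ones declared on org.apache.hadoop.mapreduce.InputSplit. The assumption is a non-HDFS store that can tell you which hosts hold each shard.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LocalitySketch {

    // Hypothetical split covering one shard of a non-HDFS store.
    // In a real job this class would extend org.apache.hadoop.mapreduce.InputSplit
    // (and be serializable by the framework); that is left out here so the
    // sketch compiles with the JDK alone.
    static final class ShardSplit {
        private final String shardId;
        private final long length;
        private final String[] hosts;

        ShardSplit(String shardId, long length, String[] hosts) {
            this.shardId = shardId;
            this.length = length;
            this.hosts = hosts.clone();
        }

        String getShardId() {
            return shardId;
        }

        // Size of the split in bytes.
        long getLength() {
            return length;
        }

        // Host names of the nodes that hold this shard's data; this is the
        // array the scheduler would consult when placing the map task.
        String[] getLocations() {
            return hosts.clone();
        }
    }

    // Plays the role of InputFormat#getSplits: one split (and so one map task)
    // per shard, each carrying the hosts that store that shard.
    // Assumption: every shard has the same length, to keep the sketch short.
    static List<ShardSplit> buildSplits(Map<String, String[]> shardHosts, long shardLength) {
        List<ShardSplit> splits = new ArrayList<>();
        for (Map.Entry<String, String[]> e : shardHosts.entrySet()) {
            splits.add(new ShardSplit(e.getKey(), shardLength, e.getValue()));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Made-up shard-to-host mapping; a real InputFormat would ask the store for it.
        Map<String, String[]> shardHosts = new LinkedHashMap<>();
        shardHosts.put("shard-0", new String[] {"node1.example.com", "node2.example.com"});
        shardHosts.put("shard-1", new String[] {"node3.example.com"});

        for (ShardSplit split : buildSplits(shardHosts, 1024L)) {
            System.out.println(split.getShardId() + " -> " + String.join(",", split.getLocations()));
        }
    }
}
```

In an actual InputFormat, the split class would extend InputSplit and the list would be returned from getSplits(). The host names presumably need to match the names the cluster's worker nodes report, otherwise the scheduler has nothing to match against.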
Re: Data locality for a custom input format
There's this article on InfoQ that deals with this issue... ;-)

http://www.infoq.com/articles/HadoopInputFormat

Mike Segel

On Nov 12, 2011, at 7:51 AM, Harsh J ha...@cloudera.com wrote:
> [...]