Data locality for a custom input format

2011-11-12 Thread Tharindu Mathew
Hi hadoop devs,

I'm implementing a custom input format and want to understand how to make
use of data locality.

AFAIU, only FileInputFormat makes use of data locality, since the job
tracker schedules tasks based on the block locations defined in the file
input split.

So the job tracker code is partly responsible for this, and providing data
locality for a custom input format would mean either extending
FileInputFormat or modifying the job tracker code (if that even makes sense).

Is my understanding correct?

-- 
Regards,

Tharindu

blog: http://mackiemathew.com/


Re: Data locality for a custom input format

2011-11-12 Thread Harsh J
Tharindu,

InputSplit#getLocations(), i.e.,
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/InputSplit.html#getLocations()
is used to decide the locality of a task. Your custom InputFormat needs
to return the right set of InputSplit objects from getSplits(). The # of
splits == # of map tasks, and each split's locations array (the names of
the nodes where its data is local) gets used by the scheduler for local
assignment.

For a FileSplit, this is as easy as passing the block locations
obtained from the NameNode. For other types of splits, you need to
fill in the locations yourself.
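
A minimal sketch of what "filling in the locations yourself" could look
like for a non-file source. Everything here is illustrative: ShardSplit,
LocalitySketch, the shard names and the hostnames are hypothetical. So
that it compiles without Hadoop on the classpath, it declares a local
stand-in for org.apache.hadoop.mapreduce.InputSplit with the same two
methods; in a real job you would extend the Hadoop class, put the loop
in your InputFormat's getSplits(), and typically also implement Writable
so the split can be serialized.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for org.apache.hadoop.mapreduce.InputSplit (assumption: used only
// so this sketch has no Hadoop dependency). The real class declares the same
// two methods, which additionally throw IOException and InterruptedException.
abstract class InputSplit {
    abstract long getLength();
    abstract String[] getLocations();
}

// Hypothetical split for a store partitioned into shards, where each shard
// is served by one or more known hosts.
class ShardSplit extends InputSplit {
    private final String shardId;
    private final long length;
    private final String[] hosts;

    ShardSplit(String shardId, long length, String[] hosts) {
        this.shardId = shardId;
        this.length = length;
        this.hosts = hosts.clone();
    }

    String getShardId() { return shardId; }

    @Override
    long getLength() { return length; }

    // Names of the nodes where this split's data is local; the scheduler
    // treats them as placement hints for the corresponding map task.
    @Override
    String[] getLocations() { return hosts.clone(); }
}

class LocalitySketch {
    // Plays the role of InputFormat#getSplits: one split, and therefore one
    // map task, per shard, each carrying the hosts that hold that shard.
    static List<InputSplit> getSplits(Map<String, String[]> shardHosts, long shardLength) {
        List<InputSplit> splits = new ArrayList<>();
        for (Map.Entry<String, String[]> e : shardHosts.entrySet()) {
            splits.add(new ShardSplit(e.getKey(), shardLength, e.getValue()));
        }
        return splits;
    }

    public static void main(String[] args) {
        Map<String, String[]> shardHosts = new LinkedHashMap<>();
        shardHosts.put("shard-0", new String[] {"node1.example.com", "node2.example.com"});
        shardHosts.put("shard-1", new String[] {"node3.example.com"});
        for (InputSplit split : getSplits(shardHosts, 64L * 1024 * 1024)) {
            System.out.println(String.join(",", split.getLocations()));
        }
    }
}
```

The point of the sketch is that locality comes entirely from what
getLocations() returns, so neither FileInputFormat nor the job tracker
code needs to be touched.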

-- 
Harsh J


Re: Data locality for a custom input format

2011-11-12 Thread Michel Segel
There's this article on InfoQ that deals with this issue... ;-)

http://www.infoq.com/articles/HadoopInputFormat

Mike Segel
