RE: HDFS interfaces

John Lilley Tue, 04 Jun 2013 11:36:59 -0700

When you use the HDFS client interface to read a file, it automatically figures 
out which datanodes to contact for reading which blocks.  There isn't really a 
"main" block.  However, I have read that the first location listed for each 
block is the "recommended" one to read for an outside client.  Normally, an 
outside client doesn't need to know this information at all as the HDFS file 
interface takes care of it.  An "inside" application such as MapReduce *does* 
need to know this information so that it can run tasks on nodes that are 
"close" to the data split being processed.  If you are writing a custom 
ApplicationMaster using YARN, you will also want to know this.
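
As a rough illustration of what "close" means when placing a task, here is a 
minimal sketch in plain Java.  It does not use any Hadoop or YARN classes; the 
class PlacementSketch, the method chooseNode, and the host names are all 
hypothetical.  It only shows the preference order: a free node that holds a 
replica of the block first, any other free node otherwise.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Hadoop or YARN.
final class PlacementSketch {

    // replicaHosts: hosts reported as holding the block, in the order listed.
    // freeNodes: worker nodes that can currently accept a task.
    // Returns a node that stores the data if one is free, otherwise the
    // first free node, or null when no node is free.
    static String chooseNode(List<String> replicaHosts, List<String> freeNodes) {
        for (String host : replicaHosts) {
            if (freeNodes.contains(host)) {
                return host; // data-local placement
            }
        }
        // No free node holds the data, so the task must read it remotely.
        return freeNodes.isEmpty() ? null : freeNodes.get(0);
    }

    public static void main(String[] args) {
        List<String> replicas = Arrays.asList("node3", "node7", "node9");
        System.out.println(chooseNode(replicas, Arrays.asList("node1", "node7")));
        System.out.println(chooseNode(replicas, Arrays.asList("node1", "node2")));
    }
}
```

A real scheduler would likely also consider rack-level closeness before 
falling back to an arbitrary node; that step is left out here.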

John

From: Mahmood Naderan [mailto:[email protected]]
Sent: Tuesday, June 04, 2013 12:01 AM
To: [email protected]
Subject: Re: HDFS interfaces

There are many instances of getFileBlockLocations in hadoop/fs. Can you explain 
which one is the main one?
>It must be combined with a method of logically splitting the input data along 
>block boundaries, and of launching tasks on worker nodes that are close to 
>the data splits
Is this a user-level task or a system-level task?

Regards,
Mahmood

________________________________
From: John Lilley <[email protected]>
To: "[email protected]" <[email protected]>; Mahmood Naderan <[email protected]>
Sent: Tuesday, June 4, 2013 3:28 AM
Subject: RE: HDFS interfaces

Mahmood,

It is in the FileSystem interface:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.Path,%20long,%20long)

This by itself is not sufficient for application programmers to make good use 
of data locality.  It must be combined with a method of logically splitting the 
input data along block boundaries, and of launching tasks on worker nodes that 
are close to the data splits.  MapReduce does both of these things internally 
along with the file-format input classes.  For an application to do so 
directly, see the new YARN-based interfaces ApplicationMaster and 
ResourceManager.  These are, however, very new, and there is little 
documentation and few examples.
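
To make the "splitting along block boundaries" step concrete, here is a 
minimal sketch in plain Java.  It is not how MapReduce implements it and it 
uses no Hadoop classes: SplitSketch, Block, and Split are hypothetical names, 
and Block merely stands in for the offset, length, and host list that 
getFileBlockLocations reports for each block.  The 64 MB block size and the 
host names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; no Hadoop classes are used.
final class SplitSketch {

    // Stand-in for one block's offset, length, and replica hosts.
    static final class Block {
        final long offset;
        final long length;
        final List<String> hosts;

        Block(long offset, long length, String... hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = Arrays.asList(hosts);
        }
    }

    // One unit of work: a byte range plus the nodes that hold it locally.
    static final class Split {
        final long start;
        final long length;
        final List<String> preferredHosts;

        Split(long start, long length, List<String> preferredHosts) {
            this.start = start;
            this.length = length;
            this.preferredHosts = preferredHosts;
        }
    }

    // Emits one split per block, so that each split can be read entirely
    // from a single node that stores the block.
    static List<Split> splitsAlongBlocks(List<Block> blocks) {
        List<Split> splits = new ArrayList<>();
        for (Block b : blocks) {
            splits.add(new Split(b.offset, b.length, b.hosts));
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 150 MB file stored in 64 MB blocks: two full blocks and a 22 MB tail.
        List<Block> blocks = Arrays.asList(
                new Block(0, 64 * mb, "node1", "node2", "node3"),
                new Block(64 * mb, 64 * mb, "node2", "node4", "node5"),
                new Block(128 * mb, 22 * mb, "node1", "node4", "node6"));
        for (Split s : splitsAlongBlocks(blocks)) {
            System.out.println(s.start + "+" + s.length + " " + s.preferredHosts);
        }
    }
}
```

The sketch ignores records that straddle a block boundary, which the 
file-format input classes have to deal with in practice.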

john

From: Mahmood Naderan [mailto:[email protected]]
Sent: Monday, June 03, 2013 12:09 PM
To: [email protected]
Subject: HDFS interfaces

Hello,
It is stated in the "HDFS architecture guide" 
(https://hadoop.apache.org/docs/r1.0.4/hdfs_design.html) that

HDFS provides interfaces for applications to move themselves closer to where 
the data is located.

What are these interfaces and where are they in the source code? Is there any 
manual for the interfaces?

Regards,
Mahmood
