RJ Nowling created HDFS-6116:
--------------------------------

             Summary: RFC: Make getFileBlockLocations part of the public WebHDFS API
                 Key: HDFS-6116
                 URL: https://issues.apache.org/jira/browse/HDFS-6116
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: webhdfs
            Reporter: RJ Nowling


Other projects such as Disco, a MapReduce framework written in Erlang / Python, 
want to support the HDFS file system.  WebHDFS provides a great means of doing 
so, but it does not provide information about data locality as part of the 
public API.  Information about data locality is important for scheduling I/O 
operations and tasks efficiently.

HDFS-2340 added support for getFileBlockLocations, but there is no mention of 
this support in the API documentation.  Comments in the source indicate that 
this is a private API.
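
For reference, a minimal sketch of what a client request for the private call might look like. It assumes the operation is exposed as op=GET_BLOCK_LOCATIONS with optional offset and length parameters, which is my reading of the source and not something the API documentation confirms. The helper name block_locations_url, the host name, and the port are hypothetical.

```python
from urllib.parse import quote, urlencode


def block_locations_url(namenode, path, offset=0, length=None):
    """Build the namenode URL for the (undocumented) block-locations call.

    Assumption: the op name is GET_BLOCK_LOCATIONS and it accepts
    `offset` and `length` query parameters. `namenode` is "host:port".
    """
    params = {"op": "GET_BLOCK_LOCATIONS", "offset": offset}
    if length is not None:
        params["length"] = length
    # quote() leaves "/" intact, so an absolute HDFS path maps directly
    # onto the URL path under /webhdfs/v1.
    return "http://{}/webhdfs/v1{}?{}".format(
        namenode, quote(path), urlencode(params)
    )


# Example with a hypothetical namenode address and file path.
print(block_locations_url("nn.example.com:50070", "/data/part-0000", 0, 1024))
```

Because the call is private, the response format could change between releases, which is part of why I am asking for a public version.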

The WebHDFS API redirects I/O requests to a datanode containing the first 
block of the requested range.  Knowing the block size and file size, this 
behavior can be abused to query data locality information, but doing so 
requires a separate request to the namenode for each block, which adds 
unnecessary overhead.
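
To make the cost of that workaround concrete, here is a rough sketch of how a client could do it. It assumes the file size and block size are already known (for example from op=GETFILESTATUS), that the namenode answers op=OPEN with a 307 redirect whose Location header names a datanode, and that security is off. The function names are hypothetical and this is not tested against a real cluster.

```python
import http.client
from urllib.parse import quote, urlencode, urlsplit


def block_offsets(file_size, block_size):
    """Return the starting byte offset of every block in the file."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return list(range(0, file_size, block_size))


def probe_datanode(namenode_host, namenode_port, path, offset):
    """Send op=OPEN at `offset` to the namenode and return the host it
    redirects to. http.client does not follow redirects, so no file
    data is transferred."""
    query = urlencode({"op": "OPEN", "offset": offset})
    conn = http.client.HTTPConnection(namenode_host, namenode_port, timeout=10)
    try:
        conn.request("GET", "/webhdfs/v1" + quote(path) + "?" + query)
        response = conn.getresponse()
        response.read()  # drain the (empty) redirect body
        status = response.status
        location = response.getheader("Location")
    finally:
        conn.close()
    if status != 307 or location is None:
        raise RuntimeError("expected a 307 redirect, got %d" % status)
    return urlsplit(location).hostname


def locate_blocks(namenode_host, namenode_port, path, file_size, block_size):
    """One namenode round trip per block: this is the overhead at issue."""
    return [
        (offset, probe_datanode(namenode_host, namenode_port, path, offset))
        for offset in block_offsets(file_size, block_size)
    ]


# A 300-byte file with 128-byte blocks needs three separate probes.
print(block_offsets(300, 128))
```

Each probe also reveals only the one datanode the namenode happened to pick, not the full replica list, so the result is less complete than what getFileBlockLocations returns in a single call.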

Thoughts:
1) Why is getFileBlockLocations private?  
2) If there is no good reason, can we make it public?
3) If there are problems that keep it private, can we design an API that could 
be used by external users to more efficiently handle data locality issues?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
