RJ Nowling created HDFS-6116:
--------------------------------
Summary: RFC: Make getFileBlockLocations part of the public WebHDFS API
Key: HDFS-6116
URL: https://issues.apache.org/jira/browse/HDFS-6116
Project: Hadoop HDFS
Issue Type: Improvement
Components: webhdfs
Reporter: RJ Nowling
Other projects, such as Disco (a MapReduce framework written in Erlang and
Python), want to support HDFS. WebHDFS is a good way to do so, but its public
API does not expose data locality information, which is important for
scheduling I/O operations and tasks efficiently.
HDFS-2340 added support for getFileBlockLocations, but there is no mention of
this support in the API documentation. Comments in the source indicate that
this is a private API.
The WebHDFS API redirects I/O requests to the datanode containing the first
block of the request. Given the block size and file size, a client can abuse
this behavior to discover data locality by issuing one request per block and
noting which datanode each redirect points to, but that requires multiple
requests to the namenode, which adds unnecessary overhead.
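
To make the cost of this workaround concrete, here is a minimal Python sketch of
the redirect-probing approach, assuming the documented WebHDFS OPEN operation
with its offset and length parameters and a redirect carrying a Location
header. The function names, and any host, port, and path values passed in, are
hypothetical and chosen for illustration only; authentication parameters are
omitted, and each redirect names only the single datanode the namenode
happened to pick.

```python
import http.client
from urllib.parse import quote, urlencode, urlsplit


def block_offsets(file_size, block_size):
    """Return the starting byte offset of every block in the file."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return list(range(0, file_size, block_size))


def open_request_path(hdfs_path, offset):
    """Build the path and query of a WebHDFS OPEN request at an offset."""
    query = urlencode({"op": "OPEN", "offset": offset, "length": 1})
    return "/webhdfs/v1" + quote(hdfs_path) + "?" + query


def datanode_from_location(location):
    """Extract the datanode host name from a redirect Location header."""
    return urlsplit(location).hostname


def probe_block_hosts(namenode_host, namenode_port, hdfs_path,
                      file_size, block_size):
    """Map each block offset to the datanode the namenode redirects to.

    This costs one namenode round trip per block, which is the overhead
    described above.
    """
    hosts = {}
    for offset in block_offsets(file_size, block_size):
        conn = http.client.HTTPConnection(namenode_host, namenode_port,
                                          timeout=10)
        try:
            # http.client does not follow redirects, so the redirect
            # response and its Location header are visible here.
            conn.request("GET", open_request_path(hdfs_path, offset))
            response = conn.getresponse()
            response.read()
            location = response.getheader("Location")
        finally:
            conn.close()
        hosts[offset] = datanode_from_location(location) if location else None
    return hosts
```

For a file spanning N blocks, probe_block_hosts sends N requests to the
namenode, whereas a public getFileBlockLocations call could presumably return
the same information, for all replicas, in a single request.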
Thoughts:
1) Why is getFileBlockLocations private?
2) If there is no good reason, can we make it public?
3) If there are problems that keep it private, can we design an API that could
be used by external users to more efficiently handle data locality issues?