RJ Nowling created HDFS-6116:
--------------------------------

             Summary: RFC: Make getFileBlockLocations part of the public WebHDFS API
                 Key: HDFS-6116
                 URL: https://issues.apache.org/jira/browse/HDFS-6116
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: webhdfs
            Reporter: RJ Nowling

Other projects such as Disco, a MapReduce framework written in Erlang / Python, want to support the HDFS file system. WebHDFS provides a great means of doing so, but it does not expose information about data locality as part of the public API. Information about data locality is important for scheduling I/O operations and tasks efficiently.

HDFS-2340 added support for getFileBlockLocations, but the API documentation does not mention it. Comments in the source indicate that this is a private API.

The WebHDFS API redirects I/O requests to the datanode containing the first block of the request. Given the block size and file size, a client can abuse this behavior to query data locality information, but doing so requires multiple requests to the namenode, which adds unnecessary overhead.

Thoughts:

1) Why is getFileBlockLocations private?
2) If there is no good reason, can we make it public?
3) If there are problems that keep it private, can we design an API that external users could use to handle data locality more efficiently?
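
To make the redirect workaround concrete, here is a minimal Python sketch of how a client might probe locality today. It assumes an unsecured cluster, that the file length and block size were already fetched (for example with op=GETFILESTATUS), and that the namenode answers op=OPEN with a redirect whose Location header names a datanode. The function names and the host, port, and path in the usage note are hypothetical and chosen only for illustration.

```python
import http.client
from urllib.parse import quote, urlencode, urlsplit


def block_offsets(file_length, block_size):
    """Return the starting byte offset of every block in the file."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return list(range(0, file_length, block_size))


def probe_path(hdfs_path, offset):
    """Build the request path for a one-byte OPEN at the given offset."""
    query = urlencode({"op": "OPEN", "offset": offset, "length": 1})
    return "/webhdfs/v1" + quote(hdfs_path) + "?" + query


def locate_blocks(namenode_host, namenode_port, hdfs_path, file_length, block_size):
    """Map each block offset to the datanode host named in the redirect.

    Issues one namenode request per block, which is exactly the overhead
    described above. http.client does not follow redirects, so the
    Location header can be read directly.
    """
    locations = {}
    for offset in block_offsets(file_length, block_size):
        conn = http.client.HTTPConnection(namenode_host, namenode_port, timeout=10)
        try:
            conn.request("GET", probe_path(hdfs_path, offset))
            response = conn.getresponse()
            response.read()  # drain the (normally empty) redirect body
            redirect = response.getheader("Location")
        finally:
            conn.close()
        # Assumption: the redirect target is the datanode holding this block.
        locations[offset] = urlsplit(redirect).hostname if redirect else None
    return locations
```

A call such as locate_blocks("namenode.example.com", 50070, "/user/disco/data.txt", length, block_size) would cost one namenode round trip per block and reveal only the single datanode chosen for each redirect, not every replica. A public getFileBlockLocations call could return the same information, and more, in one request.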