[ https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425344#comment-13425344 ]
Todd Lipcon commented on HDFS-3672:
-----------------------------------

bq. I'll ask again since I didn't get a response - wouldn't it make sense to commit this patch to a dev-branch. Use that to prototype changes to either MapReduce or HBase and then merge it in?

There are projects outside of just HBase and MapReduce that would like to run against this, some of which are not Apache projects. As I mentioned above, we have at least one customer who would like to use this feature in their code to get better disk efficiency. They need to run against an actual release, not a dev branch build. This is the primary use case we're targeting right now. I want to be perfectly honest: the HBase/MR examples I gave above are not on our immediate roadmap; they just serve as proof that this isn't a one-off/niche improvement.

The other downside with a dev branch is that it's difficult for downstream OSS projects to integrate against something that's not in a release. HBase already has to build against several different Maven profiles to support 1.0, 0.23, and 2.0. Adding another profile against a dev branch not available in maven is not feasible.

This isn't the first time an API has been added to the trunk code before downstream users exist. For example, FileContext was in Hadoop for somewhere around a year before MR2 started to migrate to it. The "New MR API" is still barely used based on my discussions with users.

If there is sufficient motivation (plus customer demand) for an API, and the API is explicitly marked Unstable, what's the problem with including it? It's entirely new code and has no risk of destabilizing the existing feature set.

I fear that blocking APIs like this from Apache will only serve to fracture the Hadoop user base, pushing us back towards the 0.20-era nightmare of distinct distros with distinct non-overlapping capabilities.

Do you have a technical objection to the new code: for example, a reason why it will destabilize the existing feature set?
> Expose disk-location information for blocks to enable better scheduling
> -----------------------------------------------------------------------
>
>                 Key: HDFS-3672
>                 URL: https://issues.apache.org/jira/browse/HDFS-3672
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.0.0-alpha
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: hdfs-3672-1.patch, hdfs-3672-2.patch, hdfs-3672-3.patch
>
>
> Currently, HDFS exposes on which datanodes a block resides, which allows
> clients to make scheduling decisions for locality and load balancing.
> Extending this to also expose on which disk on a datanode a block resides
> would enable even better scheduling, on a per-disk rather than coarse
> per-datanode basis.
> This API would likely look similar to FileSystem#getFileBlockLocations, but
> also involve a series of RPCs to the responsible datanodes to determine disk
> ids.