[ https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425344#comment-13425344 ]

Todd Lipcon commented on HDFS-3672:
-----------------------------------

bq. I'll ask again since I didn't get a response - wouldn't it make sense to 
commit this patch to a dev-branch. Use that to prototype changes to either 
MapReduce or HBase and then merge it in?

There are projects outside of just HBase and MapReduce that would like to run 
against this, some of which are not Apache projects. As I mentioned above, we 
have at least one customer who would like to use this feature in their code to 
get better disk efficiency. They need to run against an actual release, not a 
dev branch build. This is the primary use case we're targeting right now. I 
want to be perfectly honest: the HBase/MR examples I gave above are not on our 
immediate roadmap; they just serve as proof that this isn't a one-off/niche 
improvement.

The other downside of a dev branch is that it's difficult for downstream OSS 
projects to integrate against something that's not in a release. HBase already 
has to build against several different Maven profiles to support 1.0, 0.23, and 
2.0. Adding yet another profile for a dev branch that isn't published to Maven 
is not feasible.

This isn't the first time an API has been added to trunk before downstream 
users existed. For example, FileContext was in Hadoop for roughly a year before 
MR2 started to migrate to it. The "New MR API" is still barely used, based on 
my discussions with users. If there is sufficient motivation (plus customer 
demand) for an API, and the API is explicitly marked Unstable, what's the 
problem with including it? It's entirely new code and has no risk of 
destabilizing the existing feature set.
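
To make the "marked Unstable" point concrete, here's roughly what the annotations 
look like on a new Hadoop API (the DiskId name is just a placeholder for this 
sketch, not something from the attached patches):

{code:java}
// Rough sketch only: how a new public-but-unstable API is annotated in Hadoop,
// so downstream users know it may still change between releases.
// DiskId is a placeholder name, not the type introduced by the patch.
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

@InterfaceAudience.Public
@InterfaceStability.Unstable
public interface DiskId {
  /** Opaque identifier for the physical disk holding a block replica. */
  String asString();
}
{code}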

I fear that blocking APIs like this from Apache will only serve to fracture the 
Hadoop user base, pushing us back towards the 0.20-era nightmare of distinct 
distros with distinct non-overlapping capabilities.

Do you have a technical objection to the new code, for example a reason why it 
would destabilize the existing feature set?
                
> Expose disk-location information for blocks to enable better scheduling
> -----------------------------------------------------------------------
>
>                 Key: HDFS-3672
>                 URL: https://issues.apache.org/jira/browse/HDFS-3672
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.0.0-alpha
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: hdfs-3672-1.patch, hdfs-3672-2.patch, hdfs-3672-3.patch
>
>
> Currently, HDFS exposes on which datanodes a block resides, which allows 
> clients to make scheduling decisions for locality and load balancing. 
> Extending this to also expose on which disk on a datanode a block resides 
> would enable even better scheduling, on a per-disk rather than coarse 
> per-datanode basis.
> This API would likely look similar to FileSystem#getFileBlockLocations, but 
> also involve a series of RPCs to the responsible datanodes to determine disk 
> IDs.
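
For illustration only, here is a rough sketch of the kind of client-facing type 
such an API might hand back; the BlockDiskLocation name and the diskIds field 
are assumptions for this sketch, not necessarily what the attached patches 
implement:

{code:java}
// Hypothetical sketch: pair the host-level BlockLocation info that HDFS already
// exposes with an opaque disk id per replica, so a scheduler can spread work
// across the disks of a single datanode. All names here are illustrative only.
import org.apache.hadoop.fs.BlockLocation;

public class BlockDiskLocation {
  private final BlockLocation location; // hosts/offset/length from FileSystem#getFileBlockLocations
  private final String[] diskIds;       // diskIds[i] identifies the disk behind location.getHosts()[i]

  public BlockDiskLocation(BlockLocation location, String[] diskIds) {
    this.location = location;
    this.diskIds = diskIds;
  }

  public BlockLocation getLocation() { return location; }

  public String[] getDiskIds() { return diskIds; }
}
{code}

As the description suggests, a client would first call 
FileSystem#getFileBlockLocations for the file range and then issue one RPC per 
responsible datanode to resolve each replica to a disk ID.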
