[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling
     [ https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron T. Myers updated HDFS-3672:
---------------------------------

       Resolution: Fixed
    Fix Version/s: 2.2.0-alpha
     Hadoop Flags: Reviewed
           Status: Resolved  (was: Patch Available)

I've just committed this to trunk and branch-2.

Thanks a lot for the contribution, Andrew, and thanks a lot to Suresh, Arun, et al. for the discussion.

> Expose disk-location information for blocks to enable better scheduling
> -----------------------------------------------------------------------
>
>                 Key: HDFS-3672
>                 URL: https://issues.apache.org/jira/browse/HDFS-3672
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.0.0-alpha
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>             Fix For: 2.2.0-alpha
>
>         Attachments: design-doc-v1.pdf, design-doc-v2.pdf,
> hdfs-3672-10.patch, hdfs-3672-11.patch, hdfs-3672-12.patch,
> hdfs-3672-1.patch, hdfs-3672-2.patch, hdfs-3672-3.patch, hdfs-3672-4.patch,
> hdfs-3672-5.patch, hdfs-3672-6.patch, hdfs-3672-7.patch, hdfs-3672-8.patch,
> hdfs-3672-9.patch
>
>
> Currently, HDFS exposes on which datanodes a block resides, which allows
> clients to make scheduling decisions for locality and load balancing.
> Extending this to also expose on which disk on a datanode a block resides
> would enable even better scheduling, on a per-disk rather than coarse
> per-datanode basis.
> This API would likely look similar to FileSystem#getFileBlockLocations, but
> also involve a series of RPCs to the responsible datanodes to determine disk
> ids.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling
Andrew Wang updated HDFS-3672:
------------------------------

    Attachment: hdfs-3672-12.patch

My bad, I should have followed that hostname change a little more closely. The patch now passes the conf parameter down so that the RPC threads properly obey it.
[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling
Andrew Wang updated HDFS-3672:
------------------------------

    Attachment: hdfs-3672-11.patch

Rebase try 2. A lesson about compile testing after a rebase has been learned.
[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling
Andrew Wang updated HDFS-3672:
------------------------------

    Attachment: hdfs-3672-10.patch

Rebase patch on trunk.
[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling
Andrew Wang updated HDFS-3672:
------------------------------

    Attachment: hdfs-3672-9.patch
[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling
Andrew Wang updated HDFS-3672:
------------------------------

    Attachment: design-doc-v2.pdf
[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling
Andrew Wang updated HDFS-3672:
------------------------------

    Attachment: hdfs-3672-8.patch

Nit addressed.
[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling
Andrew Wang updated HDFS-3672:
------------------------------

    Attachment: hdfs-3672-7.patch
[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling
Andrew Wang updated HDFS-3672:
------------------------------

    Attachment: hdfs-3672-6.patch

Thanks for the detailed review, ATM; I tried to address all your comments.

I broke out the huge DFSClient method into a few smaller ones, which are still a bit large but logically sound. I can try to go further with this, but it'll mean passing more state in parameters.

The config option I added ("dfs.client.file-block-locations.enabled") is off by default and is checked client-side only. I could add the check to the DN side too if we want to be really sure.
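The default-off, client-side check described in the hdfs-3672-6.patch comment could look roughly like the sketch below. Only the property name comes from the comment; `DiskLocationGate`, its methods, and the use of `java.util.Properties` as a stand-in for Hadoop's configuration object are assumptions made for illustration, not the code in the patch.

```java
import java.util.Properties;

/** Hypothetical sketch of a default-off feature gate checked on the client. */
final class DiskLocationGate {
    /** Property name quoted in the patch comment. */
    static final String ENABLED_KEY = "dfs.client.file-block-locations.enabled";

    /** Returns true only when the option is explicitly set to "true". */
    static boolean isEnabled(Properties conf) {
        // A missing key falls back to "false", which is what "default off" means here.
        return Boolean.parseBoolean(conf.getProperty(ENABLED_KEY, "false"));
    }

    /** Fails fast before any datanode RPC would be issued. */
    static void checkEnabled(Properties conf) {
        if (!isEnabled(conf)) {
            throw new UnsupportedOperationException(
                "Disk-location queries are disabled; set " + ENABLED_KEY + "=true to enable them");
        }
    }
}
```

Because the check lives only in the client, a datanode would still answer a client that skips it, which is the reason the comment offers to repeat the check on the DN side.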
[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling
Andrew Wang updated HDFS-3672:
------------------------------

    Attachment: hdfs-3672-5.patch

Another patch rev, basically just doing stylistic cleanups. I haven't heard any code-related feedback in a while, so I haven't changed any class names or added any conf options.

I've tried to satisfy all the comments thus far, and I would really like to get this in soon if possible. Happy to listen to any further feedback about what I can do to make this happen.
[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling
Andrew Wang updated HDFS-3672:
------------------------------

    Attachment: design-doc-v1.pdf

Attaching a design doc that details the use cases and tries to plot out the future direction. Happy to expand on anything unclear. Overall, I feel like there's strong interest in the API from multiple parties (the unnamed Cloudera customer, HBase, MR), and fairly clear potential performance improvements.

I'd appreciate any advice on making it crystal clear to downstream users that this is an unstable API. We've already got the appropriate annotations, and I could also make it require a config option before doing anything useful (which I think satisfies "default off"). Any other suggestions?
[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling
Andrew Wang updated HDFS-3672:
------------------------------

    Attachment: hdfs-3672-4.patch

Fixed the findbugs warnings. I also parallelized the DN RPCs with Callables and a thread pool.
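The parallelization mentioned in the hdfs-3672-4.patch comment (one `Callable` per datanode, run on a thread pool) can be sketched with plain `java.util.concurrent`. The class `ParallelDatanodeQuery`, the `queryAll` signature, and the choice to map a failed or timed-out datanode to `null` are all assumptions for illustration; the patch itself may handle failures differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Hypothetical sketch: fan out one query per datanode and collect the answers. */
final class ParallelDatanodeQuery {

    /**
     * Runs queryOneDatanode for every datanode on a fixed-size thread pool.
     * A datanode whose query fails or times out maps to null, so one slow
     * or dead node does not fail the whole request.
     */
    static <R> Map<String, R> queryAll(List<String> datanodes,
                                       Function<String, R> queryOneDatanode,
                                       int poolSize,
                                       long timeoutMillis) {
        Map<String, R> results = new LinkedHashMap<>();
        if (datanodes.isEmpty()) {
            return results;
        }
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Callable<R>> calls = new ArrayList<>();
            for (String dn : datanodes) {
                calls.add(() -> queryOneDatanode.apply(dn));
            }
            // invokeAll returns futures in the same order as the submitted tasks
            // and cancels any task still running when the timeout expires.
            List<Future<R>> futures =
                pool.invokeAll(calls, timeoutMillis, TimeUnit.MILLISECONDS);
            for (int i = 0; i < futures.size(); i++) {
                R value = null;
                try {
                    value = futures.get(i).get();
                } catch (ExecutionException | CancellationException e) {
                    // The query threw or was cancelled on timeout: leave null.
                }
                results.put(datanodes.get(i), value);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        } finally {
            pool.shutdownNow();
        }
        return results;
    }
}
```

Compared with the serial setup, query, teardown per datanode described in the first patch, the wall-clock cost here is bounded by the slowest datanode (or the timeout) rather than the sum over all datanodes.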
[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling
Andrew Wang updated HDFS-3672:
------------------------------

    Status: Patch Available  (was: Open)
[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling
Andrew Wang updated HDFS-3672:
------------------------------

    Attachment: hdfs-3672-3.patch

Nuked the method from {{FileSystem}}, and did some small cleanups. I also added {{InterfaceStability.Unstable}} annotations on the new classes in hadoop-common.

Still open to renaming suggestions, if desired. I'd like to keep them named {{.*BlockLocation}} for consistency, because they are subclasses of {{BlockLocation}}. Perhaps {{HdfsBlockLocation}} for the {{DistributedFileSystem}} API, and {{LocatedBlockLocation}} for the internal {{LocatedBlock}} wrapper?
[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling
Andrew Wang updated HDFS-3672:
------------------------------

    Attachment: hdfs-3672-2.patch

Newer version of the patch, addressing Todd and Tom's comments.

One unfortunate bit is that to split the NN and DN RPCs, I needed to add a subclass of {{BlockLocation}} that hides a corresponding {{LocatedBlock}}. An array of these {{HdfsBlockLocation}} objects is now returned by {{DFS#getFileBlockLocations}}, and downcast in {{DFSClient#getDiskBlockLocations}} to retrieve the {{LocatedBlock}}.

I took Todd's advice about using {{Integer.MAX_VALUE}} to denote invalid blocks, but turn it into a boolean accessible via {{DiskId#isValid}} before it's shown to consumers of {{FileSystem}}.

Finally, I already renamed things to {{DiskBlockLocation}} based on Tom's comment, and reused the name {{HdfsBlockLocation}} for my {{LocatedBlock}} wrapper class. I can re-rename both of these if we don't like them.
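Two ideas from the hdfs-3672-2.patch comment lend themselves to a small sketch: a subclass that carries a hidden payload and is recovered by downcasting, and an `Integer.MAX_VALUE` sentinel that consumers only ever see as a boolean. Every class below (`BlockLocationSketch`, `WrappedBlockLocationSketch`, `DiskIdSketch`, `Downcasts`) is a hypothetical stand-in, and a plain `long` block ID stands in for the hidden `LocatedBlock`; none of this is the patch's actual code.

```java
/** Stand-in for the generic, filesystem-independent block location type. */
class BlockLocationSketch {
    final String[] hosts;

    BlockLocationSketch(String[] hosts) {
        this.hosts = hosts;
    }
}

/** Subclass that hides an HDFS-specific payload behind the generic type. */
final class WrappedBlockLocationSketch extends BlockLocationSketch {
    final long blockId; // stands in for the hidden LocatedBlock

    WrappedBlockLocationSketch(String[] hosts, long blockId) {
        super(hosts);
        this.blockId = blockId;
    }
}

/** Disk identifier whose "invalid" sentinel never leaks to callers as a number. */
final class DiskIdSketch {
    private static final int INVALID_INDEX = Integer.MAX_VALUE;
    private final int index;

    DiskIdSketch(int index) {
        this.index = index;
    }

    static DiskIdSketch invalid() {
        return new DiskIdSketch(INVALID_INDEX);
    }

    /** Callers test validity instead of comparing against the sentinel themselves. */
    boolean isValid() {
        return index != INVALID_INDEX;
    }
}

final class Downcasts {
    /** Recovers the hidden payloads; rejects locations that are not the wrapper type. */
    static long[] blockIdsOf(BlockLocationSketch[] locations) {
        long[] ids = new long[locations.length];
        for (int i = 0; i < locations.length; i++) {
            if (!(locations[i] instanceof WrappedBlockLocationSketch)) {
                throw new IllegalArgumentException(
                    "location " + i + " was not produced by the wrapping API");
            }
            ids[i] = ((WrappedBlockLocationSketch) locations[i]).blockId;
        }
        return ids;
    }
}
```

The explicit `instanceof` check turns a misuse (passing locations obtained from some other filesystem) into a descriptive error rather than a bare `ClassCastException`.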
[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling
Andrew Wang updated HDFS-3672:
------------------------------

    Attachment: hdfs-3672-1.patch

First hack at this. I still want to add some more tests, but I think the design is about right.

This essentially provides the same API as {{DFS#getFileBlockLocations}}, except it returns a subclass of {{BlockLocation}}, {{HdfsBlockLocation}}, which has an additional array of {{byte}}s that serves as an opaque identifier specifying on which disk on a datanode the block resides. Currently, this ID is mapped to the index of the HDFS data directory containing the block file (e.g. /data/1, /data/2). It can thus change across reboots and config changes, and clients need to be prepared to requery anyway, since blocks do move around as part of normal operation.

I'd like to perhaps split the new {{DFS#getFileHdfsBlockLocations}} function into a call to {{DFS#getFileBlockLocations}} to do the NN query to get block locations, and then pass these to some other call ({{DFS#getDiskIds}}?), since this would let you do multiple calls to {{DFS#getFileBlockLocations}} and then do one series of RPCs to the datanodes. But I need to figure out how to change the {{BlockLocation[]}} back into a {{LocatedBlock[]}}.

It might also be nice to do the DN RPCs in parallel, since right now it's a serial setup, query, teardown for each DN.
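To illustrate how a client might use the opaque per-disk identifier described in the hdfs-3672-1.patch comment, here is a minimal sketch. The comment only says that the ID is a byte array currently mapped to the data-directory index; the 4-byte big-endian encoding, the `OpaqueDiskId` and `DiskLoad` classes, and the `replicasPerDisk` helper are assumptions introduced for the example.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical value type: an opaque per-disk identifier that clients only compare. */
final class OpaqueDiskId {
    private final byte[] id;

    OpaqueDiskId(byte[] id) {
        this.id = id.clone(); // defensive copy keeps the value immutable
    }

    /** Assumed encoding: the data-directory index as 4 big-endian bytes. */
    static OpaqueDiskId fromDataDirIndex(int index) {
        return new OpaqueDiskId(ByteBuffer.allocate(4).putInt(index).array());
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof OpaqueDiskId
            && Arrays.equals(id, ((OpaqueDiskId) other).id);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(id);
    }
}

/** Hypothetical helper: counts how many block replicas sit on each disk of one datanode. */
final class DiskLoad {
    static Map<OpaqueDiskId, Integer> replicasPerDisk(List<OpaqueDiskId> replicaDisks) {
        Map<OpaqueDiskId, Integer> counts = new LinkedHashMap<>();
        for (OpaqueDiskId disk : replicaDisks) {
            counts.merge(disk, 1, Integer::sum);
        }
        return counts;
    }
}
```

A scheduler that treats the ID purely as a map key, as above, keeps working if the underlying mapping changes, which matters because the comment warns that IDs can change across reboots and config changes.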