[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling

2012-08-17 Thread Aaron T. Myers (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron T. Myers updated HDFS-3672:
-

   Resolution: Fixed
Fix Version/s: 2.2.0-alpha
 Hadoop Flags: Reviewed
       Status: Resolved  (was: Patch Available)

I've just committed this to trunk and branch-2. Thanks a lot for the 
contribution, Andrew, and thanks a lot to Suresh, Arun, et al for the 
discussion.

> Expose disk-location information for blocks to enable better scheduling
> ---
>
> Key: HDFS-3672
> URL: https://issues.apache.org/jira/browse/HDFS-3672
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 2.0.0-alpha
> Reporter: Andrew Wang
> Assignee: Andrew Wang
> Fix For: 2.2.0-alpha
>
> Attachments: design-doc-v1.pdf, design-doc-v2.pdf, 
> hdfs-3672-10.patch, hdfs-3672-11.patch, hdfs-3672-12.patch, 
> hdfs-3672-1.patch, hdfs-3672-2.patch, hdfs-3672-3.patch, hdfs-3672-4.patch, 
> hdfs-3672-5.patch, hdfs-3672-6.patch, hdfs-3672-7.patch, hdfs-3672-8.patch, 
> hdfs-3672-9.patch
>
>
> Currently, HDFS exposes the datanodes on which a block resides, which allows 
> clients to make scheduling decisions for locality and load balancing. 
> Extending this to also expose the disk on which a block resides on each 
> datanode would enable even better scheduling, on a per-disk rather than a 
> coarse per-datanode basis.
> This API would likely look similar to FileSystem#getFileBlockLocations, but 
> would also involve a series of RPCs to the responsible datanodes to determine 
> disk ids.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling

2012-08-16 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-3672:
--

Attachment: hdfs-3672-12.patch

My bad, I should have followed that hostname change a little more closely. The 
patch now passes the conf parameter down so that the RPC threads properly obey 
it.

> Expose disk-location information for blocks to enable better scheduling
> ---
>
> Key: HDFS-3672
> URL: https://issues.apache.org/jira/browse/HDFS-3672
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 2.0.0-alpha
> Reporter: Andrew Wang
> Assignee: Andrew Wang
> Attachments: design-doc-v1.pdf, design-doc-v2.pdf, 
> hdfs-3672-10.patch, hdfs-3672-11.patch, hdfs-3672-12.patch, 
> hdfs-3672-1.patch, hdfs-3672-2.patch, hdfs-3672-3.patch, hdfs-3672-4.patch, 
> hdfs-3672-5.patch, hdfs-3672-6.patch, hdfs-3672-7.patch, hdfs-3672-8.patch, 
> hdfs-3672-9.patch




[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling

2012-08-16 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-3672:
--

Attachment: hdfs-3672-11.patch

Rebase try 2. A lesson about compile testing after a rebase has been learned.

> Expose disk-location information for blocks to enable better scheduling
> ---
>
> Key: HDFS-3672
> URL: https://issues.apache.org/jira/browse/HDFS-3672
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 2.0.0-alpha
> Reporter: Andrew Wang
> Assignee: Andrew Wang
> Attachments: design-doc-v1.pdf, design-doc-v2.pdf, 
> hdfs-3672-10.patch, hdfs-3672-11.patch, hdfs-3672-1.patch, hdfs-3672-2.patch, 
> hdfs-3672-3.patch, hdfs-3672-4.patch, hdfs-3672-5.patch, hdfs-3672-6.patch, 
> hdfs-3672-7.patch, hdfs-3672-8.patch, hdfs-3672-9.patch




[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling

2012-08-16 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-3672:
--

Attachment: hdfs-3672-10.patch

Rebase patch on trunk.

> Expose disk-location information for blocks to enable better scheduling
> ---
>
> Key: HDFS-3672
> URL: https://issues.apache.org/jira/browse/HDFS-3672
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 2.0.0-alpha
> Reporter: Andrew Wang
> Assignee: Andrew Wang
> Attachments: design-doc-v1.pdf, design-doc-v2.pdf, 
> hdfs-3672-10.patch, hdfs-3672-1.patch, hdfs-3672-2.patch, hdfs-3672-3.patch, 
> hdfs-3672-4.patch, hdfs-3672-5.patch, hdfs-3672-6.patch, hdfs-3672-7.patch, 
> hdfs-3672-8.patch, hdfs-3672-9.patch




[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling

2012-08-13 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-3672:
--

Attachment: hdfs-3672-9.patch

> Expose disk-location information for blocks to enable better scheduling
> ---
>
> Key: HDFS-3672
> URL: https://issues.apache.org/jira/browse/HDFS-3672
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 2.0.0-alpha
> Reporter: Andrew Wang
> Assignee: Andrew Wang
> Attachments: design-doc-v1.pdf, design-doc-v2.pdf, hdfs-3672-1.patch, 
> hdfs-3672-2.patch, hdfs-3672-3.patch, hdfs-3672-4.patch, hdfs-3672-5.patch, 
> hdfs-3672-6.patch, hdfs-3672-7.patch, hdfs-3672-8.patch, hdfs-3672-9.patch




[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling

2012-08-08 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-3672:
--

Attachment: design-doc-v2.pdf

> Expose disk-location information for blocks to enable better scheduling
> ---
>
> Key: HDFS-3672
> URL: https://issues.apache.org/jira/browse/HDFS-3672
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 2.0.0-alpha
> Reporter: Andrew Wang
> Assignee: Andrew Wang
> Attachments: design-doc-v1.pdf, design-doc-v2.pdf, hdfs-3672-1.patch, 
> hdfs-3672-2.patch, hdfs-3672-3.patch, hdfs-3672-4.patch, hdfs-3672-5.patch, 
> hdfs-3672-6.patch, hdfs-3672-7.patch, hdfs-3672-8.patch




[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling

2012-08-07 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-3672:
--

Attachment: hdfs-3672-8.patch

Nit addressed.

> Expose disk-location information for blocks to enable better scheduling
> ---
>
> Key: HDFS-3672
> URL: https://issues.apache.org/jira/browse/HDFS-3672
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 2.0.0-alpha
> Reporter: Andrew Wang
> Assignee: Andrew Wang
> Attachments: design-doc-v1.pdf, hdfs-3672-1.patch, hdfs-3672-2.patch, 
> hdfs-3672-3.patch, hdfs-3672-4.patch, hdfs-3672-5.patch, hdfs-3672-6.patch, 
> hdfs-3672-7.patch, hdfs-3672-8.patch




[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling

2012-08-07 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-3672:
--

Attachment: hdfs-3672-7.patch

> Expose disk-location information for blocks to enable better scheduling
> ---
>
> Key: HDFS-3672
> URL: https://issues.apache.org/jira/browse/HDFS-3672
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 2.0.0-alpha
> Reporter: Andrew Wang
> Assignee: Andrew Wang
> Attachments: design-doc-v1.pdf, hdfs-3672-1.patch, hdfs-3672-2.patch, 
> hdfs-3672-3.patch, hdfs-3672-4.patch, hdfs-3672-5.patch, hdfs-3672-6.patch, 
> hdfs-3672-7.patch




[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling

2012-08-06 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-3672:
--

Attachment: hdfs-3672-6.patch

Thanks for the detailed review, ATM. I tried to address all your comments.

I broke the huge DFSClient method out into a few smaller ones, which are still 
a bit large but logically sound. I can try to go further with this, but it'll 
mean passing more state through parameters.

The config option I added ("dfs.client.file-block-locations.enabled") is off 
by default and is checked client-side only. I could add the check to the DN 
side too if we want to be really sure.
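
A minimal sketch of what a client-side, default-off check of this kind could 
look like. It is not the code from the patch: a plain Map stands in for 
Hadoop's Configuration, and the class and method names (DiskLocationGate, 
isEnabled, requireEnabled) are hypothetical. Only the key name is taken from 
the comment above.

```java
import java.util.Map;

/** Hypothetical stand-in for the client-side gate; not the code from the patch. */
final class DiskLocationGate {
    // Key name as quoted in the comment above.
    static final String ENABLED_KEY = "dfs.client.file-block-locations.enabled";
    // "Default off": an absent key is treated as false.
    static final boolean ENABLED_DEFAULT = false;

    /** Returns the configured value, or the default when the key is absent. */
    static boolean isEnabled(Map<String, String> conf) {
        String raw = conf.get(ENABLED_KEY);
        return raw == null ? ENABLED_DEFAULT : Boolean.parseBoolean(raw.trim());
    }

    /** Fails fast on the client, before any datanode RPC would be issued. */
    static void requireEnabled(Map<String, String> conf) {
        if (!isEnabled(conf)) {
            throw new UnsupportedOperationException(
                ENABLED_KEY + " is not enabled in the client configuration");
        }
    }
}
```

Because the check in this sketch runs only on the client, a client with the 
key set to true would still reach the datanodes, which is the gap a DN-side 
check would close.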

> Expose disk-location information for blocks to enable better scheduling
> ---
>
> Key: HDFS-3672
> URL: https://issues.apache.org/jira/browse/HDFS-3672
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 2.0.0-alpha
> Reporter: Andrew Wang
> Assignee: Andrew Wang
> Attachments: design-doc-v1.pdf, hdfs-3672-1.patch, hdfs-3672-2.patch, 
> hdfs-3672-3.patch, hdfs-3672-4.patch, hdfs-3672-5.patch, hdfs-3672-6.patch




[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling

2012-08-03 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-3672:
--

Attachment: hdfs-3672-5.patch

Another patch rev, basically just doing stylistic cleanups. I haven't heard any 
code-related feedback in a while, so I haven't changed any class names or added 
any conf options.

I've tried to satisfy all the comments thus far, and I would really like to get 
this in soon if possible. Happy to listen to any further feedback about what I 
can do to make this happen.

> Expose disk-location information for blocks to enable better scheduling
> ---
>
> Key: HDFS-3672
> URL: https://issues.apache.org/jira/browse/HDFS-3672
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 2.0.0-alpha
> Reporter: Andrew Wang
> Assignee: Andrew Wang
> Attachments: design-doc-v1.pdf, hdfs-3672-1.patch, hdfs-3672-2.patch, 
> hdfs-3672-3.patch, hdfs-3672-4.patch, hdfs-3672-5.patch




[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling

2012-08-01 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-3672:
--

Attachment: design-doc-v1.pdf

Attaching a design doc detailing the use cases, and trying to plot out the 
future direction. Happy to expand on anything unclear.

Overall, I feel like there's strong interest in the API from multiple parties 
(the unnamed Cloudera customer, HBase, MR), and fairly clear potential 
performance improvements. I'd appreciate any advice on making it crystal clear 
to downstream users that this is an unstable API. We've already got the 
appropriate annotations, and I could also make it require a config option 
before doing anything useful (which I think satisfies "default off"). Any other 
suggestions?

> Expose disk-location information for blocks to enable better scheduling
> ---
>
> Key: HDFS-3672
> URL: https://issues.apache.org/jira/browse/HDFS-3672
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 2.0.0-alpha
> Reporter: Andrew Wang
> Assignee: Andrew Wang
> Attachments: design-doc-v1.pdf, hdfs-3672-1.patch, hdfs-3672-2.patch, 
> hdfs-3672-3.patch, hdfs-3672-4.patch




[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling

2012-07-31 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-3672:
--

Attachment: hdfs-3672-4.patch

Fixed the findbugs warnings. I also parallelized the DN RPCs using Callables 
and a thread pool.
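
A minimal sketch of the general pattern named above: one Callable per 
datanode, run on a fixed-size thread pool. The patch's actual threading code 
is not shown in this thread, so everything here is an assumption: the class 
and method names (ParallelDatanodeQuery, queryAll) are hypothetical, and a 
Function parameter stands in for the real datanode RPC.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper; the rpc function stands in for a real datanode call. */
final class ParallelDatanodeQuery {
    /**
     * Runs one query per datanode on a fixed-size pool and collects the results.
     * A datanode whose query throws is mapped to null instead of failing the batch.
     */
    static <R> Map<String, R> queryAll(List<String> datanodes,
                                       Function<String, R> rpc,
                                       int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Callable<R>> calls = new ArrayList<>();
            for (String dn : datanodes) {
                calls.add(() -> rpc.apply(dn));
            }
            // invokeAll blocks until every task has completed and returns the
            // futures in the same order as the submitted callables.
            List<Future<R>> futures = pool.invokeAll(calls);
            Map<String, R> results = new LinkedHashMap<>();
            for (int i = 0; i < datanodes.size(); i++) {
                try {
                    results.put(datanodes.get(i), futures.get(i).get());
                } catch (ExecutionException e) {
                    // This datanode failed; record "no answer" and keep going.
                    results.put(datanodes.get(i), null);
                }
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for datanodes", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Mapping a failed datanode to null is a design choice of this sketch, made so 
that one unreachable node does not discard the answers from the others.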

> Expose disk-location information for blocks to enable better scheduling
> ---
>
> Key: HDFS-3672
> URL: https://issues.apache.org/jira/browse/HDFS-3672
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 2.0.0-alpha
> Reporter: Andrew Wang
> Assignee: Andrew Wang
> Attachments: hdfs-3672-1.patch, hdfs-3672-2.patch, hdfs-3672-3.patch, 
> hdfs-3672-4.patch




[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling

2012-07-30 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-3672:
--

Status: Patch Available  (was: Open)

> Expose disk-location information for blocks to enable better scheduling
> ---
>
> Key: HDFS-3672
> URL: https://issues.apache.org/jira/browse/HDFS-3672
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 2.0.0-alpha
> Reporter: Andrew Wang
> Assignee: Andrew Wang
> Attachments: hdfs-3672-1.patch, hdfs-3672-2.patch, hdfs-3672-3.patch




[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling

2012-07-26 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-3672:
--

Attachment: hdfs-3672-3.patch

Nuked the method from {{FileSystem}}, and did some small cleanups. I also added 
{{InterfaceStability.Unstable}} annotations on the new classes in hadoop-common.

Still open to renaming suggestions, if desired. I'd like to keep them named 
{{.*BlockLocation}} for consistency, because they are subclasses of 
{{BlockLocation}}.

Perhaps {{HdfsBlockLocation}} for the {{DistributedFileSystem}} API, and 
{{LocatedBlockLocation}} for the internal {{LocatedBlock}} wrapper?

> Expose disk-location information for blocks to enable better scheduling
> ---
>
> Key: HDFS-3672
> URL: https://issues.apache.org/jira/browse/HDFS-3672
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 2.0.0-alpha
>Reporter: Andrew Wang
>Assignee: Andrew Wang
> Attachments: hdfs-3672-1.patch, hdfs-3672-2.patch, hdfs-3672-3.patch




[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling

2012-07-25 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-3672:
--

Attachment: hdfs-3672-2.patch

Newer version of the patch, addressing Todd and Tom's comments.

One unfortunate bit is that to split the NN and DN RPCs, I needed to add a 
subclass of {{BlockLocation}} that hides a corresponding {{LocatedBlock}}. An 
array of these {{HdfsBlockLocation}} objects is now returned by 
{{DFS#getFileBlockLocations}}, and downcast in 
{{DFSClient#getDiskBlockLocations}} to retrieve the {{LocatedBlock}}.
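
A minimal sketch of the wrap-and-downcast pattern described above, using 
stand-in types rather than the real Hadoop classes. All names here are 
hypothetical: PublicLocation plays the role of {{BlockLocation}}, 
InternalBlock the role of {{LocatedBlock}}, and WrappingLocation the role of 
the new subclass. The fields are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for the public location type: only what callers are meant to see. */
class PublicLocation {
    final String[] hosts;
    final long offset;
    final long length;

    PublicLocation(String[] hosts, long offset, long length) {
        this.hosts = hosts;
        this.offset = offset;
        this.length = length;
    }
}

/** Stand-in for the internal block type needed for the datanode RPCs. */
final class InternalBlock {
    final long blockId;

    InternalBlock(long blockId) {
        this.blockId = blockId;
    }
}

/** Subclass that carries the internal block alongside the public fields. */
final class WrappingLocation extends PublicLocation {
    private final InternalBlock block;

    WrappingLocation(InternalBlock block, String[] hosts, long offset, long length) {
        super(hosts, offset, length);
        this.block = block;
    }

    InternalBlock getBlock() {
        return block;
    }
}

final class LocationUnwrapper {
    /** Downcasts each element and recovers the hidden internal blocks. */
    static List<InternalBlock> unwrap(PublicLocation[] locations) {
        List<InternalBlock> blocks = new ArrayList<>();
        for (PublicLocation loc : locations) {
            if (!(loc instanceof WrappingLocation)) {
                String type = (loc == null) ? "null" : loc.getClass().getName();
                throw new IllegalArgumentException(
                    "expected a location produced by the wrapping client, got " + type);
            }
            blocks.add(((WrappingLocation) loc).getBlock());
        }
        return blocks;
    }
}
```

The instanceof check makes the cost of the downcast visible: an array built 
by any other code path is rejected at runtime rather than at compile time.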

I took Todd's advice about using {{Integer.MAX_VALUE}} to denote invalid 
blocks, but turn it into a boolean accessible via {{DiskId#isValid}} before 
it's shown to consumers of {{FileSystem}}.
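
A minimal sketch of hiding a sentinel behind a boolean, under the assumption 
that the raw value arrives as an int. SketchDiskId, fromWire, and index are 
hypothetical names; only {{Integer.MAX_VALUE}} as the marker and the idea of 
an isValid accessor come from the comment above.

```java
/** Hypothetical sketch: an internal sentinel exposed to callers only as a boolean. */
final class SketchDiskId {
    // Internal marker meaning "no usable disk id for this block".
    static final int INVALID_ON_WIRE = Integer.MAX_VALUE;

    private final int index;
    private final boolean valid;

    private SketchDiskId(int index, boolean valid) {
        this.index = index;
        this.valid = valid;
    }

    /** Translates the raw value once, so callers never compare against the sentinel. */
    static SketchDiskId fromWire(int raw) {
        boolean valid = raw != INVALID_ON_WIRE;
        return new SketchDiskId(valid ? raw : -1, valid);
    }

    boolean isValid() {
        return valid;
    }

    /** Returns the index; refuses to hand out a value for an invalid id. */
    int index() {
        if (!valid) {
            throw new IllegalStateException("disk id is not valid");
        }
        return index;
    }
}
```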

Finally, I already renamed things to {{DiskBlockLocation}} based on Tom's 
comment, and reused the name {{HdfsBlockLocation}} for my {{LocatedBlock}} 
wrapper class. I can re-rename both of these if we don't like them.

> Expose disk-location information for blocks to enable better scheduling
> ---
>
> Key: HDFS-3672
> URL: https://issues.apache.org/jira/browse/HDFS-3672
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 2.0.0-alpha
> Reporter: Andrew Wang
> Assignee: Andrew Wang
> Attachments: hdfs-3672-1.patch, hdfs-3672-2.patch




[jira] [Updated] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling

2012-07-19 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-3672:
--

Attachment: hdfs-3672-1.patch

First hack at this. I still want to add some more tests, but I think the design 
is about right.

This essentially provides the same API as {{DFS#getFileBlockLocations}}, except 
that it returns {{HdfsBlockLocation}}, a subclass of {{BlockLocation}} with an 
additional array of {{byte}}s: an opaque identifier specifying which disk on a 
datanode holds the block.

Currently, this ID is mapped to the index of the HDFS data directory containing 
the block file (e.g. /data/1, /data/2). This can thus change across 
reboots/config changes, and clients need to be prepared to requery anyway since 
blocks do move around as part of normal operation.
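
A minimal sketch of the mapping described above: find the index of the data 
directory that contains a block file, then pack it into bytes that clients 
treat as opaque. This is not the datanode code from the patch. 
DataDirIndexSketch, indexOf, and toOpaqueId are hypothetical names, and the 
4-byte big-endian encoding is an assumption made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical sketch of deriving a disk id from the configured data directories. */
final class DataDirIndexSketch {
    /** Returns the index of the first data directory containing blockFile, or -1. */
    static int indexOf(List<Path> dataDirs, Path blockFile) {
        Path normalized = blockFile.normalize();
        for (int i = 0; i < dataDirs.size(); i++) {
            // Path.startsWith compares whole name elements, so /data/10 does
            // not count as being inside /data/1.
            if (normalized.startsWith(dataDirs.get(i).normalize())) {
                return i;
            }
        }
        return -1;
    }

    /** Packs the index into 4 big-endian bytes (ByteBuffer's default byte order). */
    static byte[] toOpaqueId(int index) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(index).array();
    }
}
```

Since the id in this sketch is just a position in the configured list, 
reordering or removing a data directory changes the ids, which matches the 
caveat above about reboots and config changes.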

I'd like to perhaps split the new {{DFS#getFileHdfsBlockLocations}} function 
into a call to {{DFS#getFileBlockLocations}} to do the NN query to get block 
locations, and then pass these to some other call ({{DFS#getDiskIds}}?), since 
this would let you do multiple calls to {{DFS#getFileBlockLocations}} and then 
do one series of RPCs to the datanodes. But, I need to figure out how to change 
the {{BlockLocation[]}} back into a {{LocatedBlock[]}}.
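
A minimal sketch of why the split helps: block locations gathered from 
several NN queries can be grouped by datanode, so that each datanode is 
contacted once for all of its blocks. The types and names here 
(BlockReplicaSketch, DatanodeBatcher, groupByDatanode) are hypothetical 
stand-ins, not the classes from the patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for one block and the datanodes holding its replicas. */
final class BlockReplicaSketch {
    final long blockId;
    final List<String> datanodes;

    BlockReplicaSketch(long blockId, List<String> datanodes) {
        this.blockId = blockId;
        this.datanodes = datanodes;
    }
}

final class DatanodeBatcher {
    /**
     * Groups block ids by datanode. The result has one entry per datanode,
     * which corresponds to one request per datanode in the second phase.
     */
    static Map<String, List<Long>> groupByDatanode(List<BlockReplicaSketch> blocks) {
        Map<String, List<Long>> perDatanode = new LinkedHashMap<>();
        for (BlockReplicaSketch block : blocks) {
            for (String dn : block.datanodes) {
                perDatanode.computeIfAbsent(dn, k -> new ArrayList<>()).add(block.blockId);
            }
        }
        return perDatanode;
    }
}
```

For example, if block 1 is on dn1 and dn2 and block 2 is on dn2 and dn3, the 
grouping yields three requests (dn1: [1], dn2: [1, 2], dn3: [2]) instead of 
one request per replica.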

It might also be nice to do the DN RPCs in parallel, since right now it's 
serial setup, query, teardown for each DN.

> Expose disk-location information for blocks to enable better scheduling
> ---
>
> Key: HDFS-3672
> URL: https://issues.apache.org/jira/browse/HDFS-3672
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 2.0.0-alpha
> Reporter: Andrew Wang
> Assignee: Andrew Wang
> Attachments: hdfs-3672-1.patch