[jira] [Commented] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2012-02-05 Thread Matt Foley (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201059#comment-13201059
 ] 

Matt Foley commented on HDFS-2379:
--

Thanks, Suresh!

> 0.20: Allow block reports to proceed without holding FSDataset lock
> ---
>
> Key: HDFS-2379
> URL: https://issues.apache.org/jira/browse/HDFS-2379
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 1.1.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 1.1.0, 1.0.1
>
> Attachments: hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt, 
> hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt
>
>
> As disks are getting larger and more plentiful, we're seeing DNs with 
> multiple millions of blocks on a single machine. When page cache space is 
> tight, block reports can take multiple minutes to generate. Currently, during 
> the scanning of the data directories to generate a report, the FSVolumeSet 
> lock is held. This causes writes and reads to block, timeout, etc, causing 
> big problems especially for clients like HBase.
> This JIRA is to explore some of the ideas originally discussed in HADOOP-4584 
> for the 0.20.20x series.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2012-02-03 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200026#comment-13200026
 ] 

Todd Lipcon commented on HDFS-2379:
---

I'm pretty slammed with work on the HA branch at the moment, so it 
would probably be a couple of weeks before I could get to this, sorry.

> 0.20: Allow block reports to proceed without holding FSDataset lock
> ---
>
> Key: HDFS-2379
> URL: https://issues.apache.org/jira/browse/HDFS-2379
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 1.1.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 1.1.0
>
> Attachments: hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt, 
> hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt





[jira] [Commented] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2011-11-01 Thread Suresh Srinivas (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141905#comment-13141905
 ] 

Suresh Srinivas commented on HDFS-2379:
---

Change looks good. +1.

> 0.20: Allow block reports to proceed without holding FSDataset lock
> ---
>
> Key: HDFS-2379
> URL: https://issues.apache.org/jira/browse/HDFS-2379
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.20.206.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Attachments: hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt, 
> hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt





[jira] [Commented] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2011-10-25 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135264#comment-13135264
 ] 

Todd Lipcon commented on HDFS-2379:
---

Suresh: do you have any further comments on this or can I commit pending 
test-patch results?

> 0.20: Allow block reports to proceed without holding FSDataset lock
> ---
>
> Key: HDFS-2379
> URL: https://issues.apache.org/jira/browse/HDFS-2379
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.20.206.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Attachments: hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt, 
> hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt





[jira] [Commented] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2011-10-11 Thread Suresh Srinivas (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125428#comment-13125428
 ] 

Suresh Srinivas commented on HDFS-2379:
---

bq. Why do you choose to notifyAll when requested is set to true, but not when 
scan is set to null or requested is set to false?
Ignore this comment.

> 0.20: Allow block reports to proceed without holding FSDataset lock
> ---
>
> Key: HDFS-2379
> URL: https://issues.apache.org/jira/browse/HDFS-2379
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.20.206.0
>Reporter: Todd Lipcon
>Priority: Critical
> Attachments: hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt, 
> hdfs-2379.txt, hdfs-2379.txt





[jira] [Commented] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2011-10-11 Thread Suresh Srinivas (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125402#comment-13125402
 ] 

Suresh Srinivas commented on HDFS-2379:
---

FSDatasetInterface.java
# getBlockReport() javadoc is unnecessary.
# minor: "Request that an block report" -> "Request that a "
# retrieveAsyncBlockReport - javadoc is not very clear. Also, the change to the 
javadoc of getBlockReport() is not necessary.

FSDataset.java
# Indentation of {{String metaPart = ...}} could be better.
# Why do you want to deprecate #getBlockInfo()? If you have a valid reason, can 
you please add information on the new method/mechanism that should be used 
instead of the deprecated method.
# Make asyncBlockReport final.
# Why do you choose to notifyAll when requested is set to true, but not when 
scan is set to null or requested is set to false?
# AsyncBlockReport#run - Why are you sleeping for 2 seconds on catching 
Throwable?
# (!requested || scan != null) is more readable as !(requested && scan == 
null)

Datanode.java
# Optional - This might be a good time to move some of the block report code 
into a separate method, outside offerService().
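
The readability point about the wait condition is a De Morgan rewrite, so the two forms should behave identically. A minimal sketch that checks this exhaustively follows; the class and method names are hypothetical, and the boolean {{scanPresent}} stands in for {{scan != null}} from the patch under review.

```java
/** Hypothetical helper; only the two boolean expressions come from the review comment. */
final class WaitConditionCheck {
    private WaitConditionCheck() {}

    /** The form used in the patch: (!requested || scan != null). */
    static boolean original(boolean requested, boolean scanPresent) {
        return !requested || scanPresent;
    }

    /** The suggested form: !(requested && scan == null). */
    static boolean rewritten(boolean requested, boolean scanPresent) {
        return !(requested && !scanPresent);
    }

    /** Compares both forms over all four input combinations. */
    static boolean equivalentForAllInputs() {
        boolean[] values = {false, true};
        for (boolean requested : values) {
            for (boolean scanPresent : values) {
                if (original(requested, scanPresent) != rewritten(requested, scanPresent)) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

Since the forms are equivalent, the choice between them is purely about which reads more naturally next to the surrounding wait loop.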



> 0.20: Allow block reports to proceed without holding FSDataset lock
> ---
>
> Key: HDFS-2379
> URL: https://issues.apache.org/jira/browse/HDFS-2379
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.20.206.0
>Reporter: Todd Lipcon
>Priority: Critical
> Attachments: hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt, 
> hdfs-2379.txt, hdfs-2379.txt





[jira] [Commented] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2011-10-11 Thread Suresh Srinivas (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125383#comment-13125383
 ] 

Suresh Srinivas commented on HDFS-2379:
---

I was reviewing the older version of the patch (thanks to the default sort 
order for attachments); will post comments soon.

> 0.20: Allow block reports to proceed without holding FSDataset lock
> ---
>
> Key: HDFS-2379
> URL: https://issues.apache.org/jira/browse/HDFS-2379
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.20.206.0
>Reporter: Todd Lipcon
>Priority: Critical
> Attachments: hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt, 
> hdfs-2379.txt, hdfs-2379.txt





[jira] [Commented] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2011-10-11 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125314#comment-13125314
 ] 

Todd Lipcon commented on HDFS-2379:
---

bq. In the new version of the patch, FSDatasetInterface.java changes are 
missing. Also asynchronous scan thread change is missing as well. Want to make 
sure that it is intentional.

Not sure what you mean - I see them there.

bq. There are some lines that are more than 80 chars.
Fixed

bq. Why do you want to deprecate #getBlockInfo()? If you have a valid reason, 
can you please add information on the new method/mechanism that should be used 
instead of the deprecated method.
These methods were left only for the sake of the sanity-check code path. But 
given the below comment, I've removed both the sanity check code path and the 
getBlockInfo method, since that's the only spot it was used. (and it was 
private)

bq. SANITY_CHECK code can be removed.
Fixed

bq. What happens to cases when volumeMap contains block but scanned block File 
does not exist or scanned block file exists but volumeMap does not contain it?

The goal of this JIRA is to preserve the existing semantics - i.e., to produce 
a block report identical to the one that would have been produced if the whole 
scan had happened under the lock. So:

- If the block is in memory, but not on disk, the block is not reported. Note 
that we re-check the existence on disk when we see this situation, to make sure 
it wasn't just that the block was added after the scan. This code path handles 
the case where an administrator accidentally rm -Rfs some blocks - we want to 
make sure they don't show up in the block report, so that the NN can 
re-replicate.
- If the block is on disk, but not in memory, we do report it, but only after 
checking that it's still there (with the lock held).
I've updated the comments in the code to clarify the above behaviors.

In the above cases, it might make some sense to actually update the in-memory 
map based on what was seen on disk. But, that would change the semantics, which 
would be harder to verify.

bq. In the end, the scanned block info is made to look same as the in memory 
state. I am just wondering, what is the need of the scan then?
The scan is made to look the same as the disk state. In any place where we 
see a diff vs memory, we then _recheck_ the disk for those blocks while holding 
the lock. So the semantics should be the same as before.

Will upload another patch momentarily with the above fixes. I'll also run 
through a basic manual test scenario of rm -Rfing some blocks and making sure 
they get re-replicated.

> 0.20: Allow block reports to proceed without holding FSDataset lock
> ---
>
> Key: HDFS-2379
> URL: https://issues.apache.org/jira/browse/HDFS-2379
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.20.206.0
>Reporter: Todd Lipcon
>Priority: Critical
> Attachments: hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt, 
> hdfs-2379.txt





[jira] [Commented] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2011-10-11 Thread Suresh Srinivas (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125193#comment-13125193
 ] 

Suresh Srinivas commented on HDFS-2379:
---

Comments:
# In the new version of the patch, FSDatasetInterface.java changes are missing. 
Also asynchronous scan thread change is missing as well. Want to make sure that 
it is intentional.
#* There are some lines that are more than 80 chars.
# FSDataset.java
#* Why do you want to deprecate #getBlockInfo()? If you have a valid reason, 
can you please add information on the new method/mechanism that should be used 
instead of the deprecated method.
#* Add javadoc to scanBlockFilesInconsistent() - add info about why it is not 
synchronized.
#* SANITY_CHECK code can be removed.
#* reconcileInconsistentDiskScan
#** What happens to cases when volumeMap contains block but scanned block File 
does not exist or scanned block file exists but volumeMap does not contain it?
#** In the end, the scanned block info is made to look same as the in memory 
state. I am just wondering, what is the need of the scan then?


> 0.20: Allow block reports to proceed without holding FSDataset lock
> ---
>
> Key: HDFS-2379
> URL: https://issues.apache.org/jira/browse/HDFS-2379
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.20.206.0
>Reporter: Todd Lipcon
>Priority: Critical
> Attachments: hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt, 
> hdfs-2379.txt





[jira] [Commented] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2011-10-09 Thread Suresh Srinivas (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123835#comment-13123835
 ] 

Suresh Srinivas commented on HDFS-2379:
---

Todd, this patch does not apply to the latest 20-security branch.

> 0.20: Allow block reports to proceed without holding FSDataset lock
> ---
>
> Key: HDFS-2379
> URL: https://issues.apache.org/jira/browse/HDFS-2379
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.20.206.0
>Reporter: Todd Lipcon
>Priority: Critical
> Attachments: hdfs-2379.txt, hdfs-2379.txt





[jira] [Commented] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2011-10-03 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119409#comment-13119409
 ] 

Todd Lipcon commented on HDFS-2379:
---

I've been testing this latest patch for several days now, with terasorts, 
teragens, etc, and no issues to be seen. I tested both with a small number of 
blocks per node, and with a large number per node (by adding a few TB with 
256KB block size)

> 0.20: Allow block reports to proceed without holding FSDataset lock
> ---
>
> Key: HDFS-2379
> URL: https://issues.apache.org/jira/browse/HDFS-2379
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.20.206.0
>Reporter: Todd Lipcon
>Priority: Critical
> Attachments: hdfs-2379.txt, hdfs-2379.txt





[jira] [Commented] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2011-09-29 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117073#comment-13117073
 ] 

Todd Lipcon commented on HDFS-2379:
---

We have some customers who have lots of small blocks (unfortunately they don't 
make good use of HAR). So, a single drive may have 400k+ blocks. When there's a 
lot of page cache pressure and the dentry/inode caches get pushed out, we're 
seeing it take several minutes per drive to do the scan. I've been 
experimenting with tuning /proc/sys/vm/vfs_cache_pressure which seems to help 
some, but even still it's taking many seconds when under lots of load. (eg in 
the middle of a terasort)

It was a little tricky to get right, but this patch includes a "sanity check" 
mode which I used to catch several bugs. I think given that, today, we don't 
even properly synchronize it, the chance that this introduces more bugs is low. 
Still, I'm running some continuous cluster tests with this patch -- HBase write 
workloads with block report interval 90s. This shuffles through a lot of blocks 
quickly and helped me find some issues while working on the patch.

> 0.20: Allow block reports to proceed without holding FSDataset lock
> ---
>
> Key: HDFS-2379
> URL: https://issues.apache.org/jira/browse/HDFS-2379
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.20.206.0
>Reporter: Todd Lipcon
>Priority: Critical
> Attachments: hdfs-2379.txt, hdfs-2379.txt





[jira] [Commented] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2011-09-28 Thread Hairong Kuang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117029#comment-13117029
 ] 

Hairong Kuang commented on HDFS-2379:
-

Our HDFS cluster scans disks in parallel when generating block reports. For a 
24 TB node with 12 disks, its block report could be generated in about 1 
second. Would this help eliminate SocketTimeoutException? I like your idea. The 
trick is that it is hard to get it right.

> 0.20: Allow block reports to proceed without holding FSDataset lock
> ---
>
> Key: HDFS-2379
> URL: https://issues.apache.org/jira/browse/HDFS-2379
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.20.206.0
>Reporter: Todd Lipcon
>Priority: Critical
> Attachments: hdfs-2379.txt, hdfs-2379.txt





[jira] [Commented] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2011-09-28 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116304#comment-13116304
 ] 

Todd Lipcon commented on HDFS-2379:
---

As discussed in the above-referenced JIRA, I think we can do something like the 
following pseudocode:

{code}
// Scan without the lock; ignore any file-not-founds we get due to
// concurrent FS modifications
Set blocksFoundByScan = inconsistentScanVolume();
synchronized (volume) {
  Set missingFromScan = Sets.difference(volumeMap.keySet(), blocksFoundByScan);
  Set missingFromMem = Sets.difference(blocksFoundByScan, volumeMap.keySet());
  for (Block b : missingFromScan) { // block is in memory but not in scan
    if (b exists on disk) {
      // it got added after we scanned that part of the tree!
      add it to block report
    }
  }
  for (Block b : missingFromMem) { // block was on disk but not in memory
    if (b no longer exists on disk) {
      // remove from block report - it was deleted after we scanned that part
    }
  }
}
{code}

Anyone see a reason why this wouldn't work? Basically the idea is to do a 
"rough sketch" scan first, then anywhere we detect inconsistency, we touch it 
up, while holding the lock.
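
To make the "rough sketch, then touch up" idea concrete, here is a small self-contained Java version of the reconciliation step. It is an illustration under stated assumptions, not the code from the attached patch: blocks are represented by plain {{Long}} IDs, the in-memory map is passed in as a set (standing in for {{volumeMap.keySet()}}), the disk re-check is an injected predicate, and the class and parameter names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch of the reconcile step; names are not taken from the patch. */
final class BlockReportReconciler {
    private BlockReportReconciler() {}

    /**
     * @param scanned      block IDs found by the scan performed without the lock
     * @param inMemory     block IDs currently in the in-memory map
     * @param existsOnDisk re-check of a single block on disk
     * @param lock         the lock that guards the in-memory map
     * @return the block IDs to report
     */
    static Set<Long> reconcile(Set<Long> scanned, Set<Long> inMemory,
                               Predicate<Long> existsOnDisk, Object lock) {
        // Start from what the unlocked scan saw.
        Set<Long> report = new HashSet<>(scanned);
        synchronized (lock) {
            // In memory but not in the scan: report only if it is on disk now,
            // i.e. it was added after that part of the tree was scanned.
            Set<Long> missingFromScan = new HashSet<>(inMemory);
            missingFromScan.removeAll(scanned);
            for (Long b : missingFromScan) {
                if (existsOnDisk.test(b)) {
                    report.add(b);
                }
            }
            // In the scan but not in memory: drop it if it has since been deleted.
            Set<Long> missingFromMem = new HashSet<>(scanned);
            missingFromMem.removeAll(inMemory);
            for (Long b : missingFromMem) {
                if (!existsOnDisk.test(b)) {
                    report.remove(b);
                }
            }
        }
        return report;
    }
}
```

For example, with a scan result of {1, 2, 3}, an in-memory set of {2, 3, 4, 5}, and a disk that currently holds {2, 3, 4}, the report comes out as {2, 3, 4}: block 1 was deleted after the scan, block 4 was added after the scan, and block 5 is in memory but missing from disk. Only the blocks in the two difference sets are re-checked under the lock, which is what keeps the locked section short.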

> 0.20: Allow block reports to proceed without holding FSDataset lock
> ---
>
> Key: HDFS-2379
> URL: https://issues.apache.org/jira/browse/HDFS-2379
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.20.206.0
>Reporter: Todd Lipcon
>Priority: Critical
