[jira] [Updated] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2012-02-05 Thread Matt Foley (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Foley updated HDFS-2379:
-

Affects Version/s: 1.0.0 (was: 1.1.0)
Fix Version/s: (was: 1.1.0)

 0.20: Allow block reports to proceed without holding FSDataset lock
 ---

 Key: HDFS-2379
 URL: https://issues.apache.org/jira/browse/HDFS-2379
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 1.0.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Critical
 Fix For: 1.0.1

 Attachments: hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt, 
 hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt


 As disks are getting larger and more plentiful, we're seeing DNs with 
 multiple millions of blocks on a single machine. When page cache space is 
 tight, block reports can take multiple minutes to generate. Currently, during 
 the scanning of the data directories to generate a report, the FSVolumeSet 
 lock is held. This causes writes and reads to block, timeout, etc, causing 
 big problems especially for clients like HBase.
 This JIRA is to explore some of the ideas originally discussed in HADOOP-4584 
 for the 0.20.20x series.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2011-10-17 Thread Todd Lipcon (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-2379:
--

Attachment: hdfs-2379.txt

bq. getBlockReport() javadoc is unnecessary
bq. minor: Request that an block report -> Request that a 
bq. Make asyncBlockReport final.
bq. (!requested || scan != null) is better readable as !(requested && scan == null)
Fixed

bq. Indentation of String metaPart = ... could be better
Fixed - also added a constant for METADATA_EXTENSION_LENGTH instead of the bare 
constant 5.
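
To illustrate the named-constant change, here is a minimal sketch, not the patch itself. It assumes the metadata extension is ".meta" (hence the bare 5); the class name MetaNameSketch and the helper stripMetadataExtension are hypothetical.

```java
/** Hypothetical helper; only METADATA_EXTENSION_LENGTH is named in the comment above. */
class MetaNameSketch {
  // Assumption: block metadata files end in ".meta".
  static final String METADATA_EXTENSION = ".meta";
  // Derived from the extension so the two cannot drift apart.
  static final int METADATA_EXTENSION_LENGTH = METADATA_EXTENSION.length();

  /** Returns the name without the metadata extension, or null if it is not a meta file. */
  static String stripMetadataExtension(String fileName) {
    if (!fileName.endsWith(METADATA_EXTENSION)) {
      return null;
    }
    return fileName.substring(0, fileName.length() - METADATA_EXTENSION_LENGTH);
  }
}
```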

bq. Why do you want to deprecate #getBlockInfo()? If you have a valid reason, 
can you please add information on the new method/mechanism that should be used 
instead of the deprecated method.
Answered this in the comment above. It's now removed.

bq. Why are you sleeping for 2 seconds on catching Throwable?
Added this comment:
+// Avoid busy-looping in the case that we have entered some invalid
+// state -- don't want to flood the error log with exceptions.
(my experience in other parts of the DN was that these types of busy loops 
caused big problems)
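
The idea behind the sleep can be sketched as follows. This is an illustration under assumptions, not the DataNode code: the class BackoffLoopSketch, the method runLoop, and its parameters are hypothetical, and the real loop runs until shutdown rather than for a fixed number of iterations.

```java
/** Hypothetical service loop that backs off after a failure instead of spinning. */
class BackoffLoopSketch {
  /** Runs body up to `iterations` times and returns how many iterations failed. */
  static int runLoop(Runnable body, int iterations, long backoffMs) {
    int failures = 0;
    for (int i = 0; i < iterations; i++) {
      try {
        body.run();
      } catch (Throwable t) {
        failures++;
        System.err.println("loop iteration failed: " + t);
        // Avoid busy-looping if we are stuck in an invalid state;
        // otherwise the error log fills with the same exception.
        try {
          Thread.sleep(backoffMs);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
    return failures;
  }
}
```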

bq. Optional - This might be a good time to move some of the block report 
code into a separate method, outside offerService().
I agree it would be a nice cleanup, but wanted to keep this change minimal.


 0.20: Allow block reports to proceed without holding FSDataset lock
 ---

 Key: HDFS-2379
 URL: https://issues.apache.org/jira/browse/HDFS-2379
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.206.0
Reporter: Todd Lipcon
Priority: Critical
 Attachments: hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt, 
 hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt







[jira] [Updated] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2011-10-11 Thread Todd Lipcon (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-2379:
--

Attachment: hdfs-2379.txt

Updated patch. I ran a manual test with 4 datanodes where I inserted a few 
hundred blocks, then on one of the datanodes did rm blk_* in one of the data 
directories. I had configured the block report interval to 10 seconds, and on 
the next block report, the NN counted these blocks as under-replicated. After a 
minute or two, everything was fully replicated again.

 0.20: Allow block reports to proceed without holding FSDataset lock
 ---

 Key: HDFS-2379
 URL: https://issues.apache.org/jira/browse/HDFS-2379
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.206.0
Reporter: Todd Lipcon
Priority: Critical
 Attachments: hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt, 
 hdfs-2379.txt, hdfs-2379.txt







[jira] [Updated] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2011-10-10 Thread Todd Lipcon (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-2379:
--

Attachment: hdfs-2379.txt

Suresh pointed out an error I made in resolving conflicts in the previous 
patch. This patch is the same except that DataNode needs to call 
retrieveAsyncBlockReport, rather than getBlockReport.

 0.20: Allow block reports to proceed without holding FSDataset lock
 ---

 Key: HDFS-2379
 URL: https://issues.apache.org/jira/browse/HDFS-2379
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.206.0
Reporter: Todd Lipcon
Priority: Critical
 Attachments: hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt, 
 hdfs-2379.txt







[jira] [Updated] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2011-10-09 Thread Todd Lipcon (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-2379:
--

Attachment: hdfs-2379.txt

Sorry about that. Here's a patch against current branch-0.20-security.

 0.20: Allow block reports to proceed without holding FSDataset lock
 ---

 Key: HDFS-2379
 URL: https://issues.apache.org/jira/browse/HDFS-2379
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.206.0
Reporter: Todd Lipcon
Priority: Critical
 Attachments: hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt







[jira] [Updated] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2011-09-28 Thread Todd Lipcon (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-2379:
--

Attachment: hdfs-2379.txt

I made this kinda-complicated patch against 20-security, and then realized 
that, in fact, today's current behavior isn't even really synchronized. It's 
synchronized only on FSVolumeSet, which only blocks block allocation (since 
getNextVolume is synchronized). It doesn't block finalization, invalidation, 
etc, while scanning. So maybe we can simply remove the synchronized keyword 
on getBlockInfo?

I'm testing this more complicated one on a cluster now, running teragen with 
256KB blocks, and block report interval dropped to 90s.

 0.20: Allow block reports to proceed without holding FSDataset lock
 ---

 Key: HDFS-2379
 URL: https://issues.apache.org/jira/browse/HDFS-2379
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.206.0
Reporter: Todd Lipcon
Priority: Critical
 Attachments: hdfs-2379.txt







[jira] [Updated] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2011-09-28 Thread Todd Lipcon (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-2379:
--

Attachment: hdfs-2379.txt

This is an expanded patch which does the following:

- introduces a method in FSDataset called roughBlockScan() which scans the data 
directories but takes no locks and guarantees no consistency. Another method, 
reconcileRoughBlockScan, is used to diff it against the in-memory state and 
find any cases where concurrent modifications need to be accounted for.

- introduces a simple inner class which calls roughBlockScan in a thread, and 
then allows a caller to poll for when the results are ready

- the DN heartbeat loop now triggers a block report when the block report 
interval is expired, but doesn't block on the report being ready. Instead, it 
just triggers the thread to start calculating the rough report. Once the 
rough report is ready, it finishes the report by reconciling it while holding 
the lock.
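
The three pieces above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the attached patch. Only roughBlockScan and reconcileRoughBlockScan are names from this comment; the classes RoughScanSketch and AsyncScanSketch, their other methods, and the use of a set of block IDs in place of real data directories are all hypothetical. The reconcile rule shown (drop blocks that vanished from memory, add blocks that appeared and exist on disk) is one plausible reading of "diff it against the in-memory state".

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for FSDataset; block IDs replace real block files. */
class RoughScanSketch {
  // Stand-in for the data directories; thread-safe so it can be read without the lock.
  private final Set<Long> onDisk = ConcurrentHashMap.newKeySet();
  // In-memory block map, guarded by the dataset lock (this object's monitor).
  private final Set<Long> inMemory = new HashSet<>();

  synchronized void addBlock(long id) {
    onDisk.add(id);
    inMemory.add(id);
  }

  synchronized void invalidateBlock(long id) {
    inMemory.remove(id);
    onDisk.remove(id);
  }

  /** Scans the "disk" without taking the dataset lock; the result may be stale. */
  Set<Long> roughBlockScan() {
    return new HashSet<>(onDisk);
  }

  /** Holds the lock only long enough to correct the rough result. */
  synchronized Set<Long> reconcileRoughBlockScan(Set<Long> rough) {
    Set<Long> report = new HashSet<>(rough);
    // Seen by the scan but gone from memory: invalidated while scanning.
    report.retainAll(inMemory);
    // In memory but missed by the scan: added while scanning.
    for (Long id : inMemory) {
      if (!rough.contains(id) && onDisk.contains(id)) {
        report.add(id);
      }
    }
    return report;
  }
}

/** Hypothetical helper that runs the rough scan in a thread and lets callers poll. */
class AsyncScanSketch {
  private final RoughScanSketch dataset;
  private volatile Set<Long> scan; // null until the background scan finishes

  AsyncScanSketch(RoughScanSketch dataset) {
    this.dataset = dataset;
  }

  /** Starts a background scan and returns immediately. */
  Thread request() {
    scan = null;
    Thread t = new Thread(() -> { scan = dataset.roughBlockScan(); }, "rough-scan");
    t.setDaemon(true);
    t.start();
    return t;
  }

  boolean isReady() {
    return scan != null;
  }

  /** Reconciles the finished scan under the dataset lock. */
  Set<Long> retrieve() {
    Set<Long> rough = scan;
    if (rough == null) {
      throw new IllegalStateException("scan not ready");
    }
    return dataset.reconcileRoughBlockScan(rough);
  }
}
```

In this sketch a heartbeat loop would call request() when the report interval expires, keep sending heartbeats, and call retrieve() on a later iteration once isReady() returns true, so the lock is held only for the reconcile step.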

I've been testing this by doing concurrent terasort + hbase on a cluster where 
I prepopulated with a few hundred thousand small blocks per node (using teragen 
with 256KB block size). I've set the block report interval to 90 seconds. With 
no load, the block reports take some 15 seconds to generate. With heavy MR 
load, they take up to 8-10 minutes. Without the patch, I had lots of 
SocketTimeoutExceptions, etc - with the patch, the block reports proceed in the 
background and the MR job succeeds without failed tasks.

 0.20: Allow block reports to proceed without holding FSDataset lock
 ---

 Key: HDFS-2379
 URL: https://issues.apache.org/jira/browse/HDFS-2379
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.206.0
Reporter: Todd Lipcon
Priority: Critical
 Attachments: hdfs-2379.txt, hdfs-2379.txt


