[jira] [Commented] (HDFS-1401) TestFileConcurrentReader test case is still timing out / failing
[ https://issues.apache.org/jira/browse/HDFS-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038119#comment-13038119 ]

sam rash commented on HDFS-1401:

see todd's find in: https://issues.apache.org/jira/browse/HDFS-1057

TestFileConcurrentReader test case is still timing out / failing
Key: HDFS-1401
URL: https://issues.apache.org/jira/browse/HDFS-1401
Project: Hadoop HDFS
Issue Type: Sub-task
Components: hdfs client
Affects Versions: 0.22.0
Reporter: Tanping Wang
Priority: Critical
Attachments: HDFS-1401.patch

The unit test case TestFileConcurrentReader, after its most recent fix in HDFS-1310, still times out when using java 1.6.0_07: the test case simply hangs. On the Apache Hudson build (which possibly uses a higher sub-version of java) this test case has produced inconsistent results: it sometimes passes and sometimes fails. For example, there is no effective change between the recent builds 423, 424 and 425, yet the test case failed on build 424 and passed on build 425.

build 424, test failed: https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/424/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/
build 425, test passed: https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/425/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036343#comment-13036343 ]

sam rash commented on HDFS-1057:

if it helps, there is only ever 1 writer + 1 reader in the test. The reader 'tails' by opening and closing the file repeatedly, up to 1000 times (hence exposing socket leaks in the past).

Concurrent readers hit ChecksumExceptions if following a writer to very end of file
Key: HDFS-1057
URL: https://issues.apache.org/jira/browse/HDFS-1057
Project: Hadoop HDFS
Issue Type: Sub-task
Components: data-node
Affects Versions: 0.20-append, 0.21.0, 0.22.0
Reporter: Todd Lipcon
Assignee: sam rash
Priority: Blocker
Fix For: 0.20-append, 0.21.0, 0.22.0
Attachments: HDFS-1057-0.20-append.patch, conurrent-reader-patch-1.txt, conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, hdfs-1057-trunk-3.txt, hdfs-1057-trunk-4.txt, hdfs-1057-trunk-5.txt, hdfs-1057-trunk-6.txt

In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before calling flush(). Therefore, if there is a concurrent reader, it's possible to race here - the reader will see the new length while those bytes are still in the buffers of BlockReceiver. Thus the client will potentially see checksum errors or EOFs. Additionally, the last checksum chunk of the file is made accessible to readers even though it is not stable.
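The ordering bug in the issue description above (publish the length via setBytesOnDisk, then flush) can be sketched in miniature. This is a hedged illustration, not the actual BlockReceiver code: `ReplicaSketch`, `receiveBuggy`, `receiveFixed`, and the in-memory "disk" are all invented stand-ins that only demonstrate why a reader can observe a length the block file does not yet have.

```java
import java.io.ByteArrayOutputStream;

// A ReplicaSketch stands in for one block replica: "disk" plays the
// block file, "buffer" plays BlockReceiver's unflushed output buffer.
class ReplicaSketch {
    private final ByteArrayOutputStream disk = new ByteArrayOutputStream();
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private volatile long bytesOnDisk = 0;  // length a concurrent reader sees

    // Buggy ordering from the issue description: the visible length is
    // published before flush() runs, so a reader arriving now sees bytes
    // it cannot actually read (EOFs / checksum errors on the tail).
    void receiveBuggy(byte[] packet) {
        buffer.write(packet, 0, packet.length);
        bytesOnDisk += packet.length;   // reader observes the new length here
    }                                   // ...while the bytes sit in "buffer"

    // Safe ordering: flush first, then publish the new length.
    void receiveFixed(byte[] packet) {
        buffer.write(packet, 0, packet.length);
        flush();
        bytesOnDisk += packet.length;
    }

    private void flush() {
        byte[] pending = buffer.toByteArray();
        disk.write(pending, 0, pending.length);
        buffer.reset();
    }

    long visibleLength() { return bytesOnDisk; } // length published to readers
    long actualLength()  { return disk.size(); } // bytes a reader can really get
}
```

In the buggy ordering, `visibleLength()` runs ahead of `actualLength()` until the deferred flush happens, which is exactly the race window a tailing reader falls into.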
[jira] [Commented] (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036461#comment-13036461 ]

sam rash commented on HDFS-1057:

todd: thanks for digging into this

Concurrent readers hit ChecksumExceptions if following a writer to very end of file
Key: HDFS-1057 (full issue details are quoted in the first HDFS-1057 message above)
[jira] [Commented] (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035851#comment-13035851 ]

sam rash commented on HDFS-1057:

the last time I debugged the test failure, it exposed a socket/fd leak in a completely unrelated part of the code. The test failing here also has nothing to do with the added feature--because it closes/opens files in rapid succession, it is prone to expose resource leaks. Removing this test (or the feature) won't take away the underlying problem, which should be looked at.

Concurrent readers hit ChecksumExceptions if following a writer to very end of file
Key: HDFS-1057 (full issue details are quoted in the first HDFS-1057 message above)
[jira] [Commented] (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035868#comment-13035868 ]

sam rash commented on HDFS-1057:

the test opens/closes files for read/write; that exposed a slow leak last time. I suggest anyone concerned with resource leaks in hadoop investigate. we don't see the test failure in our open-sourced 0.20 fork.

removing the test is an option; so is coming up with a better one (this was my first hdfs feature + test).

Concurrent readers hit ChecksumExceptions if following a writer to very end of file
Key: HDFS-1057 (full issue details are quoted in the first HDFS-1057 message above)
[jira] [Commented] (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035884#comment-13035884 ]

sam rash commented on HDFS-1057:

i assume a similar problem as before: code that opened RPC proxies to DNs did not close them in a finally block. The test failure output indicates a socket/fd leak (Too many open files).

https://issues.apache.org/jira/browse/HDFS-1310

the test was succeeding 8 months ago, 2010-09-10, so I'd look at commits that came after that.

Concurrent readers hit ChecksumExceptions if following a writer to very end of file
Key: HDFS-1057 (full issue details are quoted in the first HDFS-1057 message above)
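The leak pattern described in this comment (RPC proxies to DNs not closed in a finally block) is the classic close-on-the-happy-path-only bug. The sketch below is hypothetical: `DatanodeProxy`, `getBlockInfo`, and the socket counter are invented stand-ins for the real proxy objects, used only to make the leaked-descriptor count observable.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

class ProxyLeakSketch {
    // Counts "open sockets" so the leak is visible in a test.
    static final AtomicInteger openSockets = new AtomicInteger();

    // Hypothetical stand-in for an RPC proxy to a datanode.
    static class DatanodeProxy implements Closeable {
        DatanodeProxy() { openSockets.incrementAndGet(); }               // socket opened
        void getBlockInfo() throws IOException {
            throw new IOException("simulated DN-side failure");
        }
        @Override public void close() { openSockets.decrementAndGet(); } // socket released
    }

    // Leaky pattern: the RPC throws, close() is skipped, the fd leaks.
    static void queryLeaky() throws IOException {
        DatanodeProxy proxy = new DatanodeProxy();
        proxy.getBlockInfo();
        proxy.close();   // never reached when the RPC throws
    }

    // Fixed pattern: close() sits in a finally block, so every code
    // path, including the exceptional one, releases the socket.
    static void queryFixed() throws IOException {
        DatanodeProxy proxy = new DatanodeProxy();
        try {
            proxy.getBlockInfo();
        } finally {
            proxy.close();
        }
    }
}
```

Run in a tight open/close loop, as TestFileConcurrentReader effectively does, the leaky variant accumulates descriptors until the process hits "Too many open files"; the fixed variant stays flat.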
[jira] [Commented] (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035924#comment-13035924 ]

sam rash commented on HDFS-1057:

Todd: on a different issue, one test in here that looks suspicious is testImmediateReadOfNewFile. It repeatedly opens and closes a file right away, requiring 1k successful opens (at least in our copy).

Concurrent readers hit ChecksumExceptions if following a writer to very end of file
Key: HDFS-1057 (full issue details are quoted in the first HDFS-1057 message above)
[jira] [Commented] (HDFS-941) Datanode xceiver protocol should allow reuse of a connection
[ https://issues.apache.org/jira/browse/HDFS-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021862#comment-13021862 ]

sam rash commented on HDFS-941:

The last failure I saw with this test was basically unrelated to the test itself--it was a socket leak in the datanode, I think with RPCs. I glanced at the first test failure output and found a similar error:

2011-04-11 21:29:36,962 INFO datanode.DataNode (DataXceiver.java:opWriteBlock(458)) - writeBlock blk_-6878114854540472276_1001 received exception java.io.FileNotFoundException: /grid/0/hudson/hudson-slave/workspace/PreCommit-HDFS-Build/trunk/build/test/data/dfs/data/data1/current/rbw/blk_-6878114854540472276_1001.meta (Too many open files)

Note that this test implicitly finds any socket/fd leaks because it opens/closes files repeatedly. If you can check into this, that'd be great. I'll have some more time later this week to help more.

Datanode xceiver protocol should allow reuse of a connection
Key: HDFS-941
URL: https://issues.apache.org/jira/browse/HDFS-941
Project: Hadoop HDFS
Issue Type: Improvement
Components: data-node, hdfs client
Affects Versions: 0.22.0
Reporter: Todd Lipcon
Assignee: bc Wong
Attachments: HDFS-941-1.patch, HDFS-941-2.patch, HDFS-941-3.patch, HDFS-941-3.patch, HDFS-941-4.patch, HDFS-941-5.patch, HDFS-941-6.patch, HDFS-941-6.patch, HDFS-941-6.patch, hdfs941-1.png

Right now each connection into the datanode xceiver only processes one operation. In the case that an operation leaves the stream in a well-defined state (eg a client reads to the end of a block successfully) the same connection could be reused for a second operation. This should improve random read performance significantly.
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001134#comment-13001134 ]

sam rash commented on HDFS-1057:

does this test include the patch: https://issues.apache.org/jira/browse/HADOOP-6907 ?

Concurrent readers hit ChecksumExceptions if following a writer to very end of file
Key: HDFS-1057 (full issue details are quoted in the first HDFS-1057 message above)
[jira] Commented: (HDFS-1403) add -truncate option to fsck
[ https://issues.apache.org/jira/browse/HDFS-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913908#action_12913908 ]

sam rash commented on HDFS-1403:

can you elaborate? also, this truncate option will have to work on open files; I think -list-corruptfiles only works on closed ones. we have to handle the missing last block problem (the main reason I filed this).

add -truncate option to fsck
Key: HDFS-1403
URL: https://issues.apache.org/jira/browse/HDFS-1403
Project: Hadoop HDFS
Issue Type: New Feature
Components: hdfs client, name-node
Reporter: sam rash

When running fsck, it would be useful to be able to tell hdfs to truncate any corrupt file to the last valid position in the latest block. Then, when running hadoop fsck, an admin can clean up the filesystem.

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
[jira] Created: (HDFS-1403) add -truncate option to fsck
add -truncate option to fsck
Key: HDFS-1403
URL: https://issues.apache.org/jira/browse/HDFS-1403
Project: Hadoop HDFS
Issue Type: New Feature
Components: hdfs client, name-node
Reporter: sam rash

When running fsck, it would be useful to be able to tell hdfs to truncate any corrupt file to the last valid position in the latest block. Then, when running hadoop fsck, an admin can clean up the filesystem.
[jira] Commented: (HDFS-1310) TestFileConcurrentReader fails
[ https://issues.apache.org/jira/browse/HDFS-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906912#action_12906912 ]

sam rash commented on HDFS-1310:

my apologies for the delay--i came down with a cold right before the long weekend.

results of test-patch:

[exec] -1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
[exec] Please justify why no new tests are needed for this patch.
[exec] Also please list what manual steps were performed to verify this patch.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
[exec]
[exec] +1 system tests framework. The patch passed system tests framework compile.

TestFileConcurrentReader fails
Key: HDFS-1310
URL: https://issues.apache.org/jira/browse/HDFS-1310
Project: Hadoop HDFS
Issue Type: Test
Affects Versions: 0.22.0
Reporter: Suresh Srinivas
Assignee: sam rash
Attachments: hdfs-1310-1.txt, hdfs-1310-2.txt

For details of test failure see http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/218/testReport/
[jira] Commented: (HDFS-1310) TestFileConcurrentReader fails
[ https://issues.apache.org/jira/browse/HDFS-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906964#action_12906964 ]

sam rash commented on HDFS-1310:

I have not run it all the way through yet. Is it 'test' or 'test-core' these days?

TestFileConcurrentReader fails
Key: HDFS-1310 (full issue details are quoted in the first HDFS-1310 message above)
[jira] Commented: (HDFS-1310) TestFileConcurrentReader fails
[ https://issues.apache.org/jira/browse/HDFS-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906105#action_12906105 ]

sam rash commented on HDFS-1310:

is that just ant test? I'm not familiar with test-patch.

TestFileConcurrentReader fails
Key: HDFS-1310 (full issue details are quoted in the first HDFS-1310 message above)
[jira] Updated: (HDFS-1310) TestFileConcurrentReader fails
[ https://issues.apache.org/jira/browse/HDFS-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sam rash updated HDFS-1310:

Attachment: hdfs-1310-2.txt

create ClientDatanodeProtocol in try{} block so that we don't skip checking additional DNs on an exception

TestFileConcurrentReader fails
Key: HDFS-1310 (full issue details are quoted in the first HDFS-1310 message above)
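The attachment note above describes moving proxy creation inside the per-datanode try block so that one bad DN can't abort the scan of the remaining ones. This is a hedged sketch of that control-flow fix, not the real HDFS code: `openProxyAndGetLength` and `collectLengths` are invented names standing in for "create a ClientDatanodeProtocol proxy and query it".

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class BestNodeSketch {
    // Hypothetical stand-in for creating an RPC proxy to one DN and
    // querying it; proxy creation itself can throw for an unreachable DN.
    static long openProxyAndGetLength(String dn) throws IOException {
        if (dn.startsWith("bad")) {
            throw new IOException("cannot create proxy to " + dn);
        }
        return 1024;  // pretend this DN reports a replica length
    }

    // With creation *inside* the per-DN try block, an exception from one
    // datanode is caught locally and the loop continues to check the
    // additional DNs. Had the proxy been created before the try, the
    // exception would have propagated and skipped the rest of the list.
    static List<Long> collectLengths(List<String> datanodes) {
        List<Long> lengths = new ArrayList<>();
        for (String dn : datanodes) {
            try {
                lengths.add(openProxyAndGetLength(dn));  // create + use in try
            } catch (IOException e) {
                // log and continue with the next datanode
            }
        }
        return lengths;
    }
}
```

With one unreachable DN in a list of three, the loop still collects answers from the other two instead of failing the whole operation.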
[jira] Commented: (HDFS-1310) TestFileConcurrentReader fails
[ https://issues.apache.org/jira/browse/HDFS-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905380#action_12905380 ]

sam rash commented on HDFS-1310:

good point, i'll move the init into the try block

TestFileConcurrentReader fails
Key: HDFS-1310 (full issue details are quoted in the first HDFS-1310 message above)
[jira] Updated: (HDFS-1310) TestFileConcurrentReader fails
[ https://issues.apache.org/jira/browse/HDFS-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sam rash updated HDFS-1310:

Attachment: hdfs-1310-1.txt

Datanode RPC proxy, once created, is now stopped properly

TestFileConcurrentReader fails
Key: HDFS-1310 (full issue details are quoted in the first HDFS-1310 message above)
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904455#action_12904455 ]

sam rash commented on HDFS-1057:

I'm confused--the jira for the test result indicated you had solved the problem. Can you let me know what you need me to do?

Concurrent readers hit ChecksumExceptions if following a writer to very end of file
Key: HDFS-1057 (full issue details are quoted in the first HDFS-1057 message above)
[jira] Assigned: (HDFS-1310) TestFileConcurrentReader fails
[ https://issues.apache.org/jira/browse/HDFS-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sam rash reassigned HDFS-1310:

Assignee: sam rash

TestFileConcurrentReader fails
Key: HDFS-1310 (full issue details are quoted in the first HDFS-1310 message above)
[jira] Commented: (HDFS-1310) TestFileConcurrentReader fails
[ https://issues.apache.org/jira/browse/HDFS-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904472#action_12904472 ]

sam rash commented on HDFS-1310:

sorry for the delay, I skimmed this jira and the last comment contradicted the title, so I assumed it was in ok shape. I will have a minute to look at this in more detail tomorrow night.

TestFileConcurrentReader fails
Key: HDFS-1310 (full issue details are quoted in the first HDFS-1310 message above)
[jira] Commented: (HDFS-1310) TestFileConcurrentReader fails
[ https://issues.apache.org/jira/browse/HDFS-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904506#action_12904506 ]

sam rash commented on HDFS-1310:

actually the 2nd two are likely fallout from the first--if it died and didn't clean up the locks, this could happen. as I noted, I'm a bit short on time tonight, so I'll get to this tomorrow evening.

fwiw, this looks familiar--the 'too many open files' with this unit test. I thought I had already seen this and fixed it, where I simply didn't close a file in a thread...maybe I only patched it in our local branch.

thanks for the direct links to the results

TestFileConcurrentReader fails
Key: HDFS-1310 (full issue details are quoted in the first HDFS-1310 message above)
[jira] Commented: (HDFS-1350) make datanodes do graceful shutdown
[ https://issues.apache.org/jira/browse/HDFS-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901483#action_12901483 ]

sam rash commented on HDFS-1350:

actually sorry, i remember your patch to do this. I think the rev I'm using internally is older--i will check 20-append as well. This problem may not appear in 20-append with all the latest patches.

the question still remains whether there is benefit in making the datanode do a clean shutdown of the DataXceiver threads.

make datanodes do graceful shutdown
Key: HDFS-1350
URL: https://issues.apache.org/jira/browse/HDFS-1350
Project: Hadoop HDFS
Issue Type: Improvement
Components: data-node
Reporter: sam rash
Assignee: sam rash

we found that the Datanode doesn't do a graceful shutdown and a block can be corrupted (data + checksum amounts off). we can make the DN do a graceful shutdown in case there are open files. if this presents a problem to a timely shutdown, we can make it a parameter of how long to wait for the full graceful shutdown before just exiting.
[jira] Commented: (HDFS-1350) make datanodes do graceful shutdown
[ https://issues.apache.org/jira/browse/HDFS-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901527#action_12901527 ]

sam rash commented on HDFS-1350:

actually, it's in our latest branch here, which is >= 20-append and includes your patch. The problem is that getBlockMetaDataInfo() has this at the end:

{code}
// paranoia! verify that the contents of the stored block
// matches the block file on disk.
data.validateBlockMetadata(stored);
{code}

which includes this check:

{code}
if (f.length() > maxDataSize || f.length() <= minDataSize) {
  throw new IOException("Block " + b + " is of size " + f.length() +
      " but has " + (numChunksInMeta + 1) + " checksums and each checksum size is " +
      checksumsize + " bytes.");
}
{code}

a block is not allowed to participate in lease recovery if this fails.

make datanodes do graceful shutdown
Key: HDFS-1350 (full issue details are quoted in the first HDFS-1350 message above)
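The quoted check boils down to a consistency rule between the data-file length and the number of checksums in the meta file. The sketch below is an illustrative reconstruction, not the real validateBlockMetadata: `BYTES_PER_CHECKSUM = 512` is an assumed chunk size (the real value comes from the block's metadata header), and `numFullChunks` plays the role of numChunksInMeta. A block with n full checksum chunks plus a partial last chunk must have a data length strictly greater than n*512 and at most (n+1)*512; anything outside that range means data and checksum files disagree, and the quoted code throws.

```java
class BlockMetaCheck {
    static final long BYTES_PER_CHECKSUM = 512;  // assumed chunk size

    // numFullChunks: complete checksum chunks implied by the meta file;
    // the data file may additionally hold one partial chunk covered by
    // the final checksum.
    static boolean lengthMatchesMeta(long dataLen, long numFullChunks) {
        long minDataSize = numFullChunks * BYTES_PER_CHECKSUM;
        long maxDataSize = minDataSize + BYTES_PER_CHECKSUM;
        // Mirrors the quoted condition: valid iff min < dataLen <= max.
        return dataLen > minDataSize && dataLen <= maxDataSize;
    }
}
```

A datanode killed mid-write can easily leave dataLen outside this window (data flushed but the last checksum not, or vice versa), which is exactly how an ungraceful shutdown produces a replica that this check then bars from lease recovery.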
[jira] Updated: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sam rash updated HDFS-1262:

Attachment: hdfs-1262-5.txt

address todd's comments (except for RPC compatibility--pending discussion)

Failed pipeline creation during append leaves lease hanging on NN
Key: HDFS-1262
URL: https://issues.apache.org/jira/browse/HDFS-1262
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs client, name-node
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: sam rash
Priority: Critical
Fix For: 0.20-append
Attachments: hdfs-1262-1.txt, hdfs-1262-2.txt, hdfs-1262-3.txt, hdfs-1262-4.txt, hdfs-1262-5.txt

Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened was the following:
1) File's original writer died
2) Recovery client tried to open file for append - looped for a minute or so until soft lease expired, then append call initiated recovery
3) Recovery completed successfully
4) Recovery client calls append again, which succeeds on the NN
5) For some reason, the block recovery that happens at the start of append pipeline creation failed on all datanodes 6 times, causing the append() call to throw an exception back to HBase master. HBase assumed the file wasn't open and put it back on a queue to try later
6) Some time later, it tried append again, but the lease was still assigned to the same DFS client, so it wasn't able to recover.

The recovery failure in step 5 is a separate issue, but the problem for this JIRA is that the NN can think it failed to open a file for append when the NN thinks the writer holds a lease. Since the writer keeps renewing its lease, recovery never happens, and no one can open or recover the file until the DFS client shuts down.
[jira] Commented: (HDFS-1350) make datanodes do graceful shutdown
[ https://issues.apache.org/jira/browse/HDFS-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901269#action_12901269 ] sam rash commented on HDFS-1350: I saw the case of a single replica existing that did not have a matching data + checksum length. It was not used, and we lost the block. I need to double-check the code, but the DN exception was that the block was not valid and couldn't be used. It seems to me the logic is simple: take the longest length you can get. It doesn't matter if data and checksum match as far as I can tell (though I think a matching replica is typically longer than a mismatching one). Truncation only happens after the NN picks the length of the blocks. As I said, I think the bug, at least in our patched rev (need to look at stock 20-append), is that mismatching lengths can't participate at all in lease recovery, which seems broken.
[jira] Created: (HDFS-1350) make datanodes do graceful shutdown
make datanodes do graceful shutdown --- Key: HDFS-1350 URL: https://issues.apache.org/jira/browse/HDFS-1350 Project: Hadoop HDFS Issue Type: Improvement Components: data-node Reporter: sam rash Assignee: sam rash we found that the Datanode doesn't do a graceful shutdown and a block can be corrupted (data + checksum amounts off). we can make the DN do a graceful shutdown in case there are open files. if this presents a problem to a timely shutdown, we can make it a parameter of how long to wait for the full graceful shutdown before just exiting
[jira] Commented: (HDFS-1350) make datanodes do graceful shutdown
[ https://issues.apache.org/jira/browse/HDFS-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901129#action_12901129 ] sam rash commented on HDFS-1350: My understanding of how lease recovery works in 20-append is that on cluster restart, an open file will be recovered by the Namenode. Datanodes will send the longest valid length of the block (ie, if there are 8 bytes of checksum and 1500 bytes of data, the valid length is 1024, assuming a 512-byte chunk size). The block is then truncated to a valid length. 20-append seems to have a bug that for any block where the data + checksum lengths don't match, the block isn't used in lease recovery. the work here might be to fix that?
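The worked example above (8 checksum bytes and 1500 data bytes yielding 1024 valid bytes with 512-byte chunks) can be sketched as a small calculation. The 4-byte CRC size and 512-byte chunk size are the usual HDFS defaults; the method is illustrative rather than the datanode's actual code.

```java
// Illustrative calculation (not the datanode's code): the longest
// checksum-verified length of a replica, derived from the data file length
// and the number of checksum bytes in the meta file. Assumes the usual
// defaults: 4-byte CRC32 checksums, each covering a 512-byte chunk.
public class ValidLengthSketch {
    static final int BYTES_PER_CHECKSUM = 512;
    static final int CHECKSUM_SIZE = 4;

    public static long validLength(long dataLen, long checksumBytes) {
        long completeChecksums = checksumBytes / CHECKSUM_SIZE;
        long coveredBytes = completeChecksums * BYTES_PER_CHECKSUM;
        // Only bytes covered by a complete checksum are trustworthy.
        return Math.min(dataLen, coveredBytes);
    }

    public static void main(String[] args) {
        // 8 checksum bytes -> 2 complete checksums -> at most 1024 valid
        // bytes, even though 1500 data bytes are on disk.
        System.out.println(validLength(1500, 8)); // 1024
    }
}
```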
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901130#action_12901130 ] sam rash commented on HDFS-1262: my apologies for the delay. I've been caught up in some hi-pri bits at work. thanks for the comments. inlined responses

# why does abandonFile return boolean? looks like right now it can only return true or throw, may as well make it void, no?

good question: I stole abandonBlock(), which has the same behavior. It returns true or throws an exception. I was trying to keep it consistent (rather than logical per se). I do prefer the void option as it makes the method more clear.

# in the log message in FSN.abandonFile it looks like there's a missing '+ src +' in the second log message
# in the log messages, also log the holder argument perhaps

will fix

# in previous append-branch patches we've been trying to keep RPC compatibility with unpatched 0.20 - ie you can run an updated client against an old NN, with the provision that it might not fix all the bugs. Given that, maybe we should catch the exception we get if we call abandonFile() and get back an exception indicating the method doesn't exist? Check out what we did for the HDFS-630 backport for example.

nice idea, I will check this out

# looks like there are some other patches that got conflated into this one - eg testSimultaneousRecoveries is part of another patch on the append branch.

hmm, yea, not sure what happened here... weird, I think I applied one of your patches. Which patch is that test from?

# missing Apache license on new test file

will fix

# typo: Excection instead of Exception

will fix

# (PermissionStatus) anyObject() might generate an unchecked cast warning - I think you can do Matchers.<PermissionStatus>anyObject() or some such to avoid the unchecked cast

ah, nice catch, will fix also

# given the complexity of the unit test, would be good to add some comments for the general flow of what all the mocks/spys are achieving. I found myself a bit lost in the abstractions

yea, sorry, was in a rush before vacation to get some test + patch up. It was a bit tricky to get this case going for both create + append; I'll document the case better (at all)
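The unchecked-cast fix suggested in the review above relies on Java's explicit type-argument syntax for generic methods (`Matchers.<PermissionStatus>anyObject()` in Mockito). A minimal standalone illustration of the same syntax, using `Collections.emptyList` so it compiles without Mockito on the classpath:

```java
import java.util.Collections;
import java.util.List;

public class ExplicitTypeArg {
    public static void main(String[] args) {
        // Casting the result of a generic factory, as in
        // (PermissionStatus) anyObject(), triggers an unchecked warning.
        // The explicit type argument pins the inferred type instead, which
        // is what Matchers.<PermissionStatus>anyObject() does for a matcher.
        List<String> xs = Collections.<String>emptyList();
        System.out.println(xs.size()); // 0
    }
}
```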
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901134#action_12901134 ] sam rash commented on HDFS-1262: re: RPC compatibility. I'm not 100% sure this is a good idea. If we start to enumerate the cases of how a client can interact with the server, bugs seem more likely. It makes sense with a single method, but not if RPC changes become interdependent. what's the case that mandates using a new client against an old namenode? is it not possible to use the appropriately versioned client? or is it the case of heterogeneous sets of clusters and the simplicity of managing a single client code base? any other thoughts on this?
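For reference, the compatibility pattern being debated (as in the HDFS-630 backport approach mentioned earlier in the thread) amounts to catching the error an old server raises for an unknown RPC method and degrading gracefully. A hedged sketch with illustrative names only, not the real ClientProtocol or RPC machinery:

```java
// Hedged sketch of the backward-compatibility pattern under discussion:
// call the new RPC, and if the server is old and reports an unknown
// method, fall back instead of failing. All names are illustrative.
public class RpcFallbackSketch {
    public interface NameNodeLike {
        void abandonFile(String src, String holder) throws Exception;
    }

    // Returns true if the call succeeded, false if the server is too old
    // to know the method; rethrows anything else.
    public static boolean tryAbandonFile(NameNodeLike nn, String src, String holder) {
        try {
            nn.abandonFile(src, holder);
            return true;
        } catch (Exception e) {
            // An old server typically surfaces a remote error naming the
            // missing method; treat that as "feature unavailable".
            if (String.valueOf(e.getMessage()).contains("abandonFile")) {
                return false;
            }
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Simulate an old server that does not implement abandonFile().
        NameNodeLike oldNN = (src, holder) -> {
            throw new Exception("Unknown method abandonFile");
        };
        System.out.println(tryAbandonFile(oldNN, "/some/file", "client-1")); // false
    }
}
```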
[jira] Commented: (HDFS-1346) DFSClient receives out of order packet ack
[ https://issues.apache.org/jira/browse/HDFS-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900105#action_12900105 ] sam rash commented on HDFS-1346: not easily--PacketResponder is a non-static inner class. Constructing a BlockReceiver requires a Datanode instance. If you can harness a Datanode, then you need to stub out the DataInputStream and figure out when to fire a callback (somehow when ack.readFields() reads from the DataInputStream, but not before). I think it's possible, but we haven't had time yet DFSClient receives out of order packet ack -- Key: HDFS-1346 URL: https://issues.apache.org/jira/browse/HDFS-1346 Project: Hadoop HDFS Issue Type: Bug Components: data-node, hdfs client Affects Versions: 0.20-append Reporter: Hairong Kuang Assignee: Hairong Kuang Fix For: 0.20-append Attachments: outOfOrder.patch When running 0.20 patched with HDFS-101, we sometimes see an error as follows: WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-2871223654872350746_21421120java.io.IOException: Responseprocessor: Expecting seq no for block blk_-2871223654872350746_21421120 10280 but received 10281 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2570) This indicates that DFS client expects an ack for packet N, but receives an ack for packet N+1.
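The testability problem described above (PacketResponder being a non-static inner class of BlockReceiver) comes down to Java's qualified-`new` requirement: the inner class can only be constructed through a live instance of its enclosing class. A small illustration with made-up names, not the HDFS classes:

```java
// Sketch of why a non-static inner class is hard to unit-test in
// isolation: it captures Outer.this, so the (in real code, heavyweight)
// outer object must be fully constructed first. Names are illustrative.
public class Outer {
    private final String state = "needs-full-outer-setup";

    public class Inner {                     // non-static: captures Outer.this
        public String describe() { return state; }
    }

    public static void main(String[] args) {
        Outer outer = new Outer();               // heavy setup in real code
        Outer.Inner inner = outer.new Inner();   // qualified-new syntax
        System.out.println(inner.describe());
    }
}
```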
[jira] Created: (HDFS-1342) expose DFSOutputStream.getNumCurrentReplicas() in libhdfs
expose DFSOutputStream.getNumCurrentReplicas() in libhdfs - Key: HDFS-1342 URL: https://issues.apache.org/jira/browse/HDFS-1342 Project: Hadoop HDFS Issue Type: New Feature Components: contrib/libhdfs Reporter: sam rash Assignee: sam rash Priority: Minor DFSOutputStream exposes the number of writers in a pipeline. We should make this callable from libhdfs
[jira] Commented: (HDFS-1330) Make RPCs to DataNodes timeout
[ https://issues.apache.org/jira/browse/HDFS-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12896684#action_12896684 ] sam rash commented on HDFS-1330: +1 lgtm Make RPCs to DataNodes timeout -- Key: HDFS-1330 URL: https://issues.apache.org/jira/browse/HDFS-1330 Project: Hadoop HDFS Issue Type: New Feature Components: data-node Affects Versions: 0.22.0 Reporter: Hairong Kuang Assignee: Hairong Kuang Fix For: 0.22.0 Attachments: hdfsRpcTimeout.patch This jira aims to make client/datanode and datanode/datanode RPCs have a timeout of DataNode#socketTimeout.
[jira] Commented: (HDFS-1252) TestDFSConcurrentFileOperations broken in 0.20-appendj
[ https://issues.apache.org/jira/browse/HDFS-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892444#action_12892444 ] sam rash commented on HDFS-1252: does the patch preserve the essence of the test: a file that is about to be closed is moved, and lease recovery should still work (ie, recover blocks that are already finalized on DNs)? TestDFSConcurrentFileOperations broken in 0.20-appendj -- Key: HDFS-1252 URL: https://issues.apache.org/jira/browse/HDFS-1252 Project: Hadoop HDFS Issue Type: Test Components: test Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: 0.20-append Attachments: hdfs-1252.txt This test currently has several flaws: - It calls DN.updateBlock with a BlockInfo instance, which then causes java.lang.RuntimeException: java.lang.NoSuchMethodException: org.apache.hadoop.hdfs.server.namenode.BlocksMap$BlockInfo.<init>() in the logs when the DN tries to send blockReceived for the block - It assumes that getBlockLocations returns an up-to-date length block after a sync, which is false. It happens to work because it calls getBlockLocations directly on the NN, and thus gets a direct reference to the block in the blockmap, which later gets updated This patch fixes this test to use the AppendTestUtil functions to initiate recovery, and generally pass more reliably.
[jira] Updated: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam rash updated HDFS-1262: --- Attachment: hdfs-1262-3.txt removed empty file MockitoUtil
[jira] Updated: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam rash updated HDFS-1262: --- Attachment: hdfs-1262-4.txt fixed bug where calling append() to trigger lease recovery resulted in a client-side exception (trying to abandon a file that you don't own the lease on). DFSClient now catches this exception and logs it
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883828#action_12883828 ] sam rash commented on HDFS-1262: one note:
{code}
public void updateRegInfo(DatanodeID nodeReg) {
  name = nodeReg.getName();
  infoPort = nodeReg.getInfoPort();
  // update any more fields added in future.
}
{code}
should be:
{code}
public void updateRegInfo(DatanodeID nodeReg) {
  name = nodeReg.getName();
  infoPort = nodeReg.getInfoPort();
  ipcPort = nodeReg.getIpcPort();
  // update any more fields added in future.
}
{code}
it wasn't copying the ipcPort for some reason. My patch includes this fix. trunk doesn't have this bug
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883830#action_12883830 ] sam rash commented on HDFS-1262: above is from DatanodeID.java
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883857#action_12883857 ] sam rash commented on HDFS-1262: verified the test case passes w/o that patch. we should commit hdfs-894 to 20-append for sure, though. that seems like a potentially gnarly bug in tests to track down (took me a short spell). i can upload the patch w/o the DatanodeID change
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883855#action_12883855 ] sam rash commented on HDFS-1262: that's probably better. this was dependent on it as i was killing the datanodes to simulate the pipeline failure. i ended up tuning the test case to use mockito to throw exceptions at the end of an NN rpc call for both append() and create(), so I think that dependency is gone. can we mark this as dependent on that if it turns out to be needed?
[jira] Updated: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam rash updated HDFS-1262: --- Attachment: hdfs-1262-2.txt removed hdfs-894 change from patch (commit this to 0.20-append separately)
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883785#action_12883785 ] sam rash commented on HDFS-1057: from the raw console output of hudson:

[exec] [junit] Tests run: 3, Failures: 0, Errors: 1, Time elapsed: 0.624 sec
[exec] [junit] Test org.apache.hadoop.hdfs.security.token.block.TestBlockToken FAILED
--
[exec] [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.706 sec
[exec] [junit] Test org.apache.hadoop.hdfs.server.common.TestJspHelper FAILED
--
[exec] [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 28.477 sec
[exec] [junit] Test org.apache.hadoop.hdfsproxy.TestHdfsProxy FAILED

I ran the tests locally and the first 2 succeed. The third fails on the latest trunk without hdfs-1057. I think from the test perspective, this is safe to commit.

1. TestBlockToken

run-test-hdfs:
[delete] Deleting directory /data/users/srash/apache/hadoop-hdfs/build/test/data
[mkdir] Created dir: /data/users/srash/apache/hadoop-hdfs/build/test/data
[delete] Deleting directory /data/users/srash/apache/hadoop-hdfs/build/test/logs
[mkdir] Created dir: /data/users/srash/apache/hadoop-hdfs/build/test/logs
[junit] WARNING: multiple versions of ant detected in path for junit
[junit] jar:file:/usr/local/ant/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit] and jar:file:/home/srash/.ivy2/cache/ant/ant/jars/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.hadoop.hdfs.security.token.block.TestBlockToken
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1.248 sec

2. TestJspHelper

run-test-hdfs:
[delete] Deleting directory /data/users/srash/apache/hadoop-hdfs/build/test/data
[mkdir] Created dir: /data/users/srash/apache/hadoop-hdfs/build/test/data
[delete] Deleting directory /data/users/srash/apache/hadoop-hdfs/build/test/logs
[mkdir] Created dir: /data/users/srash/apache/hadoop-hdfs/build/test/logs
[junit] WARNING: multiple versions of ant detected in path for junit
[junit] jar:file:/usr/local/ant/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit] and jar:file:/home/srash/.ivy2/cache/ant/ant/jars/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.hadoop.hdfs.server.common.TestJspHelper
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.275 sec

Concurrent readers hit ChecksumExceptions if following a writer to very end of file --- Key: HDFS-1057 URL: https://issues.apache.org/jira/browse/HDFS-1057 Project: Hadoop HDFS Issue Type: Sub-task Components: data-node Affects Versions: 0.20-append, 0.21.0, 0.22.0 Reporter: Todd Lipcon Assignee: sam rash Priority: Blocker Fix For: 0.20-append Attachments: conurrent-reader-patch-1.txt, conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, HDFS-1057-0.20-append.patch, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, hdfs-1057-trunk-3.txt, hdfs-1057-trunk-4.txt, hdfs-1057-trunk-5.txt, hdfs-1057-trunk-6.txt In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before calling flush(). Therefore, if there is a concurrent reader, it's possible to race here - the reader will see the new length while those bytes are still in the buffers of BlockReceiver. Thus the client will potentially see checksum errors or EOFs. Additionally, the last checksum chunk of the file is made accessible to readers even though it is not stable.
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883319#action_12883319 ] sam rash commented on HDFS-1262: in the 2nd case, can't the client still call close? Or will it hang forever waiting for blocks? Either way, I've got test cases for create() + append() and the fix. It took a little longer to clean up today, but I will post the patch by end of day.
Failed pipeline creation during append leaves lease hanging on NN
Key: HDFS-1262 URL: https://issues.apache.org/jira/browse/HDFS-1262 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client, name-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: sam rash Priority: Critical Fix For: 0.20-append
Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened was the following:
1) File's original writer died
2) Recovery client tried to open file for append - looped for a minute or so until soft lease expired, then append call initiated recovery
3) Recovery completed successfully
4) Recovery client calls append again, which succeeds on the NN
5) For some reason, the block recovery that happens at the start of append pipeline creation failed on all datanodes 6 times, causing the append() call to throw an exception back to HBase master. HBase assumed the file wasn't open and put it back on a queue to try later
6) Some time later, it tried append again, but the lease was still assigned to the same DFS client, so it wasn't able to recover.
The recovery failure in step 5 is a separate issue, but the problem for this JIRA is that the client can think it failed to open a file for append while the NN thinks the writer holds a lease. Since the writer keeps renewing its lease, recovery never happens, and no one can open or recover the file until the DFS client shuts down.
[jira] Updated: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam rash updated HDFS-1262: Attachment: hdfs-1262-1.txt
- test case for append and create failures
- tried to get it so both cases fail fast, but create will hit the test timeout (the default for a create that gets AlreadyBeingCreatedException is 5 retries with a 60s sleep)
- the append case fails in 30s without the fix in the worst case
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883413#action_12883413 ] sam rash commented on HDFS-1057: the one test that failed from my new tests had an fd leak; I've corrected that. The other failed tests I cannot reproduce:
1. org.apache.hadoop.hdfs.TestFileConcurrentReader.testUnfinishedBlockCRCErrorNormalTransferVerySmallWrite - had an fd leak, fixed
2. org.apache.hadoop.hdfs.security.token.block.TestBlockToken.testBlockTokenRpc
[junit] Running org.apache.hadoop.hdfs.security.token.block.TestBlockToken
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1.305 sec
3. org.apache.hadoop.hdfs.server.common.TestJspHelper.testGetUgi
[junit] Running org.apache.hadoop.hdfs.server.common.TestJspHelper
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.309 sec
I can submit the patch with the fix for #1 plus the warning fixes.
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882979#action_12882979 ] sam rash commented on HDFS-1262: also, in writing up the test case, I realized DFSClient.create() is not susceptible to the same scenario. While in theory it could happen on the NN side, right now the namenode RPC for create happens and then all we do is start the streamer (hence I don't have a test case for it yet). I still think having a finally block that calls abandonFile() for create is prudent: if we get any exception in the process client-side, abandon the file to be safe.
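The abandon-on-failure pattern proposed above can be sketched as follows. This is a hypothetical, simplified model, not the real DFSClient/ClientProtocol code: the `NameNode` interface, `abandonFile`, and `createFile` names are illustrative stand-ins.

```java
// Hypothetical sketch: if client-side pipeline setup throws after the NN
// has granted the lease, give the lease back in a finally block so later
// writers are not blocked. Names here are assumptions, not the real API.
class LeaseDemo {
    interface NameNode {
        void create(String src, String holder);       // NN grants the lease
        void abandonFile(String src, String holder);  // NN releases the lease
    }

    static void createFile(NameNode nn, String src, String holder,
                           Runnable pipelineSetup) {
        nn.create(src, holder);  // lease is now held on the NN
        boolean ok = false;
        try {
            pipelineSetup.run();  // may fail before any data is written
            ok = true;
        } finally {
            if (!ok) {
                // abandon the file to be safe, per the comment above
                nn.abandonFile(src, holder);
            }
        }
    }
}
```

The point of the `finally` block is that the lease is released on every failure path, not just the exceptions the client anticipates.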
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882609#action_12882609 ] sam rash commented on HDFS-1262: hey, so what should the precise semantics of abandonFile(String src, String holder) be? I have a quick impl now (+ test case) that does this:
1. check that holder owns the lease for src
2. call internalReleaseLeaseOne
So it really is a glorified 'cleanup and close', which has the same behavior as if the lease expired - nice and tidy imo. It does have the slight delay of lease recovery, though.
An alternative option: for the specific case we are fixing here, we could do something simpler, such as just putting the targets in the blockMap and calling completeFile (basically what commitBlockSynchronization would do). However, this doesn't handle the general case if we expose abandonFile at any other time and a client has actually written data to the last block. I think the first option is safer, but maybe I'm too cautious. If the way I've implemented it seems ok, I can post the patch for review asap.
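Option 1 above can be modeled in a few lines. This is a toy sketch only: the class, the lease map, and the `internalReleaseLease` body are illustrative stand-ins for FSNamesystem's lease manager, not its actual code.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Toy model of option 1: abandonFile checks that the caller holds the
// lease, then runs the same cleanup path as a hard lease expiry.
class LeaseManagerSketch {
    private final Map<String, String> leases = new HashMap<>(); // src -> holder

    void addLease(String src, String holder) { leases.put(src, holder); }

    boolean hasLease(String src) { return leases.containsKey(src); }

    void abandonFile(String src, String holder) throws IOException {
        // step 1: holder must own the lease for src
        if (!holder.equals(leases.get(src))) {
            throw new IOException(holder + " does not hold the lease for " + src);
        }
        // step 2: same behavior as if the lease expired
        internalReleaseLease(src);
    }

    private void internalReleaseLease(String src) {
        leases.remove(src);  // stand-in for block recovery + close
    }
}
```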
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882644#action_12882644 ] sam rash commented on HDFS-1057: sorry, I don't understand. This is a race condition where the namenode has assigned locations to the block, but the client hasn't sent data yet. The NN cannot know that the DNs don't have data on disk yet unless we add additional NN coordination. Our choice in this condition is to return 0 or let the exception propagate. I had done the latter, but you asked for the former, unless I misunderstood. Can you clarify what you want?
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882664#action_12882664 ] sam rash commented on HDFS-1057: per our offline discussion, it seems the NN doesn't know when the pipeline is created, but the writer does, so the NN has to return the replicas for the current block in this case. I will change it so we check all DNs for a replica before using the default of 0. I need to think about whether we require all DNs to report ReplicaNotFound specifically (versus some other exception).
[jira] Updated: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam rash updated HDFS-1057: Attachment: hdfs-1057-trunk-5.txt
- returns 0 length only if all DNs are missing the replica (any other io exception will cause the client to get an exception, and it can retry)
- my diff viewer does not show any whitespace or indentation changes, but please advise if you see any
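The behavior described in this patch can be sketched roughly as below: probe each DN for the replica's visible length, treat "replica not found" as "try the next DN", and return 0 only when every DN is missing the replica; any other IOException propagates so the client can retry. The names and shapes here are illustrative, not the actual DFSClient code.

```java
import java.io.IOException;
import java.util.List;

// Hedged sketch of "0 length only if all DNs are missing the replica".
class LengthProbe {
    static class ReplicaNotFoundException extends IOException {}

    interface DataNode {
        long visibleLength() throws IOException;
    }

    static long readBlockLength(List<DataNode> dns) throws IOException {
        for (DataNode dn : dns) {
            try {
                return dn.visibleLength();
            } catch (ReplicaNotFoundException e) {
                // this DN has no replica yet; fall through to the next one
            }
            // any other IOException propagates; the client may retry
        }
        return 0;  // every DN was missing the replica: block not created yet
    }
}
```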
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882175#action_12882175 ] sam rash commented on HDFS-1262: we actually use a new FileSystem instance per file in scribe. See http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html
There are some downsides to this (creating a new FileSystem instance can be expensive and issues fork/exec calls for 'whoami' and 'groups'), but we have patches to minimize this.
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882176#action_12882176 ] sam rash commented on HDFS-1262: I am also wondering why this hasn't shown up in regular create calls before now. Both DFSClient.append() and DFSClient.create() are susceptible to the same problem (the client has the lease, then throws an exception setting up the pipeline).
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882263#action_12882263 ] sam rash commented on HDFS-1262: todd: can you confirm whether the exception was from the namenode.append() call or from creating the output stream? (sounds like the latter, in the lease recovery it initiates)
{code}
OutputStream append(String src, int buffersize, Progressable progress
    ) throws IOException {
  checkOpen();
  FileStatus stat = null;
  LocatedBlock lastBlock = null;
  try {
    stat = getFileInfo(src);
    lastBlock = namenode.append(src, clientName);
  } catch (RemoteException re) {
    throw re.unwrapRemoteException(FileNotFoundException.class,
                                   AccessControlException.class,
                                   NSQuotaExceededException.class,
                                   DSQuotaExceededException.class);
  }
  OutputStream result = new DFSOutputStream(src, buffersize, progress,
      lastBlock, stat, conf.getInt("io.bytes.per.checksum", 512));
  leasechecker.put(src, result);
  return result;
}
{code}
Either way, I think the right way to do this is to add back an abandonFile RPC call in the NN. Even if we don't change function call signatures for abandonBlock, we will break client/server compatibility. Thoughts?
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882283#action_12882283 ] sam rash commented on HDFS-1262: I'd appreciate the chance to implement it, actually. Thanks. Re: the name, according to Dhruba, there used to be one called abandonFile which had the semantics we need. Also, a similar error can occur on non-append creates, so having 'append' in the name probably doesn't make sense. abandonFile, or another idea?
[jira] Assigned: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam rash reassigned HDFS-1262: Assignee: sam rash
[jira] Commented: (HDFS-1186) 0.20: DNs should interrupt writers at start of recovery
[ https://issues.apache.org/jira/browse/HDFS-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882350#action_12882350 ] sam rash commented on HDFS-1186: hey todd, I was looking at this patch, and while it has certainly reduced the chance of problems, isn't it still possible a new writer thread could be created:
1. between the kill loop in startBlockRecovery() and the synchronized block
2. between the startBlockRecovery() call and the updateBlock() call
I seem to recall reasoning with dhruba that while in theory these could occur from the DN perspective, the circumstances that would have to occur outside the DN could not (once you fixed hdfs-1260 anyway, where genstamp checks work right in concurrent lease recovery). What's your take on this? Is it foolproof now (1 and 2 can't happen)? Or what about introducing a state like RUR here (at least disabling writes to a block while it is under recovery, maybe timing out in case the lease recovery owner dies)?
0.20: DNs should interrupt writers at start of recovery
Key: HDFS-1186 URL: https://issues.apache.org/jira/browse/HDFS-1186 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Blocker Attachments: hdfs-1186.txt
When block recovery starts (eg due to NN recovering lease) it needs to interrupt any writers currently writing to those blocks. Otherwise, an old writer (who hasn't realized he lost his lease) can continue to write+sync to the blocks, and thus recovery ends up truncating data that has been sync()ed.
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882359#action_12882359 ] sam rash commented on HDFS-1057: the patch should use --no-prefix to get rid of 'a' and 'b' in the paths, fyi.
[jira] Commented: (HDFS-1186) 0.20: DNs should interrupt writers at start of recovery
[ https://issues.apache.org/jira/browse/HDFS-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882367#action_12882367 ] sam rash commented on HDFS-1186: yea, I think so. Let me repeat it slightly differently to make sure I get this at a higher level:
1. we make sure that a lease recovery that starts with an old gs at one stage (that's synchronized) actually mutates the block data of only the same gs
2. a new writer that comes in between the start of recovery and the actual stamping must have a new gs, since it can only come into being via lease recovery
This is effectively saying that if concurrent lease recoveries get started, the first to complete wins (as it should), and later completions just fail. Sounds like optimistic locking/versioned puts in the cache world, actually: updateBlock requires the source to match an expected source. Nice idea.
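The optimistic-locking analogy above can be made concrete with a minimal sketch: stamping a replica behaves like a compare-and-set on its generation stamp, so of two concurrent recoveries the first to stamp wins and the stale one fails. This models the idea only; the real updateBlock does much more than bump a counter.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal model of "updateBlock requires the source to match an expected
// source": a CAS on the generation stamp. Illustrative, not DataNode code.
class ReplicaStampSketch {
    private final AtomicLong genStamp;

    ReplicaStampSketch(long initialGS) { genStamp = new AtomicLong(initialGS); }

    // Succeeds only if the replica still carries the expected (old) stamp.
    boolean updateBlock(long expectedGS, long newGS) {
        return genStamp.compareAndSet(expectedGS, newGS);
    }

    long current() { return genStamp.get(); }
}
```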
[jira] Commented: (HDFS-1186) 0.20: DNs should interrupt writers at start of recovery
[ https://issues.apache.org/jira/browse/HDFS-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882386#action_12882386 ] sam rash commented on HDFS-1186: how could this happen? the GS=2 stamp succeeds on A and B. for GS=3 to win on C, GS=2 had to fail there, which means it went second. the primary for GS=2 would get a failure stamping DN C and would fail the lease recovery, right?
[jira] Commented: (HDFS-1186) 0.20: DNs should interrupt writers at start of recovery
[ https://issues.apache.org/jira/browse/HDFS-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882389#action_12882389 ] sam rash commented on HDFS-1186: I think you can make this argument: 1. each node has to make a transition from x -> x+k 2. at most one node owns any x -> x+k transition as the primary of a recovery 3. success requires all DNs to complete x -> x+k 4. the primary then commits x -> x+k, and until commitBlockSync completes, no transition y -> y+j with y > x can come in. right?
[jira] Commented: (HDFS-1186) 0.20: DNs should interrupt writers at start of recovery
[ https://issues.apache.org/jira/browse/HDFS-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882401#action_12882401 ] sam rash commented on HDFS-1186: hmm, i wonder why only 1? if the client thinks there are 3 DNs in the pipeline and asks to recover 3, i think it should fail with fewer than 3. a client can request fewer if that works (in which case we do have to handle the problem you lay out). so in your solution, you are saying that the lease holder, the client, needs to be contacted to verify that the primary is the only one doing lease recovery (or at least the latest)?
[jira] Commented: (HDFS-1186) 0.20: DNs should interrupt writers at start of recovery
[ https://issues.apache.org/jira/browse/HDFS-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882404#action_12882404 ] sam rash commented on HDFS-1186: wait, why can't commitBlockSync on the NN just do the same check on genstamps? if two primaries start concurrent lease recoveries and split the remaining nodes as far as who wins the stamping, can't the NN resolve who wins in the end? then the loser's replica would be marked invalid and replication takes over to fix it. or do i have this sinking feeling because i am still missing something?
[jira] Commented: (HDFS-1263) 0.20: in tryUpdateBlock, the meta file is renamed away before genstamp validation is done
[ https://issues.apache.org/jira/browse/HDFS-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881739#action_12881739 ] sam rash commented on HDFS-1263: several thoughts/comments: my reading of the code is that the temp file was to make the creation of a meta file that is both truncated and has the new genstamp an atomic operation on the filesystem. If we rename first, then crash and recover, how do we know that the truncation didn't finish (without information from the NN or another node giving us a new length)? If we truncate first, then we have effectively corrupted the block. can you also explain the error state that results? (truncated blocks, infinite loops, bad metadata, etc.) and do i follow that a client started 2 lease recoveries? or was this a client and the NN somehow? (ie, how were there concurrent recoveries of the same block?) seems like extra synchronization in parts of updateBlock might help as well. also, we check that the genstamp is moving upwards both at the start of updateBlock and at the end of tryUpdateBlock. do you know why? 0.20: in tryUpdateBlock, the meta file is renamed away before genstamp validation is done - Key: HDFS-1263 URL: https://issues.apache.org/jira/browse/HDFS-1263 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: 0.20-append Saw an issue where multiple datanodes are trying to recover at the same time, and all of them failed. I think the issue is in FSDataset.tryUpdateBlock, we do the rename of blk_B_OldGS to blk_B_OldGS_tmpNewGS and *then* check that the generation stamp is moving upwards. Because of this, invalid update block calls are blocked, but they then cause future updateBlock calls to fail with Meta file not found errors. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
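The validate-before-mutate ordering being discussed can be illustrated with a toy model. Names and structure here are hypothetical, not the real FSDataset.tryUpdateBlock:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the fix direction: validate the generation stamp *before*
// renaming the meta file, so a stale recovery cannot rename state away
// and break later recoveries with "Meta file not found" errors.
class MetaStore {
    private final Map<Long, String> metaFiles = new HashMap<>(); // blockId -> "blk_<id>_<gs>"
    private final Map<Long, Long> genStamps = new HashMap<>();   // blockId -> current GS

    void put(long blockId, long gs) {
        genStamps.put(blockId, gs);
        metaFiles.put(blockId, "blk_" + blockId + "_" + gs);
    }

    /** Validate first; only then mutate (the ordering HDFS-1263 is about). */
    boolean tryUpdateBlock(long blockId, long oldGs, long newGs) {
        Long current = genStamps.get(blockId);
        if (current == null || current != oldGs || newGs <= oldGs) {
            return false; // stale recovery: reject without touching the meta file
        }
        metaFiles.put(blockId, "blk_" + blockId + "_" + newGs); // the "rename"
        genStamps.put(blockId, newGs);
        return true;
    }

    String metaFile(long blockId) {
        return metaFiles.get(blockId);
    }
}
```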
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881748#action_12881748 ] sam rash commented on HDFS-1262: i think something along the lines of option 2 sounds cleaner imo. but i have another question: does the error you see happen because the current leaseholder is trying to recreate the file? it sounds like this code is executing:
{code}
//
// We found the lease for this file. And surprisingly the original
// holder is trying to recreate this file. This should never occur.
//
if (lease != null) {
  Lease leaseFile = leaseManager.getLeaseByPath(src);
  if (leaseFile != null && leaseFile.equals(lease)) {
    throw new AlreadyBeingCreatedException(
        "failed to create file " + src + " for " + holder +
        " on client " + clientMachine +
        " because current leaseholder is trying to recreate file.");
  }
}
{code}
and anytime I see a comment "this should never happen", it sounds to me like the handling of that case might be suboptimal. is there any reason that a client shouldn't be able to open a file in the same mode it already has it open? NN-side, it's basically a no-op, or an explicit lease renewal. any reason we can't make the above code do that? (log something and return) Failed pipeline creation during append leaves lease hanging on NN - Key: HDFS-1262 URL: https://issues.apache.org/jira/browse/HDFS-1262 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client, name-node Affects Versions: 0.20-append Reporter: Todd Lipcon Priority: Critical Fix For: 0.20-append Ryan Rawson came upon this nasty bug in HBase cluster testing.
What happened was the following: 1) File's original writer died 2) Recovery client tried to open file for append - looped for a minute or so until soft lease expired, then append call initiated recovery 3) Recovery completed successfully 4) Recovery client calls append again, which succeeds on the NN 5) For some reason, the block recovery that happens at the start of append pipeline creation failed on all datanodes 6 times, causing the append() call to throw an exception back to HBase master. HBase assumed the file wasn't open and put it back on a queue to try later 6) Some time later, it tried append again, but the lease was still assigned to the same DFS client, so it wasn't able to recover. The recovery failure in step 5 is a separate issue, but the problem for this JIRA is that the NN can think it failed to open a file for append when the NN thinks the writer holds a lease. Since the writer keeps renewing its lease, recovery never happens, and no one can open or recover the file until the DFS client shuts down. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881755#action_12881755 ] sam rash commented on HDFS-1262: in looking at the client code, my suggestion above probably isn't a good idea: it would allow concurrent writes. i think the simplest solution is this: add a finally block that removes the path from the LeaseChecker in DFSClient. then the lease will expire in 60s.
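The finally-block idea can be sketched roughly as follows; LeaseRegistry and its methods are hypothetical stand-ins for DFSClient's lease bookkeeping, not the real API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: if stream construction fails after the lease was granted,
// unregister the path so the client stops renewing and the lease can
// expire on the NN. All names here are hypothetical.
class LeaseRegistry {
    private final Map<String, Object> renewing = new ConcurrentHashMap<>();

    Object open(String src) {
        renewing.put(src, new Object());       // lease granted by the NN
        boolean ok = false;
        try {
            Object stream = createStream(src); // may throw (pipeline setup, etc.)
            ok = true;
            return stream;
        } finally {
            if (!ok) {
                renewing.remove(src);          // the fix: stop renewing on failure
            }
        }
    }

    private Object createStream(String src) {
        if (src.startsWith("/bad")) {
            throw new IllegalStateException("append pipeline creation failed");
        }
        return new Object();
    }

    boolean holdsLease(String src) {
        return renewing.containsKey(src);
    }
}
```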
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881757#action_12881757 ] sam rash commented on HDFS-1262: provided there is agreement on the last suggestion, i'm happy to take care of it btw
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881759#action_12881759 ] sam rash commented on HDFS-1262: oops, nevermind--forgot that lease renewal is by client name *only* (hence your option 1). still pondering this a bit, but ~option 2 sounds most appealing: a client should have a way to release the lease it has on a file without necessarily doing a normal close (and hence completeFile).
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881783#action_12881783 ] sam rash commented on HDFS-1262: actually, here's another idea: 3) the NN thinks the client has a lease, and it's right. the client just didn't save enough information to handle the failure. namenode.append() just returns the last block. The code in DFSClient:
{code}
OutputStream result = new DFSOutputStream(src, buffersize, progress,
    lastBlock, stat, conf.getInt("io.bytes.per.checksum", 512));
leasechecker.put(src, result);
return result;
{code}
if in leasechecker we stored a pair, lastBlock and result (and did so in a finally block):
{code}
OutputStream result = null;
try {
  result = new DFSOutputStream(src, buffersize, progress,
      lastBlock, stat, conf.getInt("io.bytes.per.checksum", 512));
} finally {
  Pair<LocatedBlock, OutputStream> pair = new Pair<LocatedBlock, OutputStream>(lastBlock, result);
  leasechecker.put(src, pair);
  return result;
}
{code}
and above, we only call namenode.append() if we don't have a lease already. again, if we do find a solution, i'm happy to help out on this one.
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881795#action_12881795 ] sam rash commented on HDFS-1262: i see, it only solves the re-open-by-the-same-client problem, not the blocking of other clients. the fact is the client does have the lease, and currently the only way to release it is via close. in looking at DFSClient.create(), the same problem can occur there: we make a NN rpc call to get a block and acquire a lease, and then create the DFSOutputStream (which could fail). i think that comes back to the need to be able to release a lease without calling namenode.completeFile(). i guess there's not a clever way to do this with the existing namenode RPC and/or client-initiated lease recovery?
[jira] Commented: (HDFS-1263) 0.20: in tryUpdateBlock, the meta file is renamed away before genstamp validation is done
[ https://issues.apache.org/jira/browse/HDFS-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881801#action_12881801 ] sam rash commented on HDFS-1263: so if i follow: checking that the genstamp of the file is the one we are trying to update, before doing any mutation of blocks or metadata (ie renaming), should fix this issue. regarding throwing an ioe on concurrent recovery in the same node, that might be problematic if: DN A can talk to DN B, not DN C; DN B can talk to DN A and DN C; DN A starts recovery first; DN B starts after. if DN B talks to DN A before DN A times out talking to C, we'll fail a recovery that could succeed, no? i like the idea of failing these as early in the pipeline as possible, but i lean towards fixing the genstamp detection. seems like the whole genstamp process is designed for this--there's just a tiny bug with the rename
[jira] Commented: (HDFS-1260) 0.20: Block lost when multiple DNs trying to recover it to different genstamps
[ https://issues.apache.org/jira/browse/HDFS-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881985#action_12881985 ] sam rash commented on HDFS-1260: about the testing, any reason not to use one of the adapters instead of making this public? {code} public long nextGenerationStampForBlock(Block block) throws IOException { {code} sorry, i'm a stickler for visibility/encapsulation bits when i can be 0.20: Block lost when multiple DNs trying to recover it to different genstamps -- Key: HDFS-1260 URL: https://issues.apache.org/jira/browse/HDFS-1260 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Fix For: 0.20-append Attachments: hdfs-1260.txt Saw this issue on a cluster where some ops people were doing network changes without shutting down DNs first. So, recovery ended up getting started at multiple different DNs at the same time, and some race condition occurred that caused a block to get permanently stuck in recovery mode. What seems to have happened is the following: - FSDataset.tryUpdateBlock called with old genstamp 7091, new genstamp 7094, while the block in the volumeMap (and on filesystem) was genstamp 7093 - we find the block file and meta file based on block ID only, without comparing gen stamp - we rename the meta file to the new genstamp _7094 - in updateBlockMap, we do comparison in the volumeMap by oldblock *without* wildcard GS, so it does *not* update volumeMap - validateBlockMetaData now fails with blk_7739687463244048122_7094 does not exist in blocks map After this point, all future recovery attempts to that node fail in getBlockMetaDataInfo, since it finds the _7094 gen stamp in getStoredBlock (since the meta file got renamed above) and then fails since _7094 isn't in volumeMap in validateBlockMetadata Making a unit test for this is probably going to be difficult, but doable. 
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
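The adapter suggestion might look roughly like this; class names are hypothetical stand-ins for FSNamesystem and the test-side *Adapter classes:

```java
// Keep the production method package-private and reach it from tests
// through a small same-package adapter, instead of widening it to public.
class NameSystem {
    private long genStamp = 1000;

    // package-private: visible to the adapter below, not to other packages
    long nextGenerationStampForBlock(long blockId) {
        return ++genStamp;
    }
}

/** Test-only accessor; production callers never see the method widened. */
final class NameSystemAdapter {
    private NameSystemAdapter() {}

    static long nextGenerationStampForBlock(NameSystem ns, long blockId) {
        return ns.nextGenerationStampForBlock(blockId);
    }
}
```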
[jira] Commented: (HDFS-1260) 0.20: Block lost when multiple DNs trying to recover it to different genstamps
[ https://issues.apache.org/jira/browse/HDFS-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881988#action_12881988 ] sam rash commented on HDFS-1260: oh, other than that, lgtm
[jira] Commented: (HDFS-1260) 0.20: Block lost when multiple DNs trying to recover it to different genstamps
[ https://issues.apache.org/jira/browse/HDFS-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881991#action_12881991 ] sam rash commented on HDFS-1260: yea, looks good. at some point, does it make sense to move the DelayAnswer class out? it seems generally useful (not this patch, but just thinking)
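For readers who haven't seen it, the gist of a DelayAnswer-style helper is a pair of latches that lets a test park a call and release it deterministically, so race windows become reproducible. A Mockito-free sketch with hypothetical names (the real helper wraps a Mockito Answer):

```java
import java.util.concurrent.CountDownLatch;

// Two latches: "fired" announces the call has arrived; "proceed" holds it
// there until the test decides the race window should close.
class DelayGate {
    private final CountDownLatch fired = new CountDownLatch(1);
    private final CountDownLatch proceed = new CountDownLatch(1);

    /** Called from the stubbed method: announce arrival, then block. */
    void await() {
        fired.countDown();
        awaitLatch(proceed);
    }

    /** Called from the test: wait until a call is parked inside await(). */
    void waitForCall() {
        awaitLatch(fired);
    }

    /** Called from the test: release the parked call. */
    void release() {
        proceed.countDown();
    }

    private static void awaitLatch(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }
}
```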
[jira] Commented: (HDFS-1218) 20 append: Blocks recovered on startup should be treated with lower priority during block synchronization
[ https://issues.apache.org/jira/browse/HDFS-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881249#action_12881249 ] sam rash commented on HDFS-1218: I racked my brain and can't come up with a case where this could actually occur--keepLength is only set true when doing an append. If any nodes had gone down and come back up (RWR), they either have an old genstamp and will be ignored, or soft-lease-expiry recovery is initiated by the NN with keepLength = false first. i think the idea + patch look good to me (and thanks for taking the time to explain it) 20 append: Blocks recovered on startup should be treated with lower priority during block synchronization - Key: HDFS-1218 URL: https://issues.apache.org/jira/browse/HDFS-1218 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Fix For: 0.20-append Attachments: hdfs-1281.txt When a datanode experiences power loss, it can come back up with truncated replicas (due to local FS journal replay). Those replicas should not be allowed to truncate the block during block synchronization if there are other replicas from DNs that have _not_ restarted. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1202) DataBlockScanner throws NPE when updated before initialized
[ https://issues.apache.org/jira/browse/HDFS-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881370#action_12881370 ] sam rash commented on HDFS-1202: this looks good. I checked trunk and I think it is needed there also DataBlockScanner throws NPE when updated before initialized --- Key: HDFS-1202 URL: https://issues.apache.org/jira/browse/HDFS-1202 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append, 0.22.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: 0.20-append Attachments: hdfs-1202-0.20-append.txt Missing an isInitialized() check in updateScanStatusInternal -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (HDFS-1214) hdfs client metadata cache
[ https://issues.apache.org/jira/browse/HDFS-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam rash reassigned HDFS-1214: -- Assignee: sam rash hdfs client metadata cache -- Key: HDFS-1214 URL: https://issues.apache.org/jira/browse/HDFS-1214 Project: Hadoop HDFS Issue Type: New Feature Components: hdfs client Reporter: Joydeep Sen Sarma Assignee: sam rash In some applications, latency is affected by the cost of making rpc calls to namenode to fetch metadata. the most obvious case is calls to fetch file/directory status. applications like hive like to make optimizations based on file size/number etc. - and for such optimizations - 'recent' status data (as opposed to most up-to-date) is acceptable. in such cases, a cache on the DFS client that transparently caches metadata would greatly benefit applications. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881421#action_12881421 ] sam rash commented on HDFS-1057: i have an updated patch, but it does not yet handle the missing replicas as 0 sized for under construction. there may be other 20 patches to port to make this happen. Concurrent readers hit ChecksumExceptions if following a writer to very end of file --- Key: HDFS-1057 URL: https://issues.apache.org/jira/browse/HDFS-1057 Project: Hadoop HDFS Issue Type: Sub-task Components: data-node Affects Versions: 0.20-append, 0.21.0, 0.22.0 Reporter: Todd Lipcon Assignee: sam rash Priority: Blocker Fix For: 0.20-append Attachments: conurrent-reader-patch-1.txt, conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, hdfs-1057-trunk-3.txt In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before calling flush(). Therefore, if there is a concurrent reader, it's possible to race here - the reader will see the new length while those bytes are still in the buffers of BlockReceiver. Thus the client will potentially see checksum errors or EOFs. Additionally, the last checksum chunk of the file is made accessible to readers even though it is not stable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam rash updated HDFS-1057: --- Attachment: hdfs-1057-trunk-4.txt includes requested changes by hairong. also handles immediate reading of new files by translating a ReplicaNotFoundException into a 0-length block within DFSInputStream for under construction files Concurrent readers hit ChecksumExceptions if following a writer to very end of file --- Key: HDFS-1057 URL: https://issues.apache.org/jira/browse/HDFS-1057 Project: Hadoop HDFS Issue Type: Sub-task Components: data-node Affects Versions: 0.20-append, 0.21.0, 0.22.0 Reporter: Todd Lipcon Assignee: sam rash Priority: Blocker Fix For: 0.20-append Attachments: conurrent-reader-patch-1.txt, conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, hdfs-1057-trunk-3.txt, hdfs-1057-trunk-4.txt In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before calling flush(). Therefore, if there is a concurrent reader, it's possible to race here - the reader will see the new length while those bytes are still in the buffers of BlockReceiver. Thus the client will potentially see checksum errors or EOFs. Additionally, the last checksum chunk of the file is made accessible to readers even though it is not stable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
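The fallback described in this patch note (treating a missing replica of an under-construction file as zero-length) can be sketched in isolation. This is a hedged illustration only; the class and method names below are invented for the sketch and are not the actual DFSInputStream API.

```java
// Sketch of the client-side fallback: if the DN has not yet put the replica
// in its volumeMap (i.e. it threw ReplicaNotFoundException, modeled here as
// a null length), an under-construction file is read as zero-length instead
// of failing. All names are illustrative assumptions.
class UnderConstructionLengthFallback {
    /**
     * @param dnLength          length reported by the DN, or null if the DN
     *                          could not find the replica
     * @param underConstruction whether the file is still being written
     */
    static long visibleLength(Long dnLength, boolean underConstruction) {
        if (dnLength == null) {
            if (underConstruction) {
                // Block allocated in the NN but not yet visible on the DN:
                // nothing readable yet, so report zero bytes.
                return 0L;
            }
            throw new IllegalStateException("replica missing for a finalized block");
        }
        return dnLength;
    }
}
```

A brand-new file thus reads as empty until the datanode registers the replica, rather than surfacing an error to the tailing reader.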
[jira] Commented: (HDFS-1218) 20 append: Blocks recovered on startup should be treated with lower priority during block synchronization
[ https://issues.apache.org/jira/browse/HDFS-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880987#action_12880987 ] sam rash commented on HDFS-1218: a few questions 1. this assumes a DN goes down with the client (either in tandem, or on the same box) and that the NN initiates lease recovery later correct? 2. the idea here is that RBW should have lengths longer than RWR, but both will have the same genstamp? If so, why aren't we just taking the replica with the longest length? Is there a reason to 3. if sync() did not complete, there is no violation. do I follow? i agree we can try to recover more data if it's there, but i just want to make sure i'm on the same page 20 append: Blocks recovered on startup should be treated with lower priority during block synchronization - Key: HDFS-1218 URL: https://issues.apache.org/jira/browse/HDFS-1218 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Fix For: 0.20-append Attachments: hdfs-1281.txt When a datanode experiences power loss, it can come back up with truncated replicas (due to local FS journal replay). Those replicas should not be allowed to truncate the block during block synchronization if there are other replicas from DNs that have _not_ restarted. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1218) 20 append: Blocks recovered on startup should be treated with lower priority during block synchronization
[ https://issues.apache.org/jira/browse/HDFS-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880989#action_12880989 ] sam rash commented on HDFS-1218: I realize in the hadoop code we already swallow InterruptedException frequently, but I think you can change the trend here: {code} // wait for all acks to be received back from datanodes synchronized (ackQueue) { if (!closed && ackQueue.size() != 0) { try { ackQueue.wait(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); // add this } continue; } } {code} otherwise, it's very easy to have a thread that I own and manage that has a DFSOutputStream in it that swallows an interrupt. when i check Thread.currentThread().isInterrupted() to see if one of my other threads has interrupted me, i will not see it (the crux here is that swallowing interrupts in threads that hadoop controls is less harmful--this is directly in client code when you call sync()/close()) 20 append: Blocks recovered on startup should be treated with lower priority during block synchronization - Key: HDFS-1218 URL: https://issues.apache.org/jira/browse/HDFS-1218 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Fix For: 0.20-append Attachments: hdfs-1281.txt When a datanode experiences power loss, it can come back up with truncated replicas (due to local FS journal replay). Those replicas should not be allowed to truncate the block during block synchronization if there are other replicas from DNs that have _not_ restarted. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1218) 20 append: Blocks recovered on startup should be treated with lower priority during block synchronization
[ https://issues.apache.org/jira/browse/HDFS-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880990#action_12880990 ] sam rash commented on HDFS-1218: disregard comment above, was meant for hdfs-895 20 append: Blocks recovered on startup should be treated with lower priority during block synchronization - Key: HDFS-1218 URL: https://issues.apache.org/jira/browse/HDFS-1218 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Fix For: 0.20-append Attachments: hdfs-1281.txt When a datanode experiences power loss, it can come back up with truncated replicas (due to local FS journal replay). Those replicas should not be allowed to truncate the block during block synchronization if there are other replicas from DNs that have _not_ restarted. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-895) Allow hflush/sync to occur in parallel with new writes to the file
[ https://issues.apache.org/jira/browse/HDFS-895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880991#action_12880991 ] sam rash commented on HDFS-895: --- re: the patch I realize in the hadoop code we already swallow InterruptedException frequently, but I think you can change the trend here: {code} // wait for all acks to be received back from datanodes synchronized (ackQueue) { if (!closed && ackQueue.size() != 0) { try { ackQueue.wait(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); // add this } continue; } } {code} otherwise, it's very easy to have a thread that I own and manage that has a DFSOutputStream in it that swallows an interrupt. when i check Thread.currentThread().isInterrupted() to see if one of my other threads has interrupted me, i will not see it (the crux here is that swallowing interrupts in threads that hadoop controls is less harmful--this is directly in client code when you call sync()/close()) Allow hflush/sync to occur in parallel with new writes to the file -- Key: HDFS-895 URL: https://issues.apache.org/jira/browse/HDFS-895 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs client Affects Versions: 0.22.0 Reporter: dhruba borthakur Assignee: Todd Lipcon Fix For: 0.22.0 Attachments: hdfs-895-0.20-append.txt, hdfs-895-20.txt, hdfs-895-trunk.txt, hdfs-895.txt In the current trunk, the HDFS client methods writeChunk() and hflush/sync are synchronized. This means that if a hflush/sync is in progress, an application cannot write data to the HDFS client buffer. This reduces the write throughput of the transaction log in HBase. The hflush/sync should allow new writes to happen to the HDFS client even when a hflush/sync is in progress. It can record the seqno of the message for which it should receive the ack, indicate to the DataStream thread to start flushing those messages, exit the synchronized section and just wait for that ack to arrive.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
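The interrupt-restoring pattern suggested in the comment above can be sketched as a self-contained helper. `waitRestoringInterrupt` is a hypothetical name for illustration, not DFSOutputStream code; the point is only that the interrupt flag is re-asserted instead of swallowed.

```java
// Minimal sketch: wait on a lock, and if interrupted, restore the thread's
// interrupt status so that callers (e.g. of sync()/close()) can still
// observe interrupts sent by their own coordinating threads.
class InterruptPreservingWait {
    /**
     * Waits on lock for up to millis. Returns true if the wait was
     * interrupted; in that case the interrupt flag is re-asserted rather
     * than silently dropped.
     */
    static boolean waitRestoringInterrupt(Object lock, long millis) {
        synchronized (lock) {
            try {
                lock.wait(millis);
                return false;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // re-assert for callers
                return true;
            }
        }
    }
}
```

Note that after restoring the flag the caller must also stop waiting (or handle the pending interrupt); blindly looping back into `wait()` with the flag set would throw `InterruptedException` again immediately.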
[jira] Commented: (HDFS-1218) 20 append: Blocks recovered on startup should be treated with lower priority during block synchronization
[ https://issues.apache.org/jira/browse/HDFS-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881005#action_12881005 ] sam rash commented on HDFS-1218: 1. how can there be a pipeline recovery by a client when the client goes down? in client-initiated recovery, it sends in the list of nodes which excludes the node that went down. Even if a node goes down and comes back up, it won't participate in recovery. The only case I can see that this can occur is if the client is not the one to initiate lease recovery--ie hard or soft limits in the NN. I only point this out because I wonder if this recovery code can be simplified. We already pass in a flag that is a surrogate for indicating NN initiated lease recovery (closeFile == true = NN). maybe not, but I wanted to throw it out there. 2. hmm, i think i see, it's sort of like using RBW and RWR as 1 and 0, and tacking to the genstamp so that you take the highest appended genstamp and take the shortest length of those as the length of the block. in this way, you are auto-incrementing the genstamp in a way... but I think there's still an edge case: i. client node has network trouble (slow, falls off) and transfer to next DN in pipeline from primary slowed/stops (going to timeout) ii. DN-1 writes after putting bytes into network buffer iii. bytes make it to first DN disk, but do not leave OS network stack iv. DN-1 comes up before NN starts hard expiry lease recovery v. we use the other DNs length which is shorter or do I misunderstand? 
20 append: Blocks recovered on startup should be treated with lower priority during block synchronization - Key: HDFS-1218 URL: https://issues.apache.org/jira/browse/HDFS-1218 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Fix For: 0.20-append Attachments: hdfs-1281.txt When a datanode experiences power loss, it can come back up with truncated replicas (due to local FS journal replay). Those replicas should not be allowed to truncate the block during block synchronization if there are other replicas from DNs that have _not_ restarted. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1218) 20 append: Blocks recovered on startup should be treated with lower priority during block synchronization
[ https://issues.apache.org/jira/browse/HDFS-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881011#action_12881011 ] sam rash commented on HDFS-1218: ah, interesting. so the point of this fix isn't to get the best block, but to maintain sync semantics? 20 append: Blocks recovered on startup should be treated with lower priority during block synchronization - Key: HDFS-1218 URL: https://issues.apache.org/jira/browse/HDFS-1218 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Fix For: 0.20-append Attachments: hdfs-1281.txt When a datanode experiences power loss, it can come back up with truncated replicas (due to local FS journal replay). Those replicas should not be allowed to truncate the block during block synchronization if there are other replicas from DNs that have _not_ restarted. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1218) 20 append: Blocks recovered on startup should be treated with lower priority during block synchronization
[ https://issues.apache.org/jira/browse/HDFS-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881025#action_12881025 ] sam rash commented on HDFS-1218: re: the patch shouldn't the skipping of RWRs be inside the else block? if keepLength is passed by a client, the fact the block length matches should be the sole criterion for accepting it right? there is not a notion of better. (tho, I don't think it will ever be the case that we have RWRs participating in a client-initiated recovery. soft expiry even comes from the NN where keepLength=false) {code} if (!shouldRecoverRwrs && info.wasRecoveredOnStartup()) { LOG.info("Not recovering replica " + record + " since it was recovered on " + "startup and we have better replicas"); continue; } if (keepLength) { if (info.getBlock().getNumBytes() == block.getNumBytes()) { syncList.add(record); } } else { syncList.add(record); if (info.getBlock().getNumBytes() < minlength) { minlength = info.getBlock().getNumBytes(); } } {code} 20 append: Blocks recovered on startup should be treated with lower priority during block synchronization - Key: HDFS-1218 URL: https://issues.apache.org/jira/browse/HDFS-1218 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Fix For: 0.20-append Attachments: hdfs-1281.txt When a datanode experiences power loss, it can come back up with truncated replicas (due to local FS journal replay). Those replicas should not be allowed to truncate the block during block synchronization if there are other replicas from DNs that have _not_ restarted. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
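The selection logic discussed in the patch comment above can be reduced to a standalone sketch. This is a simplified illustration, not the real `FSDataset`/recovery code: replicas are modeled as parallel arrays of lengths and recovered-on-startup (RWR) flags, and all names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of building the sync list: startup-recovered (RWR) replicas are
// skipped when non-restarted replicas exist, and under keepLength an exact
// length match is the sole admission criterion.
class BlockSyncSelection {
    static List<Long> buildSyncList(long[] lengths, boolean[] rwr,
                                    boolean haveNonRwr, boolean keepLength,
                                    long expectedLen) {
        List<Long> syncList = new ArrayList<>();
        for (int i = 0; i < lengths.length; i++) {
            if (haveNonRwr && rwr[i]) {
                // Skip replicas recovered on startup; better replicas exist.
                continue;
            }
            if (keepLength) {
                // Client-initiated (append) recovery: exact match only.
                if (lengths[i] == expectedLen) {
                    syncList.add(lengths[i]);
                }
            } else {
                // NN-initiated recovery: admit all, truncate to the minimum.
                syncList.add(lengths[i]);
            }
        }
        return syncList;
    }
}
```

Placing the RWR skip before the `keepLength` branch, as in the quoted patch, means an RWR replica with a matching length is also excluded; the comment's question is whether that exclusion should apply only in the `else` (NN-initiated) path.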
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880278#action_12880278 ] sam rash commented on HDFS-1057: So removing setBytesOnDisk() means: {code} if (replicaInfo instanceof ReplicaBeingWritten) { ((ReplicaBeingWritten) replicaInfo) .setLastChecksumAndDataLen(offsetInBlock, lastChunkChecksum); } replicaInfo.setBytesOnDisk(offsetInBlock); {code} will not have the latter. So all other implementations of Replica will have a valid value for getBytesOnDisk()? Does this also mean that the impl of getBytesOnDisk for ReplicaInPipeline will move to ReplicaBeingWritten? Concurrent readers hit ChecksumExceptions if following a writer to very end of file --- Key: HDFS-1057 URL: https://issues.apache.org/jira/browse/HDFS-1057 Project: Hadoop HDFS Issue Type: Sub-task Components: data-node Affects Versions: 0.20-append, 0.21.0, 0.22.0 Reporter: Todd Lipcon Assignee: sam rash Priority: Blocker Attachments: conurrent-reader-patch-1.txt, conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, hdfs-1057-trunk-3.txt In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before calling flush(). Therefore, if there is a concurrent reader, it's possible to race here - the reader will see the new length while those bytes are still in the buffers of BlockReceiver. Thus the client will potentially see checksum errors or EOFs. Additionally, the last checksum chunk of the file is made accessible to readers even though it is not stable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880291#action_12880291 ] sam rash commented on HDFS-1057: another way to ask this: only ReplicaBeingWritten needs to have the bytes on disk set in BlockRecevier? Concurrent readers hit ChecksumExceptions if following a writer to very end of file --- Key: HDFS-1057 URL: https://issues.apache.org/jira/browse/HDFS-1057 Project: Hadoop HDFS Issue Type: Sub-task Components: data-node Affects Versions: 0.20-append, 0.21.0, 0.22.0 Reporter: Todd Lipcon Assignee: sam rash Priority: Blocker Attachments: conurrent-reader-patch-1.txt, conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, hdfs-1057-trunk-3.txt In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before calling flush(). Therefore, if there is a concurrent reader, it's possible to race here - the reader will see the new length while those bytes are still in the buffers of BlockReceiver. Thus the client will potentially see checksum errors or EOFs. Additionally, the last checksum chunk of the file is made accessible to readers even though it is not stable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
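The consistency property being negotiated in the exchange above (bytes-on-disk and the last partial-chunk checksum must never be observed out of sync) can be sketched with a single synchronized setter. This is a toy model under assumed names, not the actual `ReplicaBeingWritten` implementation.

```java
// Sketch: update the on-disk length and the in-memory last-chunk checksum
// in one synchronized method, so a concurrent reader can never see a new
// length paired with a stale checksum.
class RbwReplicaSketch {
    private long bytesOnDisk;
    private byte[] lastChecksum = new byte[0];

    synchronized void setLastChecksumAndDataLen(long dataLen, byte[] checksum) {
        this.bytesOnDisk = dataLen;   // length and checksum change together
        this.lastChecksum = checksum;
    }

    synchronized long getBytesOnDisk() { return bytesOnDisk; }

    synchronized byte[] getLastChecksum() { return lastChecksum; }
}
```

A separate unsynchronized `setBytesOnDisk` on the same object would reintroduce the race, which is why the discussion favors never calling it for a replica being written.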
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880346#action_12880346 ] sam rash commented on HDFS-1057: got it. i will make the changes and get a patch soon Concurrent readers hit ChecksumExceptions if following a writer to very end of file --- Key: HDFS-1057 URL: https://issues.apache.org/jira/browse/HDFS-1057 Project: Hadoop HDFS Issue Type: Sub-task Components: data-node Affects Versions: 0.20-append, 0.21.0, 0.22.0 Reporter: Todd Lipcon Assignee: sam rash Priority: Blocker Attachments: conurrent-reader-patch-1.txt, conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, hdfs-1057-trunk-3.txt In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before calling flush(). Therefore, if there is a concurrent reader, it's possible to race here - the reader will see the new length while those bytes are still in the buffers of BlockReceiver. Thus the client will potentially see checksum errors or EOFs. Additionally, the last checksum chunk of the file is made accessible to readers even though it is not stable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879460#action_12879460 ] sam rash commented on HDFS-1057: 1. they aren't guaranteed to be since there are methods to change the bytesOnDisk separate from the lastCheckSum bytes. It's entirely conceivable that something could update the bytes on disk w/o updating the lastChecksum with the current set of methods If we are ok with a loosely coupled guarantee, then we can use bytesOnDisk and be careful never to call setBytesOnDisk() for any RBW 2. oh--your previous comments indicated we shouldn't change either ReplicaInPipelineInterface or ReplicaInPipeline. If that's not the case and we can do this, then my comment above doesn't hold. we use bytesOnDisk and guarantee it's in sync with the checksum in a single synchronized method (I like this) 3. will make the update to treat missing last blocks as 0-length and re-instate the unit test. thanks for all the help on this Concurrent readers hit ChecksumExceptions if following a writer to very end of file --- Key: HDFS-1057 URL: https://issues.apache.org/jira/browse/HDFS-1057 Project: Hadoop HDFS Issue Type: Sub-task Components: data-node Affects Versions: 0.20-append, 0.21.0, 0.22.0 Reporter: Todd Lipcon Assignee: sam rash Priority: Blocker Attachments: conurrent-reader-patch-1.txt, conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, hdfs-1057-trunk-3.txt In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before calling flush(). Therefore, if there is a concurrent reader, it's possible to race here - the reader will see the new length while those bytes are still in the buffers of BlockReceiver. Thus the client will potentially see checksum errors or EOFs. Additionally, the last checksum chunk of the file is made accessible to readers even though it is not stable. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12876380#action_12876380 ] sam rash commented on HDFS-1057: I am addressing the last comments. I have one more question, though, as I have one test that still fails and I want to see what you think the expected behavior should be: immediate read of a new file: 1. writer creates a file and starts to write and hence blocks are assigned in the NN 2. a reader gets these locations and contacts DN 3. DN has not yet put the replica in the volumeMap and FSDataset.getVisibleLength() throws a MissingReplicaException In 0.20, I made it so that the client just treats this as a 0-length file. What should the behavior in trunk be? Concurrent readers hit ChecksumExceptions if following a writer to very end of file --- Key: HDFS-1057 URL: https://issues.apache.org/jira/browse/HDFS-1057 Project: Hadoop HDFS Issue Type: Sub-task Components: data-node Affects Versions: 0.21.0, 0.22.0, 0.20-append Reporter: Todd Lipcon Assignee: sam rash Priority: Blocker Attachments: conurrent-reader-patch-1.txt, conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before calling flush(). Therefore, if there is a concurrent reader, it's possible to race here - the reader will see the new length while those bytes are still in the buffers of BlockReceiver. Thus the client will potentially see checksum errors or EOFs. Additionally, the last checksum chunk of the file is made accessible to readers even though it is not stable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12876410#action_12876410 ] sam rash commented on HDFS-1057: hmm, i can remove the test case. one of our internal tools saw this rather frequently in 0.20--maybe in trunk it's far less likely? Concurrent readers hit ChecksumExceptions if following a writer to very end of file --- Key: HDFS-1057 URL: https://issues.apache.org/jira/browse/HDFS-1057 Project: Hadoop HDFS Issue Type: Sub-task Components: data-node Affects Versions: 0.21.0, 0.22.0, 0.20-append Reporter: Todd Lipcon Assignee: sam rash Priority: Blocker Attachments: conurrent-reader-patch-1.txt, conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before calling flush(). Therefore, if there is a concurrent reader, it's possible to race here - the reader will see the new length while those bytes are still in the buffers of BlockReceiver. Thus the client will potentially see checksum errors or EOFs. Additionally, the last checksum chunk of the file is made accessible to readers even though it is not stable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam rash updated HDFS-1057: --- Attachment: hdfs-1057-trunk-3.txt 1. endOffset is either bytesOnDisk or the chunkChecksum.getDataLength() 2. if tmpLen == endOffset this is a write in progress, use the in-memory checksum (else this is a finalized block not ending on a chunk boundary) 3. fixed up whitespace Concurrent readers hit ChecksumExceptions if following a writer to very end of file --- Key: HDFS-1057 URL: https://issues.apache.org/jira/browse/HDFS-1057 Project: Hadoop HDFS Issue Type: Sub-task Components: data-node Affects Versions: 0.21.0, 0.22.0, 0.20-append Reporter: Todd Lipcon Assignee: sam rash Priority: Blocker Attachments: conurrent-reader-patch-1.txt, conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, hdfs-1057-trunk-3.txt In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before calling flush(). Therefore, if there is a concurrent reader, it's possible to race here - the reader will see the new length while those bytes are still in the buffers of BlockReceiver. Thus the client will potentially see checksum errors or EOFs. Additionally, the last checksum chunk of the file is made accessible to readers even though it is not stable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
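The two points in the trunk-3 patch note above can be illustrated with a small sketch. The names and the `Math.max` formulation are assumptions made for the illustration, not the actual BlockSender code.

```java
// Sketch of the read-path decision: endOffset extends past bytesOnDisk when
// an in-memory checksum covers a longer (partial) last chunk, and a read
// that reaches endOffset of a replica being written must use the in-memory
// checksum, since the on-disk checksum of that chunk is not yet stable.
class LastChunkChecksumChoice {
    /** endOffset: bytesOnDisk, or the data length covered by the in-memory
     *  last-chunk checksum, whichever is larger. */
    static long endOffset(long bytesOnDisk, long lastChunkDataLen) {
        return Math.max(bytesOnDisk, lastChunkDataLen);
    }

    /** True when the requested read range ends at the unstable last chunk. */
    static boolean useInMemoryChecksum(long readEnd, long endOffset) {
        return readEnd == endOffset;
    }
}
```

A read ending short of `endOffset` touches only finalized chunks and can rely on the checksum file as written.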
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12876424#action_12876424 ] sam rash commented on HDFS-1057: also disabled the test on immediate read of a new file for now. if we want to change how this is handled, I can enable it Concurrent readers hit ChecksumExceptions if following a writer to very end of file --- Key: HDFS-1057 URL: https://issues.apache.org/jira/browse/HDFS-1057 Project: Hadoop HDFS Issue Type: Sub-task Components: data-node Affects Versions: 0.21.0, 0.22.0, 0.20-append Reporter: Todd Lipcon Assignee: sam rash Priority: Blocker Attachments: conurrent-reader-patch-1.txt, conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, hdfs-1057-trunk-3.txt In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before calling flush(). Therefore, if there is a concurrent reader, it's possible to race here - the reader will see the new length while those bytes are still in the buffers of BlockReceiver. Thus the client will potentially see checksum errors or EOFs. Additionally, the last checksum chunk of the file is made accessible to readers even though it is not stable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam rash updated HDFS-1057: --- Attachment: hdfs-1057-trunk-2.txt 1. new endOffset calc includes determining if in-memory checksum is needed 2. added methods to RBW only to set/get last checksum and data length -track this dataLength separate as setBytesOnDisk may be called independently and make the length/byte[] not match (in theory bytes on disk *could* be set to more and we still want a checksum + the corresponding length kept) 3. appropriate changes around waiting for start + length did not remove all replicaVisibleLength uses yet--want to clarify what to replace them with in pre-existing code. Concurrent readers hit ChecksumExceptions if following a writer to very end of file --- Key: HDFS-1057 URL: https://issues.apache.org/jira/browse/HDFS-1057 Project: Hadoop HDFS Issue Type: Sub-task Components: data-node Affects Versions: 0.21.0, 0.22.0, 0.20-append Reporter: Todd Lipcon Assignee: sam rash Priority: Blocker Attachments: conurrent-reader-patch-1.txt, conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before calling flush(). Therefore, if there is a concurrent reader, it's possible to race here - the reader will see the new length while those bytes are still in the buffers of BlockReceiver. Thus the client will potentially see checksum errors or EOFs. Additionally, the last checksum chunk of the file is made accessible to readers even though it is not stable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12875801#action_12875801 ] sam rash commented on HDFS-1057: Thanks for the quick review. I understand most of the comments, but have a couple of questions: 1. replicaVisibleLength was here before I made any changes. Why is it not valid? I understood it to be an upper bound on the bytes that could be read from this block. Is it the case that start + length = replicaVisibleLength and we want to optimize? (the for loop to wait for bytes on disk = visible length was here before, I just moved it earlier in the constructor) 2. not sure I understand endOffset. This was again a variable that already existed. What I thought you were getting at was the condition to decide if we should use the in-memory checksum or not (which is what you describe). 3. If we don't put the sync set/get method in ReplicaInPipelineInterface, we will have to use an if/else construct on instanceof in BlockReceiver and call one or the other. I can see the argument for keeping the method out of the interface since it is RBW-specific, but on the other hand, it's effectively a no-op for other implementers of the interface and leads to cleaner code (better natural polymorphism than if-else constructs to force it). either way, just wanted to throw that out there as a question of style Concurrent readers hit ChecksumExceptions if following a writer to very end of file --- Key: HDFS-1057 URL: https://issues.apache.org/jira/browse/HDFS-1057 Project: Hadoop HDFS Issue Type: Sub-task Components: data-node Affects Versions: 0.21.0, 0.22.0, 0.20-append Reporter: Todd Lipcon Assignee: sam rash Priority: Blocker Attachments: conurrent-reader-patch-1.txt, conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, hdfs-1057-trunk-1.txt In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before calling flush().
Therefore, if there is a concurrent reader, it's possible to race here - the reader will see the new length while those bytes are still in the buffers of BlockReceiver. Thus the client will potentially see checksum errors or EOFs. Additionally, the last checksum chunk of the file is made accessible to readers even though it is not stable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
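Question 3 above is a style trade-off that can be shown in isolation. The following is a minimal, hypothetical sketch, with simplified names modeled on ReplicaInPipelineInterface and ReplicaBeingWritten rather than the actual Hadoop classes: declaring the setter on the interface lets BlockReceiver call it polymorphically, while implementers without RBW state treat it as a no-op.

```java
// Simplified sketch, not the real Hadoop types.
interface ReplicaInPipelineSketch {
    void setLastChecksumAndDataLen(long dataLen, byte[] lastChecksum);
}

// the RBW case actually records the length and the partial chunk's checksum
class RbwReplicaSketch implements ReplicaInPipelineSketch {
    private long bytesOnDisk;
    private byte[] lastChecksum;

    public synchronized void setLastChecksumAndDataLen(long dataLen, byte[] lastChecksum) {
        this.bytesOnDisk = dataLen;
        this.lastChecksum = lastChecksum;
    }

    public synchronized long getBytesOnDisk() { return bytesOnDisk; }
}

// other implementers have no in-flight chunk to track; the call is a no-op,
// so the caller needs no instanceof branching at the call site
class NonRbwReplicaSketch implements ReplicaInPipelineSketch {
    public void setLastChecksumAndDataLen(long dataLen, byte[] lastChecksum) { }
}
```

The alternative is `if (replica instanceof RbwReplicaSketch) { ... }` at every call site in BlockReceiver, which is exactly the if/else construct the comment argues against.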
[jira] Commented: (HDFS-1142) Lease recovery doesn't reassign lease when triggered by append()
[ https://issues.apache.org/jira/browse/HDFS-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12875303#action_12875303 ] sam rash commented on HDFS-1142: A small side note re: killing writers: recovery does so *after* getting metadata, so there is still a window in which the client could start another lease recovery, have it complete, and then start writing and call sync; the first lease recovery then kills the threads and truncates the block based on the first set of lengths. This violates sync/hflush semantics. I don't know if there's a jira for this, but I had planned to make the change so the writers *are* killed first thing, before getting metadata.

Lease recovery doesn't reassign lease when triggered by append() Key: HDFS-1142 URL: https://issues.apache.org/jira/browse/HDFS-1142 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.21.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hdfs-1142.txt, hdfs-1142.txt If a soft lease has expired and another writer calls append(), it triggers lease recovery but doesn't reassign the lease to a new owner. Therefore, the old writer can continue to allocate new blocks, try to steal back the lease, etc. This is for the testRecoveryOnBlockBoundary case of HDFS-1139
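The ordering the comment proposes can be sketched abstractly. This is a hypothetical illustration, not the actual NameNode/DataNode recovery code; all names are made up. The invariant is simply that writer threads are stopped before replica lengths are read, so no replica can grow between the length snapshot and the truncate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the recovery ordering described above; names are
// illustrative, not the real Hadoop lease-recovery code. Recording the call
// order makes the "kill writers first" invariant checkable.
class LeaseRecoverySketch {
    final List<String> calls = new ArrayList<>();

    long recoverBlock(long[] replicaLengths) {
        stopWriters();  // kill writer threads FIRST: nothing can extend a replica now
        long target = Long.MAX_VALUE;
        for (long len : snapshotLengths(replicaLengths)) {
            target = Math.min(target, len);  // truncate target = min replica length
        }
        truncateTo(target);
        return target;
    }

    void stopWriters()                { calls.add("stopWriters"); }
    long[] snapshotLengths(long[] ls) { calls.add("getLengths"); return ls; }
    void truncateTo(long target)      { calls.add("truncate"); }
}
```

In the pre-fix ordering the comment objects to, snapshotLengths would run before stopWriters, leaving the window in which a client could sync more data that the truncate then discards.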
[jira] Commented: (HDFS-1142) Lease recovery doesn't reassign lease when triggered by append()
[ https://issues.apache.org/jira/browse/HDFS-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12875313#action_12875313 ] sam rash commented on HDFS-1142: todd: ah yeah, I had trunk open and just checked: it's exactly that way. Nice. We do need HDFS-1186, though.
[jira] Commented: (HDFS-1142) Lease recovery doesn't reassign lease when triggered by append()
[ https://issues.apache.org/jira/browse/HDFS-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12873212#action_12873212 ] sam rash commented on HDFS-1142: konstantin: never mind my last comment, I misread yours (and forgot that 0.22 == trunk).
[jira] Updated: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam rash updated HDFS-1057: Attachment: hdfs-1057-trunk-1.txt. Ported the patch to trunk (hairong's idea of storing the last checksum).
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871285#action_12871285 ] sam rash commented on HDFS-1057: @hairong: I'm looking a little at implementing this in trunk (reading your append/hflush doc from HDFS-265), and I have a question. From above: "In each ReplicaBeingWritten, we could have two more fields to keep track of the last consistent state: the replica length and the last chunk's crc." Why does there need to be another length field? Isn't getVisibleLength() == acked bytes sufficient? If the crc stored in the RBW is for that length, you only need the additional byte[] field holding the last chunk's crc, I think. ReplicaBeingWritten.setBytesAcked() could take the crc and atomically set the length + crc bytes.
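The atomic update suggested in that last sentence can be sketched minimally. This is a hypothetical sketch, with field and method names modeled on the discussion rather than the real ReplicaBeingWritten: the acked length and the last partial chunk's crc are written and read under one lock, so a reader can never pair a new length with a stale checksum.

```java
// Hypothetical sketch: acked length and last-chunk crc updated atomically.
class RbwSketch {
    private long bytesAcked;
    private byte[] lastChunkCrc;

    // writer side: one synchronized call sets both fields together
    synchronized void setBytesAcked(long bytesAcked, byte[] lastChunkCrc) {
        this.bytesAcked = bytesAcked;
        this.lastChunkCrc = (lastChunkCrc == null) ? null : lastChunkCrc.clone();
    }

    // reader side: each snapshot is taken under the same lock
    synchronized long getVisibleLength() { return bytesAcked; }

    synchronized byte[] getLastChunkCrc() {
        return (lastChunkCrc == null) ? null : lastChunkCrc.clone();
    }
}
```

The defensive clone() matters because the writer reuses its checksum buffer; handing the reader a reference instead of a copy would reintroduce the very race being fixed.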
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871288#action_12871288 ] sam rash commented on HDFS-1057: Hmm, looking at the code more, I see that this depends on how many bytes we want to make available to readers:

- visible length (bytes acked): needed for a consistent view of the data, I think
- bytes on disk: this seems like it would give inconsistent reads; and in theory, acked data may be *more* than what is on disk for a given node (if I read the doc + code right). But then how can a DN send data that's not on disk unless it's made available via memory? (very complex)
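The trade-off in this comment reduces to a one-liner. As a hypothetical sketch (not Hadoop code): if a node only ever serves bytes it has flushed, the length exposed to a reader is capped at bytes on disk, even when more bytes have been acked by the pipeline.

```java
// Hypothetical sketch: cap what a reader may see at what this replica has
// actually flushed, since pipeline-acked bytes can exceed this node's
// on-disk bytes.
class ReadableLengthSketch {
    static long readableLength(long bytesAcked, long bytesOnDisk) {
        return Math.min(bytesAcked, bytesOnDisk);
    }
}
```

Serving the full acked length instead would require the DN to hand out data still sitting in BlockReceiver's buffers, which is the "made available via memory" complexity the comment flags.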