[jira] Commented: (HDFS-1031) Enhance the webUi to list a few of the corrupted files in HDFS
[ https://issues.apache.org/jira/browse/HDFS-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859234#action_12859234 ] Hadoop QA commented on HDFS-1031: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12442391/hdfs-1031_aoriani_4.patch against trunk revision 936132. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/321/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/321/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/321/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/321/console This message is automatically generated. > Enhance the webUi to list a few of the corrupted files in HDFS > -- > > Key: HDFS-1031 > URL: https://issues.apache.org/jira/browse/HDFS-1031 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: dhruba borthakur >Assignee: André Oriani > Attachments: hdfs-1031_aoriani.patch, hdfs-1031_aoriani_2.patch, > hdfs-1031_aoriani_3.patch, hdfs-1031_aoriani_4.patch > > > The existing webUI displays something like this: > WARNING : There are about 12 missing blocks. 
Please check the log or run > fsck. > It would be nice if we can display the filenames that have missing blocks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-142) In 0.20, move blocks being written into a blocksBeingWritten directory
[ https://issues.apache.org/jira/browse/HDFS-142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-142: - Summary: In 0.20, move blocks being written into a blocksBeingWritten directory (was: Datanode should delete files under tmp when upgraded from 0.17) Renaming the JIRA to reflect the actual scope of this issue in the branch-20 sync work > In 0.20, move blocks being written into a blocksBeingWritten directory > -- > > Key: HDFS-142 > URL: https://issues.apache.org/jira/browse/HDFS-142 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Raghu Angadi >Assignee: dhruba borthakur >Priority: Blocker > Attachments: appendQuestions.txt, deleteTmp.patch, deleteTmp2.patch, > deleteTmp5_20.txt, deleteTmp5_20.txt, deleteTmp_0.18.patch, handleTmp1.patch, > hdfs-142-minidfs-fix-from-409.txt, > HDFS-142-multiple-blocks-datanode-exception.patch, HDFS-142_20.patch, > testfileappend4-deaddn.txt > > > Before 0.18, when the Datanode restarts, it deletes files under the data-dir/tmp > directory since these files are not valid anymore. But in 0.18 it moves these > files to the normal directory, incorrectly making them valid blocks. One of the > following would work : > - remove the tmp files during upgrade, or > - if the files under /tmp are in pre-18 format (i.e. no generation), delete > them. > Currently the effect of this bug is that these files end up failing block > verification and eventually get deleted, but they cause incorrect over-replication > at the namenode before that. > Also it looks like our policy regarding treating files under tmp needs to be > defined better. Right now there are probably one or two more bugs with it. > Dhruba, please file them if you remember. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-142) Datanode should delete files under tmp when upgraded from 0.17
[ https://issues.apache.org/jira/browse/HDFS-142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-142: - Attachment: testfileappend4-deaddn.txt I found a bug in the append code where it doesn't work properly with the following sequence: - open a file for write - write some data - close it - the DN with the lowest name dies, but is not yet marked dead on the NN - a client calls append() to try to recover the lease (not knowing that the file isn't currently under construction) In this case, the client ends up thinking it has opened the file for append, and there's a new lease on the NN side, but on the client side it's in an error state where close() will throw IOE (and not close the new lease). Attaching a new test case for TestFileAppend4 for this situation. > Datanode should delete files under tmp when upgraded from 0.17 > -- > > Key: HDFS-142 > URL: https://issues.apache.org/jira/browse/HDFS-142 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Raghu Angadi >Assignee: dhruba borthakur >Priority: Blocker > Attachments: appendQuestions.txt, deleteTmp.patch, deleteTmp2.patch, > deleteTmp5_20.txt, deleteTmp5_20.txt, deleteTmp_0.18.patch, handleTmp1.patch, > hdfs-142-minidfs-fix-from-409.txt, > HDFS-142-multiple-blocks-datanode-exception.patch, HDFS-142_20.patch, > testfileappend4-deaddn.txt > > > Before 0.18, when the Datanode restarts, it deletes files under the data-dir/tmp > directory since these files are not valid anymore. But in 0.18 it moves these > files to the normal directory, incorrectly making them valid blocks. One of the > following would work : > - remove the tmp files during upgrade, or > - if the files under /tmp are in pre-18 format (i.e. no generation), delete > them. > Currently the effect of this bug is that these files end up failing block > verification and eventually get deleted, but they cause incorrect over-replication > at the namenode before that. 
> Also it looks like our policy regarding treating files under tmp needs to be > defined better. Right now there are probably one or two more bugs with it. > Dhruba, please file them if you remember. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-875) NameNode incorrectly handles corrupt replicas
[ https://issues.apache.org/jira/browse/HDFS-875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859218#action_12859218 ] Todd Lipcon commented on HDFS-875: -- Is this related/the same as HDFS-900? > NameNode incorrectly handles corrupt replicas > > > Key: HDFS-875 > URL: https://issues.apache.org/jira/browse/HDFS-875 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.21.0, 0.22.0 >Reporter: Hairong Kuang > Fix For: 0.21.0, 0.22.0 > > > I reviewed how NameNode handles corrupt replicas as part of work on HDFS-145. > Compared to releases prior to 0.21, NameNode now does a good job identifying > corrupt replicas, but it seems to me there are two flaws in how it handles the > corrupt replicas: > 1. NameNode does not add corrupt replicas to the block locations as it did before; > 2. If the corruption is caused by generation stamp mismatch or state > mismatch, the wrong GS and state do not get put in corruptReplicasMap. > Therefore it may lead to the deletion of the wrong replica. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
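The second flaw above (the wrong GS not being recorded, risking deletion of the wrong replica) suggests keying the corrupt-replica record on the reported generation stamp as well as the datanode. A minimal sketch with hypothetical names — not the actual corruptReplicasMap code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: recording the generation stamp the datanode
// reported lets the NameNode invalidate exactly the mismatched replica,
// instead of possibly deleting a replica that has the correct GS.
class CorruptReplicaTracker {
    // blockId -> (datanode -> generation stamp the datanode reported)
    private final Map<Long, Map<String, Long>> corrupt = new HashMap<>();

    void markCorrupt(long blockId, String datanode, long reportedGenStamp) {
        corrupt.computeIfAbsent(blockId, k -> new HashMap<>())
               .put(datanode, reportedGenStamp);
    }

    // True only for the exact (datanode, genstamp) pair that was flagged,
    // so a replica re-reported with the correct genstamp is not deleted.
    boolean shouldInvalidate(long blockId, String datanode, long genStamp) {
        Map<String, Long> m = corrupt.get(blockId);
        return m != null && m.containsKey(datanode) && m.get(datanode) == genStamp;
    }
}
```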
[jira] Updated: (HDFS-966) NameNode recovers lease even in safemode
[ https://issues.apache.org/jira/browse/HDFS-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhruba borthakur updated HDFS-966: -- Status: Patch Available (was: Open) > NameNode recovers lease even in safemode > > > Key: HDFS-966 > URL: https://issues.apache.org/jira/browse/HDFS-966 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Reporter: dhruba borthakur >Assignee: dhruba borthakur > Attachments: leaseRecoverSafeMode.txt, leaseRecoverSafeMode2.txt > > > The NameNode recovers a lease even when it is in safemode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-966) NameNode recovers lease even in safemode
[ https://issues.apache.org/jira/browse/HDFS-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhruba borthakur updated HDFS-966: -- Status: Open (was: Patch Available) > NameNode recovers lease even in safemode > > > Key: HDFS-966 > URL: https://issues.apache.org/jira/browse/HDFS-966 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Reporter: dhruba borthakur >Assignee: dhruba borthakur > Attachments: leaseRecoverSafeMode.txt, leaseRecoverSafeMode2.txt > > > The NameNode recovers a lease even when it is in safemode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-966) NameNode recovers lease even in safemode
[ https://issues.apache.org/jira/browse/HDFS-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859192#action_12859192 ] dhruba borthakur commented on HDFS-966: --- The failed unit test is datanode.TestDiskError and is not connected to this patch, but I will resubmit this patch again. > NameNode recovers lease even in safemode > > > Key: HDFS-966 > URL: https://issues.apache.org/jira/browse/HDFS-966 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Reporter: dhruba borthakur >Assignee: dhruba borthakur > Attachments: leaseRecoverSafeMode.txt, leaseRecoverSafeMode2.txt > > > The NameNode recovers a lease even when it is in safemode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1031) Enhance the webUi to list a few of the corrupted files in HDFS
[ https://issues.apache.org/jira/browse/HDFS-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859188#action_12859188 ] André Oriani commented on HDFS-1031: In case hudson is still not adding test results to Jira, the build is http://hudson.zones.apache.org/hudson/view/Hdfs/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/321 > Enhance the webUi to list a few of the corrupted files in HDFS > -- > > Key: HDFS-1031 > URL: https://issues.apache.org/jira/browse/HDFS-1031 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: dhruba borthakur >Assignee: André Oriani > Attachments: hdfs-1031_aoriani.patch, hdfs-1031_aoriani_2.patch, > hdfs-1031_aoriani_3.patch, hdfs-1031_aoriani_4.patch > > > The existing webUI displays something like this: > WARNING : There are about 12 missing blocks. Please check the log or run > fsck. > It would be nice if we can display the filenames that have missing blocks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1031) Enhance the webUi to list a few of the corrupted files in HDFS
[ https://issues.apache.org/jira/browse/HDFS-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] André Oriani updated HDFS-1031: --- Attachment: hdfs-1031_aoriani_4.patch Suggestions applied. File list sorted. Some changes made due to semantic issues (listed files are not potentially corrupt, but in fact corrupt; the list is potentially incomplete). > Enhance the webUi to list a few of the corrupted files in HDFS > -- > > Key: HDFS-1031 > URL: https://issues.apache.org/jira/browse/HDFS-1031 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: dhruba borthakur >Assignee: André Oriani > Attachments: hdfs-1031_aoriani.patch, hdfs-1031_aoriani_2.patch, > hdfs-1031_aoriani_3.patch, hdfs-1031_aoriani_4.patch > > > The existing webUI displays something like this: > WARNING : There are about 12 missing blocks. Please check the log or run > fsck. > It would be nice if we can display the filenames that have missing blocks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
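The semantics settled in this update (every listed file really is corrupt; the list itself may be incomplete) amount to showing a sorted, bounded prefix of the known-corrupt files. A sketch with hypothetical names, not the actual patch code:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of the web UI listing: return at most 'limit'
// corrupt file names, sorted, so the page stays small even when many
// files are corrupt. Truncation only makes the list incomplete; it never
// lists a file that is not corrupt.
class CorruptFilesSummary {
    static List<String> firstN(Collection<String> corruptFiles, int limit) {
        List<String> sorted = new ArrayList<>(corruptFiles);
        Collections.sort(sorted);   // the patch sorts the file list
        return sorted.subList(0, Math.min(limit, sorted.size()));
    }
}
```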
[jira] Updated: (HDFS-1031) Enhance the webUi to list a few of the corrupted files in HDFS
[ https://issues.apache.org/jira/browse/HDFS-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] André Oriani updated HDFS-1031: --- Status: Patch Available (was: Open) > Enhance the webUi to list a few of the corrupted files in HDFS > -- > > Key: HDFS-1031 > URL: https://issues.apache.org/jira/browse/HDFS-1031 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: dhruba borthakur >Assignee: André Oriani > Attachments: hdfs-1031_aoriani.patch, hdfs-1031_aoriani_2.patch, > hdfs-1031_aoriani_3.patch, hdfs-1031_aoriani_4.patch > > > The existing webUI displays something like this: > WARNING : There are about 12 missing blocks. Please check the log or run > fsck. > It would be nice if we can display the filenames that have missing blocks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1031) Enhance the webUi to list a few of the corrupted files in HDFS
[ https://issues.apache.org/jira/browse/HDFS-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] André Oriani updated HDFS-1031: --- Status: Open (was: Patch Available) > Enhance the webUi to list a few of the corrupted files in HDFS > -- > > Key: HDFS-1031 > URL: https://issues.apache.org/jira/browse/HDFS-1031 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: dhruba borthakur >Assignee: André Oriani > Attachments: hdfs-1031_aoriani.patch, hdfs-1031_aoriani_2.patch, > hdfs-1031_aoriani_3.patch > > > The existing webUI displays something like this: > WARNING : There are about 12 missing blocks. Please check the log or run > fsck. > It would be nice if we can display the filenames that have missing blocks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditLog.write operations corrupts edits log
[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shvachko updated HDFS-909: - Fix Version/s: 0.20.3 > Race condition between rollEditLog or rollFSImage and FSEditLog.write > operations corrupts edits log > - > > Key: HDFS-909 > URL: https://issues.apache.org/jira/browse/HDFS-909 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0 > Environment: CentOS >Reporter: Cosmin Lehene >Assignee: Todd Lipcon >Priority: Blocker > Fix For: 0.20.3, 0.21.0, 0.22.0 > > Attachments: hdfs-909-ammendation.txt, hdfs-909-branch-0.20.txt, > hdfs-909-branch-0.20.txt, hdfs-909-branch-0.20.txt, hdfs-909-branch-0.21.txt, > hdfs-909-branch-0.21.txt, hdfs-909-unified.txt, hdfs-909-unittest.txt, > hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, > hdfs-909.txt > > > Closing the edits log file can race with a write to the edits log file, > resulting in the OP_INVALID end-of-file marker being initially overwritten by the > concurrent threads (in setReadyToFlush) and then removed twice from the > buffer, losing a good byte from the edits log. 
> Example: > {code} > FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> > FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush() > FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> > FSEditLog.closeStream() -> EditLogOutputStream.flush() -> > EditLogFileOutputStream.flushAndSync() > OR > FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> > FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> > FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush() > FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> > FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> > FSEditLog.closeStream() -> EditLogOutputStream.flush() -> > EditLogFileOutputStream.flushAndSync() > VERSUS > FSNameSystem.completeFile -> FSEditLog.logSync() -> > EditLogOutputStream.setReadyToFlush() > FSNameSystem.completeFile -> FSEditLog.logSync() -> > EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync() > OR > Any FSEditLog.write > {code} > Access to the edits flush operations is synchronized only at the > FSEditLog.logSync() method level. However, at a lower level, access to > EditLogOutputStream's setReadyToFlush(), flush(), or flushAndSync() is NOT > synchronized. These can be called from concurrent threads, as in the example > above. > So if a rollEditLog or rollFSImage happens at the same time as a write > operation, it can race for EditLogFileOutputStream.setReadyToFlush, which will > overwrite the last byte (normally FSEditLog.OP_INVALID, the > "end-of-file marker") and then remove it twice (once from each thread) in > flushAndSync()! Hence a valid byte will be missing from the edits log, > which leads to a silent SecondaryNameNode failure and a full HDFS failure upon > cluster restart. > We got to this point after investigating a corrupted edits file that made > HDFS unable to start with > {code:title=namenode.log} > java.io.IOException: Incorrect data format. 
logVersion is -20 but > writables.length is 768. > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450 > {code} > EDIT: moved the logs to a comment to make this readable -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
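The remedy implied by the description above is to guard the buffer swap with the same lock that writers take, so the end-of-file marker is trimmed exactly once per roll. A simplified model of the double-buffered log — hypothetical names, with a one-character 'I' standing in for OP_INVALID, not the real FSEditLog code:

```java
// Simplified model of a double-buffered edit log. Because write(),
// setReadyToFlush(), and flush() all synchronize on the same object,
// a roll can no longer interleave with a concurrent logSync() and
// remove the end-of-file marker twice (losing a good byte).
class DoubleBufferedLog {
    private StringBuilder bufCurrent = new StringBuilder().append('I'); // 'I' = end-of-file marker
    private StringBuilder bufReady = new StringBuilder();

    synchronized void write(char op) {
        bufCurrent.setLength(bufCurrent.length() - 1); // drop the marker
        bufCurrent.append(op).append('I');             // append op, restore marker
    }

    synchronized void setReadyToFlush() {
        bufCurrent.setLength(bufCurrent.length() - 1); // trim marker exactly once
        StringBuilder tmp = bufReady;                  // swap the buffers
        bufReady = bufCurrent;
        bufCurrent = tmp;
        bufCurrent.append('I');                        // fresh buffer gets a marker
    }

    synchronized String flush() {
        String out = bufReady.toString();
        bufReady.setLength(0);
        return out;
    }
}
```

Without the `synchronized` on setReadyToFlush(), two threads entering it concurrently could each execute the trim step, which is exactly the double removal the issue describes.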
[jira] Updated: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditLog.write operations corrupts edits log
[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shvachko updated HDFS-909: - Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed I just committed this. Thank you Todd. > Race condition between rollEditLog or rollFSImage and FSEditLog.write > operations corrupts edits log > - > > Key: HDFS-909 > URL: https://issues.apache.org/jira/browse/HDFS-909 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0 > Environment: CentOS >Reporter: Cosmin Lehene >Assignee: Todd Lipcon >Priority: Blocker > Fix For: 0.21.0, 0.22.0 > > Attachments: hdfs-909-ammendation.txt, hdfs-909-branch-0.20.txt, > hdfs-909-branch-0.20.txt, hdfs-909-branch-0.20.txt, hdfs-909-branch-0.21.txt, > hdfs-909-branch-0.21.txt, hdfs-909-unified.txt, hdfs-909-unittest.txt, > hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, > hdfs-909.txt > > > Closing the edits log file can race with a write to the edits log file, > resulting in the OP_INVALID end-of-file marker being initially overwritten by the > concurrent threads (in setReadyToFlush) and then removed twice from the > buffer, losing a good byte from the edits log. 
> Example: > {code} > FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> > FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush() > FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> > FSEditLog.closeStream() -> EditLogOutputStream.flush() -> > EditLogFileOutputStream.flushAndSync() > OR > FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> > FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> > FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush() > FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> > FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> > FSEditLog.closeStream() -> EditLogOutputStream.flush() -> > EditLogFileOutputStream.flushAndSync() > VERSUS > FSNameSystem.completeFile -> FSEditLog.logSync() -> > EditLogOutputStream.setReadyToFlush() > FSNameSystem.completeFile -> FSEditLog.logSync() -> > EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync() > OR > Any FSEditLog.write > {code} > Access to the edits flush operations is synchronized only at the > FSEditLog.logSync() method level. However, at a lower level, access to > EditLogOutputStream's setReadyToFlush(), flush(), or flushAndSync() is NOT > synchronized. These can be called from concurrent threads, as in the example > above. > So if a rollEditLog or rollFSImage happens at the same time as a write > operation, it can race for EditLogFileOutputStream.setReadyToFlush, which will > overwrite the last byte (normally FSEditLog.OP_INVALID, the > "end-of-file marker") and then remove it twice (once from each thread) in > flushAndSync()! Hence a valid byte will be missing from the edits log, > which leads to a silent SecondaryNameNode failure and a full HDFS failure upon > cluster restart. > We got to this point after investigating a corrupted edits file that made > HDFS unable to start with > {code:title=namenode.log} > java.io.IOException: Incorrect data format. 
logVersion is -20 but > writables.length is 768. > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450 > {code} > EDIT: moved the logs to a comment to make this readable -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditLog.write operations corrupts edits log
[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shvachko updated HDFS-909: - Attachment: hdfs-909-branch-0.21.txt > Race condition between rollEditLog or rollFSImage and FSEditLog.write > operations corrupts edits log > - > > Key: HDFS-909 > URL: https://issues.apache.org/jira/browse/HDFS-909 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0 > Environment: CentOS >Reporter: Cosmin Lehene >Assignee: Todd Lipcon >Priority: Blocker > Fix For: 0.21.0, 0.22.0 > > Attachments: hdfs-909-ammendation.txt, hdfs-909-branch-0.20.txt, > hdfs-909-branch-0.20.txt, hdfs-909-branch-0.20.txt, hdfs-909-branch-0.21.txt, > hdfs-909-branch-0.21.txt, hdfs-909-unified.txt, hdfs-909-unittest.txt, > hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, > hdfs-909.txt > > > Closing the edits log file can race with a write to the edits log file, > resulting in the OP_INVALID end-of-file marker being initially overwritten by the > concurrent threads (in setReadyToFlush) and then removed twice from the > buffer, losing a good byte from the edits log. 
> Example: > {code} > FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> > FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush() > FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> > FSEditLog.closeStream() -> EditLogOutputStream.flush() -> > EditLogFileOutputStream.flushAndSync() > OR > FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> > FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> > FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush() > FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> > FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> > FSEditLog.closeStream() -> EditLogOutputStream.flush() -> > EditLogFileOutputStream.flushAndSync() > VERSUS > FSNameSystem.completeFile -> FSEditLog.logSync() -> > EditLogOutputStream.setReadyToFlush() > FSNameSystem.completeFile -> FSEditLog.logSync() -> > EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync() > OR > Any FSEditLog.write > {code} > Access to the edits flush operations is synchronized only at the > FSEditLog.logSync() method level. However, at a lower level, access to > EditLogOutputStream's setReadyToFlush(), flush(), or flushAndSync() is NOT > synchronized. These can be called from concurrent threads, as in the example > above. > So if a rollEditLog or rollFSImage happens at the same time as a write > operation, it can race for EditLogFileOutputStream.setReadyToFlush, which will > overwrite the last byte (normally FSEditLog.OP_INVALID, the > "end-of-file marker") and then remove it twice (once from each thread) in > flushAndSync()! Hence a valid byte will be missing from the edits log, > which leads to a silent SecondaryNameNode failure and a full HDFS failure upon > cluster restart. > We got to this point after investigating a corrupted edits file that made > HDFS unable to start with > {code:title=namenode.log} > java.io.IOException: Incorrect data format. 
logVersion is -20 but > writables.length is 768. > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450 > {code} > EDIT: moved the logs to a comment to make this readable -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1102) HftpFileSystem : errors during transfer result in truncated transfer
[ https://issues.apache.org/jira/browse/HDFS-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859171#action_12859171 ] sam rash commented on HDFS-1102: actually, i was discussing this with another friend and they pointed out that we don't even need to change how hftp works. even w/chunked encoding, we should be able to verify on the client since it'll send: size1\n size2\n 0 if we see fewer than size_N bytes or do not see the 0, we missed data. the underlying http client *should* handle this. if not, we can switch to: http://hc.apache.org/ which apparently is better than using java.net.URL's underlying connection client. > HftpFileSystem : errors during transfer result in truncated transfer > > > Key: HDFS-1102 > URL: https://issues.apache.org/jira/browse/HDFS-1102 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node, hdfs client >Affects Versions: 0.20.1 >Reporter: sam rash > > If an error occurs transferring the data over HTTP, the HftpInputStream does > not know it received fewer bytes than the file contains. We can at least > detect this and throw an EOFException when this occurs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
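The verification sam describes — each chunk announces its size, and a zero-size chunk terminates the body — can be sketched as a chunked-body decoder that fails loudly on truncation. This is a hypothetical illustration, not the actual HftpFileSystem code:

```java
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: decode an HTTP/1.1 chunked body and throw
// EOFException if the stream ends before the terminal 0-size chunk,
// i.e. exactly the silent truncation this issue is about.
class ChunkedReader {
    static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (true) {
            int size = Integer.parseInt(readLine(in).trim(), 16); // chunk-size line, hex
            if (size == 0) return out.toByteArray();              // clean end of body
            for (int i = 0; i < size; i++) {
                int b = in.read();
                if (b < 0) throw new EOFException("truncated chunk"); // fewer than size_N bytes
                out.write(b);
            }
            readLine(in); // consume the CRLF after the chunk data
        }
    }

    private static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            if (b != '\r') sb.append((char) b);
        }
        if (b == -1 && sb.length() == 0) throw new EOFException("missing terminal 0-size chunk");
        return sb.toString();
    }
}
```

A clean body always ends with the 0-size chunk, so any early EOF surfaces as an EOFException instead of a silently short read, which is the detection the issue asks for.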
[jira] Commented: (HDFS-966) NameNode recovers lease even in safemode
[ https://issues.apache.org/jira/browse/HDFS-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859168#action_12859168 ] Hadoop QA commented on HDFS-966: -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12442369/leaseRecoverSafeMode2.txt against trunk revision 936024. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/157/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/157/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/157/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/157/console This message is automatically generated. > NameNode recovers lease even in safemode > > > Key: HDFS-966 > URL: https://issues.apache.org/jira/browse/HDFS-966 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Reporter: dhruba borthakur >Assignee: dhruba borthakur > Attachments: leaseRecoverSafeMode.txt, leaseRecoverSafeMode2.txt > > > The NameNode recovers a lease even when it is in safemode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-780) Revive TestFuseDFS
[ https://issues.apache.org/jira/browse/HDFS-780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eli Collins updated HDFS-780: - Attachment: hdfs-780-1.patch Attached a patch that fixes up all the build files to get the test running again. The test itself fails due to HDFS-940 and some issues with the java code in the test itself. Run with: {code} $ ant -Dcompile.c++=true -Dlibhdfs=true compile $ ant -Dlibhdfs=1 -Dfusedfs=1 test-contrib {code} > Revive TestFuseDFS > -- > > Key: HDFS-780 > URL: https://issues.apache.org/jira/browse/HDFS-780 > Project: Hadoop HDFS > Issue Type: Test > Components: contrib/fuse-dfs >Reporter: Eli Collins >Assignee: Eli Collins > Attachments: hdfs-780-1.patch > > > Looks like TestFuseDFS has bit rot. Let's revive it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1101) TestDiskError.testLocalDirs() fails
[ https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859156#action_12859156 ] Hadoop QA commented on HDFS-1101: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12442342/H1101-1.patch against trunk revision 936024. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/320/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/320/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/320/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/320/console This message is automatically generated. > TestDiskError.testLocalDirs() fails > --- > > Key: HDFS-1101 > URL: https://issues.apache.org/jira/browse/HDFS-1101 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.22.0 >Reporter: Konstantin Shvachko >Assignee: Konstantin Shvachko > Fix For: 0.22.0 > > Attachments: H1101-0.patch, H1101-1.patch, TestDiskErrorLocal.patch > > > {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}} -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1052) HDFS scalability with multiple namenodes
[ https://issues.apache.org/jira/browse/HDFS-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sanjay Radia updated HDFS-1052: --- Attachment: Mulitple Namespaces5.pdf Minor updates to the doc (plus name change). > HDFS scalability with multiple namenodes > > > Key: HDFS-1052 > URL: https://issues.apache.org/jira/browse/HDFS-1052 > Project: Hadoop HDFS > Issue Type: New Feature > Components: name-node >Affects Versions: 0.22.0 >Reporter: Suresh Srinivas >Assignee: Suresh Srinivas > Attachments: Block pool proposal.pdf, Mulitple Namespaces5.pdf > > > HDFS currently uses a single namenode that limits scalability of the cluster. > This jira proposes an architecture to scale the nameservice horizontally > using multiple namenodes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-909: - Attachment: hdfs-909-unified.txt hdfs-909-branch-0.20.txt Here's a unified patch for trunk (the one you committed to trunk plus the test case fixes) Also branch 20 patch that addresses the two eclipse warnings you found. > Race condition between rollEditLog or rollFSImage ant FSEditsLog.write > operations corrupts edits log > - > > Key: HDFS-909 > URL: https://issues.apache.org/jira/browse/HDFS-909 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0 > Environment: CentOS >Reporter: Cosmin Lehene >Assignee: Todd Lipcon >Priority: Blocker > Fix For: 0.21.0, 0.22.0 > > Attachments: hdfs-909-ammendation.txt, hdfs-909-branch-0.20.txt, > hdfs-909-branch-0.20.txt, hdfs-909-branch-0.20.txt, hdfs-909-branch-0.21.txt, > hdfs-909-unified.txt, hdfs-909-unittest.txt, hdfs-909.txt, hdfs-909.txt, > hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt > > > closing the edits log file can race with write to edits log file operation > resulting in OP_INVALID end-of-file marker being initially overwritten by the > concurrent (in setReadyToFlush) threads and then removed twice from the > buffer, losing a good byte from edits log. 
> Example: > {code} > FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> > FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush() > FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> > FSEditLog.closeStream() -> EditLogOutputStream.flush() -> > EditLogFileOutputStream.flushAndSync() > OR > FSNameSystem.rollFSImage() -> FSIMage.rollFSImage() -> > FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> > FSEditLog.closeStream() ->EditLogOutputStream.setReadyToFlush() > FSNameSystem.rollFSImage() -> FSIMage.rollFSImage() -> > FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> > FSEditLog.closeStream() ->EditLogOutputStream.flush() -> > EditLogFileOutputStream.flushAndSync() > VERSUS > FSNameSystem.completeFile -> FSEditLog.logSync() -> > EditLogOutputStream.setReadyToFlush() > FSNameSystem.completeFile -> FSEditLog.logSync() -> > EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync() > OR > Any FSEditLog.write > {code} > Access on the edits flush operations is synchronized only in the > FSEdits.logSync() method level. However at a lower level access to > EditsLogOutputStream setReadyToFlush(), flush() or flushAndSync() is NOT > synchronized. These can be called from concurrent threads like in the example > above > So if a rollEditLog or rollFSIMage is happening at the same time with a write > operation it can race for EditLogFileOutputStream.setReadyToFlush that will > overwrite the the last byte (normally the FSEditsLog.OP_INVALID which is the > "end-of-file marker") and then remove it twice (from each thread) in > flushAndSync()! Hence there will be a valid byte missing from the edits log > that leads to a SecondaryNameNode silent failure and a full HDFS failure upon > cluster restart. > We got to this point after investigating a corrupted edits file that made > HDFS unable to start with > {code:title=namenode.log} > java.io.IOException: Incorrect data format. 
logVersion is -20 but > writables.length is 768. > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450 > {code} > EDIT: moved the logs to a comment to make this readable -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
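The double removal of the end-of-file marker described above comes down to an unsynchronized buffer hand-off. Below is a minimal sketch, not actual HDFS code: the class name and marker handling are modeled loosely on FSEditLog's double-buffered output stream. It shows how making setReadyToFlush() and flushAndSync() synchronized on the stream guarantees the OP_INVALID marker is appended and stripped exactly once per flush:

```java
import java.io.ByteArrayOutputStream;

// Hypothetical double-buffered edit stream: setReadyToFlush() swaps the
// current and ready buffers, and flushAndSync() strips the trailing
// end-of-file marker before persisting. Without the synchronized
// keyword, two racing threads could each strip a byte, losing a good
// byte of edits -- the failure mode described in this issue.
class DoubleBufferedEditStream {
    static final byte OP_INVALID = -1;          // end-of-file marker

    private ByteArrayOutputStream bufCurrent = new ByteArrayOutputStream();
    private ByteArrayOutputStream bufReady   = new ByteArrayOutputStream();
    private final ByteArrayOutputStream file = new ByteArrayOutputStream();

    synchronized void write(byte op) { bufCurrent.write(op); }

    // Terminate the batch and swap buffers; one thread at a time.
    synchronized void setReadyToFlush() {
        bufCurrent.write(OP_INVALID);
        ByteArrayOutputStream tmp = bufReady;
        bufReady = bufCurrent;
        bufCurrent = tmp;
    }

    // Strip exactly one trailing OP_INVALID and persist the rest.
    synchronized void flushAndSync() {
        byte[] data = bufReady.toByteArray();
        bufReady.reset();
        file.write(data, 0, data.length - 1);
    }

    int persistedBytes() { return file.size(); }
}

public class EditLogRaceSketch {
    public static void main(String[] args) {
        DoubleBufferedEditStream s = new DoubleBufferedEditStream();
        s.write((byte) 1);
        s.write((byte) 2);
        s.setReadyToFlush();
        s.flushAndSync();
        System.out.println(s.persistedBytes()); // prints 2
    }
}
```

With both methods synchronized on the stream, a roll (divertFileStreams/revertFileStreams) and a concurrent logSync can no longer both strip a marker byte from the same buffer.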
[jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859133#action_12859133 ] Konstantin Shvachko commented on HDFS-909: -- - The issue is not closed, so it would be better to have a unified patch, rather than doing 2 commits. I don't mind recommitting. - Test for 0.20 passes fine now. Found 2 (eclipse) warnings in TestEditLogRace: -- Method {{getFormattedFSImage()}} is not used anywhere. -- Static method {{setBufferCapacity()}} should be called in static manner, like {{FSEditLog.setBufferCapacity()}} - I understand Tom's plan for 0.21. It does not hurt to commit though. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
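The second eclipse warning above (a static method invoked in instance style) can be illustrated in isolation. The class below is purely illustrative, not the actual FSEditLog code:

```java
// Calling a static method through an instance reference compiles, but
// hides the fact that the call does not depend on the instance at all;
// eclipse flags it for that reason. Names here are hypothetical.
class EditLogConfig {
    private static int bufferCapacity = 512 * 1024;

    static void setBufferCapacity(int capacity) { bufferCapacity = capacity; }
    static int getBufferCapacity() { return bufferCapacity; }
}

public class StaticCallStyle {
    public static void main(String[] args) {
        // Discouraged (instance-style call on a static method):
        //   new EditLogConfig().setBufferCapacity(1024);
        // Preferred: call through the class name.
        EditLogConfig.setBufferCapacity(1024);
        System.out.println(EditLogConfig.getBufferCapacity()); // prints 1024
    }
}
```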
[jira] Updated: (HDFS-966) NameNode recovers lease even in safemode
[ https://issues.apache.org/jira/browse/HDFS-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhruba borthakur updated HDFS-966: -- Attachment: leaseRecoverSafeMode2.txt Merged patch with latest trunk. > NameNode recovers lease even in safemode > > > Key: HDFS-966 > URL: https://issues.apache.org/jira/browse/HDFS-966 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Reporter: dhruba borthakur >Assignee: dhruba borthakur > Attachments: leaseRecoverSafeMode.txt, leaseRecoverSafeMode2.txt > > > The NameNode recovers a lease even when it is in safemode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-966) NameNode recovers lease even in safemode
[ https://issues.apache.org/jira/browse/HDFS-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhruba borthakur updated HDFS-966: -- Status: Patch Available (was: Open) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859125#action_12859125 ] Todd Lipcon commented on HDFS-909: -- hdfs-909-ammendation.txt goes with this comment above: https://issues.apache.org/jira/browse/HDFS-909?focusedCommentId=12859069&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12859069 (the test as committed in trunk is flaky as well, this is a patch against trunk that fixes it. The bug is just in the test, though, not the code itself) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859123#action_12859123 ] Konstantin Shvachko commented on HDFS-909: -- What is hdfs-909-ammendation.txt for? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859086#action_12859086 ] Todd Lipcon commented on HDFS-909: -- bq. Not sure how much 0.21 is abandoned. I hear people use it with HBase. Here is the patch. The plan for HBase 0.20.5 is to work against Tom's new 21 release or a 20 with HDFS-200 applied, not the current 21 branch. I checked with Cosmin and he is OK moving to what's now trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-909: - Attachment: hdfs-909-branch-0.20.txt Updated branch-20 patch with same changes (plus cleanup of the changes I accidentally left in before) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859071#action_12859071 ] Konstantin Shvachko commented on HDFS-909: -- FSEditLog.java imports org.apache.tools.ant.taskdefs.WaitFor in your patch for 0.20. As you see I've already committed the other two branches. So it would be good to finish this sooner rather than later. Thanks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-909: - Attachment: hdfs-909-ammendation.txt It turns out the test on trunk was flaky as well. The issue was that we were calling saveNamespace directly on the FSImage while also performing edits from the Transactions threads. This is exactly the behavior we're trying to avoid by forcing the NN into safemode first. Also, we were calling verifyEdits() on an edit log that was being simultaneously written to, which is likely to fail if it reads a partial edit. This patch against trunk does the following: - Bumps up the number of rolls and saves to 30 instead of 10, since obviously 10 wasn't enough to have it fail reliably. - Replaces use of the FSN log with the test's own log - Changes the transaction threads to operate via FSN rather than logging directly to the edit log. - Any exceptions thrown by the edits will cause the test to properly fail To verify this fix, I temporarily bumped the constants for number of rolls up to 200 and checked that it passed. This failed sometimes for me without HADOOP-6717, a trivial patch which reduces the amount of log output from new security code. I'll separately amend the branch-20 patch with the same changes. 
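The test-hardening idea in this comment (force safemode before saveNamespace, and never verify an edit log while transaction threads are still writing it) boils down to quiescing writers before checking shared state. A minimal sketch of that pattern, with illustrative names rather than the actual TestEditLogRace code:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Quiesce-before-verify: stop the writer thread (the analogue of
// putting the NameNode into safemode) and join it before inspecting
// the state it was mutating. Verifying while writes are in flight is
// what made the original test flaky.
public class QuiesceBeforeVerify {
    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean running = new AtomicBoolean(true);
        AtomicInteger txCount = new AtomicInteger();

        // Stands in for the test's transaction threads logging edits.
        Thread writer = new Thread(() -> {
            while (running.get()) {
                txCount.incrementAndGet();   // "log an edit"
            }
        });
        writer.start();
        Thread.sleep(50);

        // Quiesce first, then verify: only after join() is the count
        // guaranteed stable.
        running.set(false);
        writer.join();
        int observed = txCount.get();
        Thread.sleep(10);
        assert observed == txCount.get();    // no concurrent mutation now
        System.out.println("verified " + observed + " transactions");
    }
}
```

Any exception thrown by the writer can likewise be captured in a field and rethrown after join(), so edit failures fail the test properly, as the patch description requires.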
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1102) HftpFileSystem : errors during transfer result in truncated transfer
[ https://issues.apache.org/jira/browse/HDFS-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859056#action_12859056 ] sam rash commented on HDFS-1102: Well, it's not without modifying hftp, but a client can verify the content length if it's there and proceed as it does now if it's not present. We could also have a config option, say "enforce content length", which would make a missing content length throw an IOException. That way, if both the client and server are on the latest hftp, this works; otherwise it works as before. Offhand, I'm not sure how to do this without either changing hftp or wrapping it in some other protocol that does length checking. > HftpFileSystem : errors during transfer result in truncated transfer > > > Key: HDFS-1102 > URL: https://issues.apache.org/jira/browse/HDFS-1102 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node, hdfs client >Affects Versions: 0.20.1 >Reporter: sam rash > > If an error occurs transferring the data over HTTP, the HftpInputStream does > not know it received fewer bytes than the file contains. We can at least > detect this and throw an EOFException when this occurs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1101) TestDiskError.testLocalDirs() fails
[ https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859049#action_12859049 ] Konstantin Shvachko commented on HDFS-1101: --- I agree, this looks better. Thanks. +1 for the patch. > TestDiskError.testLocalDirs() fails > --- > > Key: HDFS-1101 > URL: https://issues.apache.org/jira/browse/HDFS-1101 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.22.0 >Reporter: Konstantin Shvachko >Assignee: Konstantin Shvachko > Fix For: 0.22.0 > > Attachments: H1101-0.patch, H1101-1.patch, TestDiskErrorLocal.patch > > > {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1102) HftpFileSystem : errors during transfer result in truncated transfer
[ https://issues.apache.org/jira/browse/HDFS-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859030#action_12859030 ] dhruba borthakur commented on HDFS-1102: > It would be nice if we can fix this without changing the hftp protocol. any idea on how this can be done? > HftpFileSystem : errors during transfer result in truncated transfer > > > Key: HDFS-1102 > URL: https://issues.apache.org/jira/browse/HDFS-1102 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node, hdfs client >Affects Versions: 0.20.1 >Reporter: sam rash > > If an error occurs transferring the data over HTTP, the HftpInputStream does > not know it received fewer bytes than the file contains. We can at least > detect this and throw an EOFException when this occurs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HDFS-1102) HftpFileSystem : errors during transfer result in truncated transfer
[ https://issues.apache.org/jira/browse/HDFS-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi resolved HDFS-1102. Resolution: Duplicate Duplicate of HDFS-1085. It would be nice if we can fix this without changing the hftp protocol. > HftpFileSystem : errors during transfer result in truncated transfer > > > Key: HDFS-1102 > URL: https://issues.apache.org/jira/browse/HDFS-1102 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node, hdfs client >Affects Versions: 0.20.1 >Reporter: sam rash > > If an error occurs transferring the data over HTTP, the HftpInputStream does > not know it received fewer bytes than the file contains. We can at least > detect this and throw an EOFException when this occurs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1087) Use StringBuilder instead of Formatter for audit logs
[ https://issues.apache.org/jira/browse/HDFS-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Douglas updated HDFS-1087: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Assignee: Chris Douglas Fix Version/s: 0.22.0 Resolution: Fixed I committed this. Thanks for the review, Nicholas > Use StringBuilder instead of Formatter for audit logs > - > > Key: HDFS-1087 > URL: https://issues.apache.org/jira/browse/HDFS-1087 > Project: Hadoop HDFS > Issue Type: Improvement > Components: name-node >Reporter: Chris Douglas >Assignee: Chris Douglas >Priority: Minor > Fix For: 0.22.0 > > Attachments: H1087-0.patch, H1087-1.patch, H1087-2.patch > > > The audit logs do not use any {{format}} functionality that cannot be > replaced by a simple, more efficient set of appends. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1101) TestDiskError.testLocalDirs() fails
[ https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Douglas updated HDFS-1101: Status: Open (was: Patch Available) > TestDiskError.testLocalDirs() fails > --- > > Key: HDFS-1101 > URL: https://issues.apache.org/jira/browse/HDFS-1101 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.22.0 >Reporter: Konstantin Shvachko >Assignee: Konstantin Shvachko > Fix For: 0.22.0 > > Attachments: H1101-0.patch, H1101-1.patch, TestDiskErrorLocal.patch > > > {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1101) TestDiskError.testLocalDirs() fails
[ https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Douglas updated HDFS-1101: Status: Patch Available (was: Open) > TestDiskError.testLocalDirs() fails > --- > > Key: HDFS-1101 > URL: https://issues.apache.org/jira/browse/HDFS-1101 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.22.0 >Reporter: Konstantin Shvachko >Assignee: Konstantin Shvachko > Fix For: 0.22.0 > > Attachments: H1101-0.patch, H1101-1.patch, TestDiskErrorLocal.patch > > > {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1101) TestDiskError.testLocalDirs() fails
[ https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Douglas updated HDFS-1101: Attachment: H1101-1.patch Forgot to include Konstantin's javac warning fixes > TestDiskError.testLocalDirs() fails > --- > > Key: HDFS-1101 > URL: https://issues.apache.org/jira/browse/HDFS-1101 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.22.0 >Reporter: Konstantin Shvachko >Assignee: Konstantin Shvachko > Fix For: 0.22.0 > > Attachments: H1101-0.patch, H1101-1.patch, TestDiskErrorLocal.patch > > > {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1101) TestDiskError.testLocalDirs() fails
[ https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Douglas updated HDFS-1101: Attachment: H1101-0.patch I'm sorry, I missed this in review. Though the current patch works for this case with only 1 datanode, pulling it from the DataNode is closer to the intent of the test and doesn't modify MiniDFSCluster. > TestDiskError.testLocalDirs() fails > --- > > Key: HDFS-1101 > URL: https://issues.apache.org/jira/browse/HDFS-1101 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.22.0 >Reporter: Konstantin Shvachko >Assignee: Konstantin Shvachko > Fix For: 0.22.0 > > Attachments: H1101-0.patch, TestDiskErrorLocal.patch > > > {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859001#action_12859001 ] Todd Lipcon commented on HDFS-909: -- Hi Konstantin, Thanks for the review. It does seem like the test for branch-20 occasionally fails - I had it passing here, but it's flaky and doesn't pass every time. Let me dig into this and upload a new fixed patch. bq. What org.apache.tools.ant.taskdefs.WaitFor is used for? No idea where this came from. I've been trying out Eclipse recently instead of my usual vim, and haven't gotten used to cleaning up after its "smarts" :) Will double-check the next patch for such cruft as well. > Race condition between rollEditLog or rollFSImage and FSEditsLog.write > operations corrupts edits log > - > > Key: HDFS-909 > URL: https://issues.apache.org/jira/browse/HDFS-909 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0 > Environment: CentOS >Reporter: Cosmin Lehene >Assignee: Todd Lipcon >Priority: Blocker > Fix For: 0.21.0, 0.22.0 > > Attachments: hdfs-909-branch-0.20.txt, hdfs-909-branch-0.21.txt, > hdfs-909-unittest.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, > hdfs-909.txt, hdfs-909.txt, hdfs-909.txt > > > closing the edits log file can race with write to edits log file operation > resulting in OP_INVALID end-of-file marker being initially overwritten by the > concurrent (in setReadyToFlush) threads and then removed twice from the > buffer, losing a good byte from edits log.
[jira] Commented: (HDFS-1101) TestDiskError.testLocalDirs() fails
[ https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858848#action_12858848 ] Hadoop QA commented on HDFS-1101: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12442252/TestDiskErrorLocal.patch against trunk revision 935778. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/319/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/319/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/319/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/319/console This message is automatically generated. > TestDiskError.testLocalDirs() fails > --- > > Key: HDFS-1101 > URL: https://issues.apache.org/jira/browse/HDFS-1101 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.22.0 >Reporter: Konstantin Shvachko >Assignee: Konstantin Shvachko > Fix For: 0.22.0 > > Attachments: TestDiskErrorLocal.patch > > > {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}} -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-481) Bug Fixes + HdfsProxy to use proxy user to impersonate the real user
[ https://issues.apache.org/jira/browse/HDFS-481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srikanth Sundarrajan updated HDFS-481: -- Attachment: HDFS-481-bp-y20s.patch Incremental back port to fix broken unit tests in y20.1xx & y20.101. Tests are broken due to * Missing super user setup when Mini DFS Cluster starts * Missing src/test/resources folder * UserGroupInformation class depending on krb5.conf in the system. (Bypassing that through krb5.conf in ${hadoop.core}/src/test - contrib/hdfsproxy/build.xml change) This patch needs to be applied incrementally over the HDFS-481-NEW.patch. > Bug Fixes + HdfsProxy to use proxy user to impersonate the real user > > > Key: HDFS-481 > URL: https://issues.apache.org/jira/browse/HDFS-481 > Project: Hadoop HDFS > Issue Type: Bug > Components: contrib/hdfsproxy >Affects Versions: 0.21.0 >Reporter: zhiyong zhang >Assignee: Srikanth Sundarrajan > Fix For: 0.22.0 > > Attachments: HDFS-481-bp-y20.patch, HDFS-481-bp-y20.patch, > HDFS-481-bp-y20s.patch, HDFS-481-bp-y20s.patch, HDFS-481-bp-y20s.patch, > HDFS-481-NEW.patch, HDFS-481.out, HDFS-481.patch, HDFS-481.patch, > HDFS-481.patch, HDFS-481.patch, HDFS-481.patch, HDFS-481.patch, > HDFS-481.patch, HDFS-481.patch > > > Bugs: > 1. hadoop-version is not recognized when running the ant command from src/contrib/ or > from src/contrib/hdfsproxy. > If running the ant command from $HADOOP_HDFS_HOME, hadoop-version will be passed > to contrib's build through subant. But if running from src/contrib or > src/contrib/hdfsproxy, the hadoop-version will not be recognized. > 2. LdapIpDirFilter.java is not thread safe. userName, Group & Paths are per > request and can't be class members. > 3. Addressed the following StackOverflowError.
> ERROR [org.apache.catalina.core.ContainerBase.[Catalina].[localh > ost].[/].[proxyForward]] Servlet.service() for servlet proxyForward threw > exception > java.lang.StackOverflowError > at > org.apache.catalina.core.ApplicationHttpRequest.getAttribute(ApplicationHttpR > equest.java:229) > This happens when the target war (/target.war) does not exist: the > forwarding war forwards to its parent context path /, which defines the > forwarding war itself. This causes an infinite loop. Added "HDFS Proxy > Forward".equals(dstContext.getServletContextName()) in the if logic to break > the loop. > 4. Kerberos credentials of the remote user aren't available. HdfsProxy needs to > act on behalf of the real user to service the requests. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
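The loop guard described in item 3 above can be sketched in isolation. This is a hypothetical simplification, not the actual HdfsProxy servlet code: the context name "HDFS Proxy Forward" comes from the comment above, while the class and method names here are invented for illustration.

```java
// Toy version of the forwarding guard: when the requested target war is
// missing, context resolution falls back to "/", i.e. the forwarder's own
// context, and forwarding there would recurse until StackOverflowError.
class ForwardGuard {
    // Name of the forwarding webapp itself (taken from the JIRA comment).
    static final String FORWARDER_NAME = "HDFS Proxy Forward";

    // Returns true only when the resolved target context is a real,
    // distinct webapp that it is safe to forward the request to.
    static boolean shouldForward(String dstContextName) {
        return dstContextName != null && !FORWARDER_NAME.equals(dstContextName);
    }
}
```

In the real servlet this check would sit in the `if` that decides whether to call `RequestDispatcher.forward`, breaking the self-forwarding cycle.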
[jira] Commented: (HDFS-1031) Enhance the webUi to list a few of the corrupted files in HDFS
[ https://issues.apache.org/jira/browse/HDFS-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858793#action_12858793 ] Rodrigo Schmidt commented on HDFS-1031: --- Side effect in my opinion! > Enhance the webUi to list a few of the corrupted files in HDFS > -- > > Key: HDFS-1031 > URL: https://issues.apache.org/jira/browse/HDFS-1031 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: dhruba borthakur >Assignee: André Oriani > Attachments: hdfs-1031_aoriani.patch, hdfs-1031_aoriani_2.patch, > hdfs-1031_aoriani_3.patch > > > The existing webUI displays something like this: > WARNING : There are about 12 missing blocks. Please check the log or run > fsck. > It would be nice if we can display the filenames that have missing blocks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1102) HftpFileSystem : errors during transfer result in truncated transfer
[ https://issues.apache.org/jira/browse/HDFS-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858750#action_12858750 ] sam rash commented on HDFS-1102: Proposed solution: StreamFile: the datanode will set the content length in the response header. HftpInputStream: the read() method will verify, when the underlying input stream from the HTTP connection returns -1, that it has received all the bytes, else throw an EOFException. > HftpFileSystem : errors during transfer result in truncated transfer > > > Key: HDFS-1102 > URL: https://issues.apache.org/jira/browse/HDFS-1102 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node, hdfs client >Affects Versions: 0.20.1 >Reporter: sam rash > > If an error occurs transferring the data over HTTP, the HftpInputStream does > not know it received fewer bytes than the file contains. We can at least > detect this and throw an EOFException when this occurs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
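The client side of the proposal above can be sketched as a small wrapper stream. This is an illustrative sketch, not the actual HftpFileSystem internals: the class name and the `expected == -1` convention for "no Content-Length header" are assumptions.

```java
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Wraps the HTTP body stream and, on EOF, compares bytes actually read
// against the advertised Content-Length; a short read becomes a loud
// EOFException instead of a silently truncated transfer.
class LengthCheckedStream extends FilterInputStream {
    private final long expected; // -1 when the server sent no Content-Length
    private long bytesRead;

    LengthCheckedStream(InputStream in, long expectedLength) {
        super(in);
        this.expected = expectedLength;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) bytesRead++;
        else checkComplete();
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) bytesRead += n;
        else if (n < 0) checkComplete();
        return n;
    }

    // At EOF: if a length was advertised and fewer bytes arrived, the
    // transfer was cut off mid-flight; older servers (expected == -1)
    // behave exactly as before.
    private void checkComplete() throws EOFException {
        if (expected >= 0 && bytesRead < expected) {
            throw new EOFException("got " + bytesRead + " of " + expected + " bytes");
        }
    }
}
```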
[jira] Created: (HDFS-1102) HftpFileSystem : errors during transfer result in truncated transfer
HftpFileSystem : errors during transfer result in truncated transfer Key: HDFS-1102 URL: https://issues.apache.org/jira/browse/HDFS-1102 Project: Hadoop HDFS Issue Type: Bug Components: data-node, hdfs client Affects Versions: 0.20.1 Reporter: sam rash If an error occurs transferring the data over HTTP, the HftpInputStream does not know it received fewer bytes than the file contains. We can at least detect this and throw an EOFException when this occurs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1102) HftpFileSystem : errors during transfer result in truncated transfer
[ https://issues.apache.org/jira/browse/HDFS-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858748#action_12858748 ] sam rash commented on HDFS-1102: the log entry in the datanode: 2010-04-19 16:42:59,072 ERROR org.mortbay.log: /streamFile java.nio.channels.CancelledKeyException at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55) at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59) at org.mortbay.io.nio.SelectChannelEndPoint.updateKey(SelectChannelEndPoint.java:324) at org.mortbay.io.nio.SelectChannelEndPoint.blockWritable(SelectChannelEndPoint.java:278) at org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:542) at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:569) at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:946) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:646) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:577) at org.apache.hadoop.hdfs.server.namenode.StreamFile.doGet(StreamFile.java:73) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1124) at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:669) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417) at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:324) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522) > HftpFileSystem : errors during transfer result in truncated transfer > > > Key: HDFS-1102 > URL: https://issues.apache.org/jira/browse/HDFS-1102 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node, hdfs client >Affects Versions: 0.20.1 >Reporter: sam rash > > If an error occurs transferring the data over HTTP, the HftpInputStream does > not know it received fewer bytes than the file contains. We can at least > detect this and throw an EOFException when this occurs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-940) libhdfs uses UnixUserGroupInformation
[ https://issues.apache.org/jira/browse/HDFS-940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eli Collins updated HDFS-940: - Summary: libhdfs uses UnixUserGroupInformation (was: libhdfs test uses UnixUserGroupInformation) Fix Version/s: 0.21.0 Description: libhdfs uses the non-existent class UnixUserGroupInformation. (was: The libhdfs test fails with the following, needs to be updated since UnixUserGroupInformation was removed. [exec] failed to construct hadoop user unix group info object [exec] Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.security.UserGroupInformation: method ()V not found [exec] at org.apache.hadoop.security.UnixUserGroupInformation.(UnixUserGroupInformation.java:69) [exec] Call to org/apache/hadoop/security/UnixUserGroupInformation failed! [exec] Oops! Failed to connect to hdfs as user nobody! ) > libhdfs uses UnixUserGroupInformation > - > > Key: HDFS-940 > URL: https://issues.apache.org/jira/browse/HDFS-940 > Project: Hadoop HDFS > Issue Type: Bug > Components: contrib/libhdfs >Affects Versions: 0.22.0 >Reporter: Eli Collins >Priority: Blocker > Fix For: 0.21.0, 0.22.0 > > > libhdfs uses the non-existent class UnixUserGroupInformation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1101) TestDiskError.testLocalDirs() fails
[ https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shvachko updated HDFS-1101: -- Assignee: Konstantin Shvachko > TestDiskError.testLocalDirs() fails > --- > > Key: HDFS-1101 > URL: https://issues.apache.org/jira/browse/HDFS-1101 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.22.0 >Reporter: Konstantin Shvachko >Assignee: Konstantin Shvachko > Fix For: 0.22.0 > > Attachments: TestDiskErrorLocal.patch > > > {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1101) TestDiskError.testLocalDirs() fails
[ https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858742#action_12858742 ] Konstantin Shvachko commented on HDFS-1101: --- Looks like this was introduced by HDFS-997. > TestDiskError.testLocalDirs() fails > --- > > Key: HDFS-1101 > URL: https://issues.apache.org/jira/browse/HDFS-1101 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.22.0 >Reporter: Konstantin Shvachko > Fix For: 0.22.0 > > > {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1101) TestDiskError.testLocalDirs() fails
[ https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shvachko updated HDFS-1101: -- Status: Patch Available (was: Open) > TestDiskError.testLocalDirs() fails > --- > > Key: HDFS-1101 > URL: https://issues.apache.org/jira/browse/HDFS-1101 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.22.0 >Reporter: Konstantin Shvachko >Assignee: Konstantin Shvachko > Fix For: 0.22.0 > > Attachments: TestDiskErrorLocal.patch > > > {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1101) TestDiskError.testLocalDirs() fails
[ https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shvachko updated HDFS-1101: -- Attachment: TestDiskErrorLocal.patch MiniDFSCluster overrides the data-node storage directories while the original config is still pointing to the default values. Therefore the directory cannot be found. The patch fixes the problem and two Java warnings in MiniDFSCluster. > TestDiskError.testLocalDirs() fails > --- > > Key: HDFS-1101 > URL: https://issues.apache.org/jira/browse/HDFS-1101 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.22.0 >Reporter: Konstantin Shvachko > Fix For: 0.22.0 > > Attachments: TestDiskErrorLocal.patch > > > {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
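The failure mode described above can be illustrated with a toy configuration model. This is a hypothetical simulation, not the MiniDFSCluster API: the key name `dfs.data.dir` is real, but the class, paths, and copy-on-construction behavior here are stand-ins to show how a test reading its original config object sees stale directory values.

```java
import java.util.Properties;

// Toy stand-in for MiniDFSCluster: it works on its own copy of the
// configuration and rewrites the datanode storage directories there,
// so the caller's original Properties object keeps the default value
// and any test that reads it looks for directories that don't exist.
class MiniClusterSim {
    private final Properties conf;

    MiniClusterSim(Properties base) {
        conf = (Properties) base.clone();                       // private copy
        conf.setProperty("dfs.data.dir", "/tmp/minidfs/data");  // overridden
    }

    // Tests should read the storage dirs back from the cluster-managed
    // config (or from the datanode itself), not from their original conf.
    String dataDirs() {
        return conf.getProperty("dfs.data.dir");
    }
}
```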
[jira] Updated: (HDFS-1014) Error in reading delegation tokens from edit logs.
[ https://issues.apache.org/jira/browse/HDFS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jakob Homan updated HDFS-1014: -- Hadoop Flags: [Reviewed] > Error in reading delegation tokens from edit logs. > -- > > Key: HDFS-1014 > URL: https://issues.apache.org/jira/browse/HDFS-1014 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jitendra Nath Pandey >Assignee: Jitendra Nath Pandey > Attachments: HDFS-1014-y20.1.patch, HDFS-1014.2.patch, > HDFS-1014.3.patch > > > When delegation tokens are read from the edit logs...same object is used to > read the identifier and is stored in the token cache. This is wrong because > same object is getting updated. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1014) Error in reading delegation tokens from edit logs.
[ https://issues.apache.org/jira/browse/HDFS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jakob Homan updated HDFS-1014: -- Status: Resolved (was: Patch Available) Fix Version/s: 0.22.0 Resolution: Fixed I've just committed this. Resolving as fixed. Thanks for the contribution, Jitendra. > Error in reading delegation tokens from edit logs. > -- > > Key: HDFS-1014 > URL: https://issues.apache.org/jira/browse/HDFS-1014 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jitendra Nath Pandey >Assignee: Jitendra Nath Pandey > Fix For: 0.22.0 > > Attachments: HDFS-1014-y20.1.patch, HDFS-1014.2.patch, > HDFS-1014.3.patch > > > When delegation tokens are read from the edit logs...same object is used to > read the identifier and is stored in the token cache. This is wrong because > same object is getting updated. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858733#action_12858733 ] Konstantin Shvachko commented on HDFS-909: -- Some more: What is {{org.apache.tools.ant.taskdefs.WaitFor}} used for? And there is a blank line change at the end of FSEditLog. > Race condition between rollEditLog or rollFSImage and FSEditsLog.write > operations corrupts edits log > - > > Key: HDFS-909 > URL: https://issues.apache.org/jira/browse/HDFS-909 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0 > Environment: CentOS >Reporter: Cosmin Lehene >Assignee: Todd Lipcon >Priority: Blocker > Fix For: 0.21.0, 0.22.0 > > Attachments: hdfs-909-branch-0.20.txt, hdfs-909-branch-0.21.txt, > hdfs-909-unittest.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, > hdfs-909.txt, hdfs-909.txt, hdfs-909.txt > > > closing the edits log file can race with write to edits log file operation > resulting in OP_INVALID end-of-file marker being initially overwritten by the > concurrent (in setReadyToFlush) threads and then removed twice from the > buffer, losing a good byte from edits log.
> Example: > {code} > FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> > FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush() > FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> > FSEditLog.closeStream() -> EditLogOutputStream.flush() -> > EditLogFileOutputStream.flushAndSync() > OR > FSNameSystem.rollFSImage() -> FSIMage.rollFSImage() -> > FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> > FSEditLog.closeStream() ->EditLogOutputStream.setReadyToFlush() > FSNameSystem.rollFSImage() -> FSIMage.rollFSImage() -> > FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> > FSEditLog.closeStream() ->EditLogOutputStream.flush() -> > EditLogFileOutputStream.flushAndSync() > VERSUS > FSNameSystem.completeFile -> FSEditLog.logSync() -> > EditLogOutputStream.setReadyToFlush() > FSNameSystem.completeFile -> FSEditLog.logSync() -> > EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync() > OR > Any FSEditLog.write > {code} > Access on the edits flush operations is synchronized only in the > FSEdits.logSync() method level. However at a lower level access to > EditsLogOutputStream setReadyToFlush(), flush() or flushAndSync() is NOT > synchronized. These can be called from concurrent threads like in the example > above > So if a rollEditLog or rollFSIMage is happening at the same time with a write > operation it can race for EditLogFileOutputStream.setReadyToFlush that will > overwrite the the last byte (normally the FSEditsLog.OP_INVALID which is the > "end-of-file marker") and then remove it twice (from each thread) in > flushAndSync()! Hence there will be a valid byte missing from the edits log > that leads to a SecondaryNameNode silent failure and a full HDFS failure upon > cluster restart. > We got to this point after investigating a corrupted edits file that made > HDFS unable to start with > {code:title=namenode.log} > java.io.IOException: Incorrect data format. 
logVersion is -20 but > writables.length is 768. > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450 > {code} > EDIT: moved the logs to a comment to make this readable -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
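The buffer-swap pattern behind the trace above can be modeled roughly like this (a hypothetical sketch with invented names, not the actual EditLogFileOutputStream code), showing why the swap in setReadyToFlush() and the subsequent flushAndSync() must be mutually exclusive across threads:

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the double-buffer pattern: writes go to bufCurrent,
// sync() swaps the buffers and pushes the ready buffer to stable storage.
// If two threads interleave the swap and the flush, bytes can be dropped or
// written twice -- the race described in this issue.
class DoubleBufferedLog {
    private final ReentrantLock flushLock = new ReentrantLock();
    private StringBuilder bufCurrent = new StringBuilder(); // receives writes
    private StringBuilder bufReady = new StringBuilder();   // being flushed
    private final StringBuilder persisted = new StringBuilder(); // stands in for the edits file

    void write(String op) {
        synchronized (this) { // writers are serialized against the swap
            bufCurrent.append(op);
        }
    }

    void sync() {
        flushLock.lock(); // one flusher at a time: swap + flush is atomic
        try {
            synchronized (this) { // "setReadyToFlush": swap buffers under the write lock
                StringBuilder tmp = bufReady;
                bufReady = bufCurrent;
                bufCurrent = tmp;
            }
            persisted.append(bufReady); // "flushAndSync": push ready bytes to "disk"
            bufReady.setLength(0);
        } finally {
            flushLock.unlock();
        }
    }

    String contents() {
        return persisted.toString();
    }
}
```

Without flushLock held across both steps, two concurrent syncs could swap twice before either flushes, which is the kind of interleaving the stack traces above describe.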
[jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage ant FSEditsLog.write operations corrupts edits log
[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858731#action_12858731 ] Konstantin Shvachko commented on HDFS-909: -- Todd, I tried to run TestEditLogRace with your 0.20 patch. It runs forever and finally times out. It feels like it does a lot of transactions. Could you please verify? In trunk the same test runs in 42 secs. Also, you have some debug printouts in the patch, like "= CLOSE DONE", and you use FSNamesystem.LOG for logging in the test. The latter is confusing, as you would then expect a message from FSNamesystem while it actually comes from the test.
[jira] Updated: (HDFS-1096) allow dfsadmin/mradmin refresh of superuser proxy group mappings
[ https://issues.apache.org/jira/browse/HDFS-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Boris Shkolnik updated HDFS-1096: - Attachment: HDFS-1096-BP20-7.patch Combined the fixes into one patch for the previous version. > allow dfsadmin/mradmin refresh of superuser proxy group mappings > > > Key: HDFS-1096 > URL: https://issues.apache.org/jira/browse/HDFS-1096 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Boris Shkolnik >Assignee: Boris Shkolnik > Attachments: HDFS-1096-BP20-4.patch, HDFS-1096-BP20-6-fix.patch, > HDFS-1096-BP20-6.patch, HDFS-1096-BP20-7.patch
[jira] Updated: (HDFS-909) Race condition between rollEditLog or rollFSImage ant FSEditsLog.write operations corrupts edits log
[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shvachko updated HDFS-909: - Attachment: hdfs-909-branch-0.21.txt I am not sure to what extent 0.21 is abandoned; I hear people use it with HBase. Here is the patch.
[jira] Commented: (HDFS-1031) Enhance the webUi to list a few of the corrupted files in HDFS
[ https://issues.apache.org/jira/browse/HDFS-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858718#action_12858718 ] André Oriani commented on HDFS-1031: One doubt regarding sorting the output. In my work, l10n is a daily concern, so I pay attention to it when coding. We are going to compare Paths. The Path.compareTo method delegates its implementation to java.net.URI.compareTo, which uses String.compareTo for path elements. That method of the String class does not use collation. So for the program:
{code}
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;

public class PathComparisons {
    public static void main(String[] args) {
        try {
            URI[] paths = {
                new URI("file:///b/a"), new URI("file:///a/c"),
                new URI("file:///a/b"), new URI("file:///b/z"),
                new URI("file:///a/á"), new URI("file:///b/ç"),
                new URI("file:///a/b/c/d/e")};
            Arrays.sort(paths);
            for (URI path : paths) {
                System.out.println(path.toString());
            }
        } catch (URISyntaxException e) {
            e.printStackTrace();
        }
    }
}
{code}
the output is
{noformat}
file:///a/b
file:///a/b/c/d/e
file:///a/c
file:///a/á
file:///b/a
file:///b/z
file:///b/ç
{noformat}
I mean, the character order is {a,b,z,á,ç} instead of the expected {a,á,b,c,ç,z}. Should I handle this or treat it as a minor side effect? > Enhance the webUi to list a few of the corrupted files in HDFS > -- > > Key: HDFS-1031 > URL: https://issues.apache.org/jira/browse/HDFS-1031 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: dhruba borthakur >Assignee: André Oriani > Attachments: hdfs-1031_aoriani.patch, hdfs-1031_aoriani_2.patch, > hdfs-1031_aoriani_3.patch > > > The existing webUI displays something like this: > WARNING : There are about 12 missing blocks. Please check the log or run > fsck. > It would be nice if we can display the filenames that have missing blocks.
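If locale-aware ordering were wanted, the comparison could go through java.text.Collator instead of raw String.compareTo. A minimal sketch (the class and method names here are hypothetical, not part of any patch):

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollatedPaths {
    // Hypothetical helper: sorts path strings with locale-aware collation,
    // so accented characters sort next to their base letters instead of
    // after 'z' as with plain String.compareTo.
    static String[] sortCollated(String[] paths, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY); // 'a' and 'á' compare as near-equal
        String[] sorted = paths.clone();
        Arrays.sort(sorted, collator); // Collator implements Comparator<Object>
        return sorted;
    }

    public static void main(String[] args) {
        String[] paths = {"/b/a", "/a/c", "/a/b", "/a/á", "/b/ç"};
        for (String p : sortCollated(paths, Locale.FRENCH)) {
            System.out.println(p);
        }
    }
}
```

Whether the extra cost of collation is worth it for a diagnostic file listing is exactly the judgment call raised above.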