[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802941#action_12802941 ]
Cosmin Lehene commented on HDFS-909: ------------------------------------ Hi Todd, I haven't check yet, so it's possible to be on 0.20 as well. I forgot to add that the issue particularly nasty because it will first fail silently. In our case, the log was corrupted on 17th of December but we only discovered it yesterday when we restarted HDFS. It can be detected early by monitoring the secondary-namenode.out log file. > Race condition between rollEditLog or rollFSImage ant FSEditsLog.write > operations corrupts edits log > ----------------------------------------------------------------------------------------------------- > > Key: HDFS-909 > URL: https://issues.apache.org/jira/browse/HDFS-909 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node > Affects Versions: 0.21.0, 0.22.0 > Environment: CentOS > Reporter: Cosmin Lehene > Priority: Blocker > Fix For: 0.21.0, 0.22.0 > > > closing the edits log file can race with write to edits log file operation > resulting in OP_INVALID end-of-file marker being initially overwritten by the > concurrent (in setReadyToFlush) threads and then removed twice from the > buffer, losing a good byte from edits log. > Example: > {code} > FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> > FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush() > FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> > FSEditLog.closeStream() -> EditLogOutputStream.flush() -> > EditLogFileOutputStream.flushAndSync() > OR > FSNameSystem.rollFSImage() -> FSIMage.rollFSImage() -> > FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> > FSEditLog.closeStream() ->EditLogOutputStream.setReadyToFlush() > FSNameSystem.rollFSImage() -> FSIMage.rollFSImage() -> > FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> > FSEditLog.closeStream() ->EditLogOutputStream.flush() -> > EditLogFileOutputStream.flushAndSync() > VERSUS > FSNameSystem.completeFile -> FSEditLog.logSync() -> > EditLogOutputStream.setReadyToFlush() > FSNameSystem.completeFile -> FSEditLog.logSync() -> > EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync() > OR > Any FSEditLog.write > {code} > Access on the edits flush operations is synchronized only in the > FSEdits.logSync() method level. However at a lower level access to > EditsLogOutputStream setReadyToFlush(), flush() or flushAndSync() is NOT > synchronized. These can be called from concurrent threads like in the example > above > So if a rollEditLog or rollFSIMage is happening at the same time with a write > operation it can race for EditLogFileOutputStream.setReadyToFlush that will > overwrite the the last byte (normally the FSEditsLog.OP_INVALID which is the > "end-of-file marker") and then remove it twice (from each thread) in > flushAndSync()! Hence there will be a valid byte missing from the edits log > that leads to a SecondaryNameNode silent failure and a full HDFS failure upon > cluster restart. > We got to this point after investigating a corrupted edits file that made > HDFS unable to start with > {code:title=namenode.log} > java.io.IOException: Incorrect data format. logVersion is -20 but > writables.length is 768. > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450 > {code} > In the edits file we found the first 2 entries: > {code:title=edits} > FFFFFFEC0900000005003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F38333238313438373139303730333137323739000133000D31323631303832363331383335000D313236313038323632383039340008363731303838363400000003F6CBB87EF376E3E600000000040000000000000000039665F9549DE069A5735E00000000040000000000000000039665ADCC71A050B16ABF00000000015A179A0000000000039665066861646F6F700A737570657267726F7570010100000003003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F3833323831343837313930373033313732373900352F68626173652F64656D6F5F5F75736572732F3336343035313634362F746573742F36393137333831323838333034343734333836000D3132363130383236333138363902 > ... > {code} > This is the completeFile operation that's missing the last byte > {code:title=completeFile} > FFFFFFEC0900000005003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F38333238313438373139303730333137323739000133000D31323631303832363331383335000D313236313038323632383039340008363731303838363400000003F6CBB87EF376E3E600000000040000000000000000039665F9549DE069A5735E00000000040000000000000000039665ADCC71A050B16ABF00000000015A179A0000000000039665066861646F6F700A737570657267726F757001?? > {code} > followed by a rename operation > {code:Title=rename} > 0100000003003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F3833323831343837313930373033313732373900352F68626173652F64656D6F5F5F75736572732F3336343035313634362F746573742F36393137333831323838333034343734333836000D31323631303832363331383639 > {code} > The first byte of the rename was instead read as part of the completeFile() > operation. This resulted in reading the next operation as 0x00 (OP_ADD) > followed by an int (length) which would have been 0x0000030 which is 768 that > was read and failed in the following code > {code:Title=FSEditLog.java} > case OP_ADD: > case OP_CLOSE: { > // versions > 0 support per file replication > // get name and replication > int length = in.readInt(); > if (-7 == logVersion && length != 3|| > -17 < logVersion && logVersion < -7 && length != 4 || > logVersion <= -17 && length != 5) { > throw new IOException("Incorrect data format." + > " logVersion is " + logVersion + > " but writables.length is " + > length + ". "); > {code} > Filling the missing byte with a value resulted in correct interpretation of > the 0x01 (OP_RENAME) and correct parsing for the rest of the edits file (>1MB) > This theory is also supported by the NameNode log file from the date the > corruption took place: > {code:title=namenode.log} > 2009-12-17 12:43:51,276 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from > 10.72.17.162 > 2009-12-17 12:43:51,338 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.72.17.165:50010 is added to > blk_-480585673051114658_235109{blockUCState=UNDER_CONSTRUCTION, > primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], > ReplicaUnderConstruction[10.72.17.165:50010|RBW], > ReplicaUnderConstruction[10.72.17.167:50010|RBW]]} size 67108864 > 2009-12-17 12:43:51,339 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.72.17.166:50010 is added to > blk_-480585673051114658_235109{blockUCState=UNDER_CONSTRUCTION, > primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], > ReplicaUnderConstruction[10.72.17.165:50010|RBW], > ReplicaUnderConstruction[10.72.17.167:50010|RBW]]} size 67108864 > 2009-12-17 12:43:51,342 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.allocateBlock: > /hbase/demo__users/compaction.dir/364051646/8328148719070317279. > blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION, > primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], > ReplicaUnderConstruction[10.72.17.164:50010|RBW], > ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} > 2009-12-17 12:43:51,352 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.72.17.167:50010 is added to > blk_-480585673051114658_235109{blockUCState=COMMITTED, primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], > ReplicaUnderConstruction[10.72.17.165:50010|RBW], > ReplicaUnderConstruction[10.72.17.167:50010|RBW]]} size 67108864 > 2009-12-17 12:43:51,833 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.72.17.163:50010 is added to > blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION, > primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], > ReplicaUnderConstruction[10.72.17.164:50010|RBW], > ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} size 22681498 > 2009-12-17 12:43:51,834 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.72.17.164:50010 is added to > blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION, > primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], > ReplicaUnderConstruction[10.72.17.164:50010|RBW], > ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} size 22681498 > 2009-12-17 12:43:51,834 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.72.17.166:50010 is added to > blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION, > primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], > ReplicaUnderConstruction[10.72.17.164:50010|RBW], > ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} size 22681498 > 2009-12-17 12:43:51,835 INFO org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.completeFile: file > /hbase/demo__users/compaction.dir/364051646/8328148719070317279 is closed by > DFSClient_-1989779667 > 2009-12-17 12:43:51,835 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll FSImage from > 10.72.17.162 > 2009-12-17 12:43:51,870 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop,hadoop > ip=/10.72.17.166 cmd=rename > src=/hbase/demo__users/compaction.dir/364051646/8328148719070317279 > dst=/hbase/demo__users/364051646/test/6917381288304474386 > perm=hadoop:supergroup:rw-r--r-- > {code} > The last 3 entries show a completeFile operation followed by a rollFSIMage > operation followed by a rename operation. This is where most probably the > race condition took place. > Synchronizing access to EditLogOutputStream could fix the problem, however > other race or deadlocks may still occur. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.