[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828428#action_12828428 ]
Todd Lipcon commented on HDFS-909:
----------------------------------

bq. Maybe we should synchronize on the EditLogFileOutputStream level, then flushAndSync() and setReadyToFlush() will not interfere with each other?

If we do that, isn't there a worry that each stream could flush at a slightly different point? If we do HDFS-903 and add some kind of checksumming/verification to the edit logs (and we should), then we rely on them rolling together. (Maybe I'm misunderstanding the locking you're suggesting?)

bq. In TestEditLogRace you call FSEditLog methods directly avoiding the FSNamesystem layer. Would it be better to involve the FSNamesystem lock into the picture?

I don't think the namesystem lock really matters, since all the write operations are separated into an internalFoo() which does the work while synchronized, followed by the logSync call outside the lock. I was trying to keep the test as close to a "unit test" as possible, though I did have to use a minicluster anyway.

bq. I fear there is a similar race here, which it does not solve...

Let me add a unit test for this one as well and see what I come up with. I had forgotten about saveNamespace. Thanks.

> Race condition between rollEditLog or rollFSImage and FSEditLog.write operations corrupts edits log
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-909
>                 URL: https://issues.apache.org/jira/browse/HDFS-909
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
>         Environment: CentOS
>            Reporter: Cosmin Lehene
>            Assignee: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hdfs-909-unittest.txt, hdfs-909.txt
>
>
> Closing the edits log file can race with a write to the edits log file: the OP_INVALID end-of-file marker is first overwritten by the concurrent threads (in setReadyToFlush) and then removed twice from the buffer, losing a good byte from the edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> Any FSEditLog.write
> {code}
> Access to the edits flush operations is synchronized only at the FSEditLog.logSync() method level. At a lower level, however, access to EditLogOutputStream.setReadyToFlush(), flush() and flushAndSync() is NOT synchronized. These can be called from concurrent threads, as in the example above.
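> A minimal sketch of the double-buffered stream behaviour described above (class, field and method names are simplified and illustrative; this is not the actual Hadoop source). setReadyToFlush() appends the end-of-file marker and swaps the two buffers; flushAndSync() writes the ready buffer and then steps the file position back one byte so the next flush overwrites the marker:
> {code:title=sketch}
> import java.io.ByteArrayOutputStream;
> import java.io.IOException;
> import java.io.RandomAccessFile;
>
> class SketchEditLogStream {
>     static final byte OP_INVALID = -1;                 // end-of-file marker
>
>     private ByteArrayOutputStream bufCurrent = new ByteArrayOutputStream();
>     private ByteArrayOutputStream bufReady   = new ByteArrayOutputStream();
>     private final RandomAccessFile file;
>
>     SketchEditLogStream(RandomAccessFile file) { this.file = file; }
>
>     // Writes are assumed to happen under a lock; the flush pair below is the racy part.
>     void write(byte op) { bufCurrent.write(op); }
>
>     // Appends the end-of-file marker and swaps the two buffers.
>     void setReadyToFlush() {
>         bufCurrent.write(OP_INVALID);
>         ByteArrayOutputStream tmp = bufReady;
>         bufReady = bufCurrent;
>         bufCurrent = tmp;
>     }
>
>     // Writes the ready buffer to disk, then steps the file position back
>     // one byte so the next flush overwrites the end-of-file marker.
>     void flushAndSync() throws IOException {
>         file.write(bufReady.toByteArray());
>         bufReady.reset();
>         file.seek(file.getFilePointer() - 1);
>     }
> }
> {code}
> If closeStream() (from rollEditLog/rollFSImage) and logSync() interleave these two methods without a shared lock, the end-of-file marker can be overwritten during the concurrent buffer swaps while both threads still perform the one-byte step-back, so a real edit byte is dropped; this matches the missing byte in the completeFile record shown below. Making the pair synchronized on the stream (as in the quoted suggestion) or holding one higher-level lock across both calls would close that window; the trade-off discussed in the comment above is keeping all streams rolling at the same point.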
> So if a rollEditLog or rollFSImage happens at the same time as a write operation, the two threads race in EditLogFileOutputStream.setReadyToFlush(): the last byte (normally FSEditLog.OP_INVALID, the "end-of-file marker") is overwritten and then removed twice (once from each thread) in flushAndSync()! Hence a valid byte is missing from the edits log, which leads to a silent SecondaryNameNode failure and a full HDFS failure upon cluster restart.
> We got to this point after investigating a corrupted edits file that made HDFS unable to start with:
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but writables.length is 768.
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450)
> {code}
> In the edits file we found the first two entries:
> {code:title=edits}
> FFFFFFEC0900000005003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F38333238313438373139303730333137323739000133000D31323631303832363331383335000D313236313038323632383039340008363731303838363400000003F6CBB87EF376E3E600000000040000000000000000039665F9549DE069A5735E00000000040000000000000000039665ADCC71A050B16ABF00000000015A179A0000000000039665066861646F6F700A737570657267726F7570010100000003003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F3833323831343837313930373033313732373900352F68626173652F64656D6F5F5F75736572732F3336343035313634362F746573742F36393137333831323838333034343734333836000D3132363130383236333138363902
> ...
> {code}
> This is the completeFile operation, which is missing its last byte:
> {code:title=completeFile}
> FFFFFFEC0900000005003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F38333238313438373139303730333137323739000133000D31323631303832363331383335000D313236313038323632383039340008363731303838363400000003F6CBB87EF376E3E600000000040000000000000000039665F9549DE069A5735E00000000040000000000000000039665ADCC71A050B16ABF00000000015A179A0000000000039665066861646F6F700A737570657267726F757001??
> {code}
> It is followed by a rename operation:
> {code:title=rename}
> 0100000003003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F3833323831343837313930373033313732373900352F68626173652F64656D6F5F5F75736572732F3336343035313634362F746573742F36393137333831323838333034343734333836000D31323631303832363331383639
> {code}
> The first byte of the rename was instead read as part of the completeFile() operation. As a result, the next operation was read as 0x00 (OP_ADD), followed by an int length of 0x00000300, i.e. 768, which failed the check in the following code:
> {code:title=FSEditLog.java}
> case OP_ADD:
> case OP_CLOSE: {
>   // versions > 0 support per file replication
>   // get name and replication
>   int length = in.readInt();
>   if (-7 == logVersion && length != 3 ||
>       -17 < logVersion && logVersion < -7 && length != 4 ||
>       logVersion <= -17 && length != 5) {
>     throw new IOException("Incorrect data format." +
>                           " logVersion is " + logVersion +
>                           " but writables.length is " +
>                           length + ". ");
> {code}
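> Purely for illustration (this snippet is not part of the NameNode code), the misparse can be reproduced with the first few bytes of the rename record above: once the leading 0x01 has been consumed by the truncated completeFile record, a DataInputStream sees opcode 0x00 and then reads the length as 0x00000300 = 768:
> {code:title=OneByteShiftDemo.java}
> import java.io.ByteArrayInputStream;
> import java.io.DataInputStream;
> import java.io.IOException;
>
> public class OneByteShiftDemo {
>     public static void main(String[] args) throws IOException {
>         // Start of the rename record from the hex dump above:
>         // opcode 0x01 (OP_RENAME), writable count 0x00000003, then the first path string begins.
>         byte[] rename = {0x01, 0x00, 0x00, 0x00, 0x03, 0x00, 0x3F, 0x2F, 0x68};
>
>         // Correct alignment: opcode 1, length 3.
>         DataInputStream in = new DataInputStream(new ByteArrayInputStream(rename));
>         System.out.println("opcode=" + in.readByte() + " length=" + in.readInt());
>
>         // Previous record one byte short: the 0x01 has already been consumed,
>         // so the loader sees opcode 0 (OP_ADD) and length 0x00000300 = 768.
>         DataInputStream shifted =
>             new DataInputStream(new ByteArrayInputStream(rename, 1, rename.length - 1));
>         System.out.println("opcode=" + shifted.readByte() + " length=" + shifted.readInt());
>     }
> }
> {code}
> For logVersion -20 the check above requires length == 5, so the bogus length of 768 triggers exactly the IOException shown in the namenode.log excerpt.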
"); > {code} > Filling the missing byte with a value resulted in correct interpretation of > the 0x01 (OP_RENAME) and correct parsing for the rest of the edits file (>1MB) > This theory is also supported by the NameNode log file from the date the > corruption took place: > {code:title=namenode.log} > 2009-12-17 12:43:51,276 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from > 10.72.17.162 > 2009-12-17 12:43:51,338 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.72.17.165:50010 is added to > blk_-480585673051114658_235109{blockUCState=UNDER_CONSTRUCTION, > primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], > ReplicaUnderConstruction[10.72.17.165:50010|RBW], > ReplicaUnderConstruction[10.72.17.167:50010|RBW]]} size 67108864 > 2009-12-17 12:43:51,339 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.72.17.166:50010 is added to > blk_-480585673051114658_235109{blockUCState=UNDER_CONSTRUCTION, > primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], > ReplicaUnderConstruction[10.72.17.165:50010|RBW], > ReplicaUnderConstruction[10.72.17.167:50010|RBW]]} size 67108864 > 2009-12-17 12:43:51,342 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.allocateBlock: > /hbase/demo__users/compaction.dir/364051646/8328148719070317279. > blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION, > primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], > ReplicaUnderConstruction[10.72.17.164:50010|RBW], > ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} > 2009-12-17 12:43:51,352 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.72.17.167:50010 is added to > blk_-480585673051114658_235109{blockUCState=COMMITTED, primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], > ReplicaUnderConstruction[10.72.17.165:50010|RBW], > ReplicaUnderConstruction[10.72.17.167:50010|RBW]]} size 67108864 > 2009-12-17 12:43:51,833 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.72.17.163:50010 is added to > blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION, > primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], > ReplicaUnderConstruction[10.72.17.164:50010|RBW], > ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} size 22681498 > 2009-12-17 12:43:51,834 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.72.17.164:50010 is added to > blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION, > primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], > ReplicaUnderConstruction[10.72.17.164:50010|RBW], > ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} size 22681498 > 2009-12-17 12:43:51,834 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.72.17.166:50010 is added to > blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION, > primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], > ReplicaUnderConstruction[10.72.17.164:50010|RBW], > ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} size 22681498 > 2009-12-17 12:43:51,835 INFO org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.completeFile: file > /hbase/demo__users/compaction.dir/364051646/8328148719070317279 is closed by > 
> 2009-12-17 12:43:51,835 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /hbase/demo__users/compaction.dir/364051646/8328148719070317279 is closed by DFSClient_-1989779667
> 2009-12-17 12:43:51,835 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll FSImage from 10.72.17.162
> 2009-12-17 12:43:51,870 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop,hadoop ip=/10.72.17.166 cmd=rename src=/hbase/demo__users/compaction.dir/364051646/8328148719070317279 dst=/hbase/demo__users/364051646/test/6917381288304474386 perm=hadoop:supergroup:rw-r--r--
> {code}
> The last three entries show a completeFile operation, followed by a rollFSImage operation, followed by a rename operation. This is most probably where the race condition took place.
> Synchronizing access to EditLogOutputStream could fix the problem, but other races or deadlocks may still occur.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.