[ 
https://issues.apache.org/jira/browse/HDFS-14941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963371#comment-16963371
 ] 

Chen Liang commented on HDFS-14941:
-----------------------------------

Summarizing the possible fixes I have thought of, at a high level, for the record:
1. Making {{OP_SET_GENSTAMP_V2}} and {{OP_ADD_BLOCK}} a single edit, so it's 
guaranteed they are tailed together. The issue with this is that we can't add 
a new op, since that would be incompatible. We would probably have to, say, 
reuse {{OP_ADD_BLOCK}} to both bump the gen stamp and add the block. But 
compatibility may still be tricky, e.g. an old ANN sends these two edits while 
a new SbN expects a single one.
2. Swapping the order of {{OP_SET_GENSTAMP_V2}} and {{OP_ADD_BLOCK}} (without 
swapping the block-adding logic, only changing the edit log order). This 
ensures that when the gen stamp bumps, the block belonging to this gen stamp 
has already been tailed. The problem with this approach is that the SbN could 
then be tailing a block from a future genstamp, and I'm not sure what the 
implications of that are.
3. Instead of messing with the edits, we could change the guarding logic. What 
I'm thinking is that the SbN keeps track of the most recently tailed gen stamp, 
say X, and the highest gen stamp of the blocks in its own block map, say Y. 
Here Y <= X. Then, if a DN-reported block has a gen stamp *between* Y and X, 
it is possibly the scenario where the block still needs to be tailed, so the 
SbN requeues this message to process later.
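
To make option 3 a bit more concrete, here is a rough Java sketch of the check. 
It is an illustration only: the class and method names below are made up and 
are not existing HDFS APIs. It assumes "between" means Y < g <= X, and that a 
gen stamp above X is still caught by the existing future-genstamp check.

```java
/** Hypothetical sketch of option 3; none of these names exist in HDFS. */
class GenStampGuardSketch {
  /** X: most recent gen stamp tailed from the edit log. */
  private long lastTailedGenStamp;
  /** Y: highest gen stamp among blocks currently in the block map. */
  private long highestBlockMapGenStamp;

  /** Called when the SbN tails a gen stamp bump. */
  void onGenStampTailed(long genStamp) {
    lastTailedGenStamp = Math.max(lastTailedGenStamp, genStamp);
  }

  /** Called when the SbN tails an edit that adds a block. */
  void onBlockAdded(long blockGenStamp) {
    highestBlockMapGenStamp = Math.max(highestBlockMapGenStamp, blockGenStamp);
  }

  /** Mirrors the existing guard: the gen stamp has not been tailed yet. */
  boolean isGenStampInFuture(long g) {
    return g > lastTailedGenStamp;
  }

  /** New guard: gen stamp was tailed, but no block that new was added yet. */
  boolean mayBeAwaitingAddBlock(long g) {
    return g > highestBlockMapGenStamp && g <= lastTailedGenStamp;
  }

  /** A reported block is requeued if either guard fires. */
  boolean shouldRequeue(long g) {
    return isGenStampInFuture(g) || mayBeAwaitingAddBlock(g);
  }

  public static void main(String[] args) {
    GenStampGuardSketch guard = new GenStampGuardSketch();
    guard.onGenStampTailed(1001);
    guard.onBlockAdded(1001);
    guard.onGenStampTailed(1002);                   // add-block edit not tailed yet
    System.out.println(guard.shouldRequeue(1002));  // true: Y=1001 < 1002 <= X=1002
    guard.onBlockAdded(1002);                       // second edit arrives
    System.out.println(guard.shouldRequeue(1002));  // false: Y has caught up
  }
}
```

One thing this sketch does not answer is how long such a report should stay 
queued if no matching {{OP_ADD_BLOCK}} ever arrives.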

It would be great to get more eyes on this. I'm open to comments on the 
options (and to more options!).

> Potential editlog race condition can cause corrupted file
> ---------------------------------------------------------
>
>                 Key: HDFS-14941
>                 URL: https://issues.apache.org/jira/browse/HDFS-14941
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Chen Liang
>            Assignee: Chen Liang
>            Priority: Major
>
> Recently we encountered an issue where, after a failover, the NameNode 
> complains about corrupted files/missing blocks. The blocks did recover after 
> full block reports, so the blocks are not actually missing. After further 
> investigation, we believe this is what happened:
> First of all, it is possible for the SbN to receive block reports before the 
> corresponding edit tailing has happened. In that case the SbN postpones 
> processing the DN block report, which is handled by the guarding logic below:
> {code:java}
>       if (shouldPostponeBlocksFromFuture &&
>           namesystem.isGenStampInFuture(iblk)) {
>         queueReportedBlock(storageInfo, iblk, reportedState,
>             QUEUE_REASON_FUTURE_GENSTAMP);
>         continue;
>       }
> {code}
> Basically, if a reported block has a future generation stamp, the DN report 
> gets requeued.
> However, in {{FSNamesystem#storeAllocatedBlock}}, we have the following code:
> {code:java}
>       // allocate new block, record block locations in INode.
>       newBlock = createNewBlock();
>       INodesInPath inodesInPath = INodesInPath.fromINode(pendingFile);
>       saveAllocatedBlock(src, inodesInPath, newBlock, targets);
>       persistNewBlock(src, pendingFile);
>       offset = pendingFile.computeFileSize();
> {code}
> The line
>  {{newBlock = createNewBlock();}}
>  would log an edit entry {{OP_SET_GENSTAMP_V2}} to bump the generation stamp 
> on the Standby,
>  while the following line
>  {{persistNewBlock(src, pendingFile);}}
>  would log another edit entry {{OP_ADD_BLOCK}} to actually add the block on 
> the Standby.
> The race condition is then as follows: imagine the Standby has just processed 
> {{OP_SET_GENSTAMP_V2}}, but not yet {{OP_ADD_BLOCK}} (if they just happen to 
> be in different segments). Now a block report with the new generation stamp 
> comes in.
> Since the genstamp bump has already been processed, the reported block may 
> not be considered a future block, so the guarding logic passes. But the block 
> hasn't actually been added to the blockmap, because the second edit is yet to 
> be tailed. So the block then gets added to the invalidate block list, and we 
> saw messages like:
> {code:java}
> BLOCK* addBlock: block XXX on node XXX size XXX does not belong to any file
> {code}
> Even worse, since this IBR is effectively lost, the NameNode has no 
> information about this block until the next full block report. So after a 
> failover, the NN marks it as corrupt.
> This issue won't happen, though, if both edit entries get tailed together, so 
> that no IBR processing can happen in between. But in our case, we set the 
> edit tailing interval very low (to allow Standby reads), so under high 
> workload there is a much higher chance that the two entries are tailed 
> separately, causing the issue.
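
The interleaving described in the issue can be reproduced with a toy model. 
The Java sketch below is a simplification, not NameNode code: the class, its 
methods, and the {{Outcome}} values are hypothetical, and a block is identified 
only by its generation stamp to keep the example short.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the Standby; hypothetical names, not actual NameNode code. */
class StandbyRaceSketch {
  enum Outcome { QUEUED, ACCEPTED, INVALIDATED }

  /** Current gen stamp on the Standby, bumped by the first edit. */
  private long genStamp;
  /** Gen stamps of blocks known to the Standby, filled by the second edit. */
  private final Set<Long> blockMap = new HashSet<>();

  /** Models tailing OP_SET_GENSTAMP_V2. */
  void applySetGenStamp(long newGenStamp) {
    genStamp = newGenStamp;
  }

  /** Models tailing OP_ADD_BLOCK. */
  void applyAddBlock(long blockGenStamp) {
    blockMap.add(blockGenStamp);
  }

  /** Models the guard: only a future gen stamp causes a requeue. */
  Outcome processReport(long reportedGenStamp) {
    if (reportedGenStamp > genStamp) {
      return Outcome.QUEUED;
    }
    if (blockMap.contains(reportedGenStamp)) {
      return Outcome.ACCEPTED;
    }
    // Guard passed but the block is unknown: "does not belong to any file".
    return Outcome.INVALIDATED;
  }

  public static void main(String[] args) {
    StandbyRaceSketch sbn = new StandbyRaceSketch();
    sbn.applySetGenStamp(1000);
    System.out.println(sbn.processReport(1001)); // QUEUED: neither edit tailed
    sbn.applySetGenStamp(1001);
    System.out.println(sbn.processReport(1001)); // INVALIDATED: the race window
    sbn.applyAddBlock(1001);
    System.out.println(sbn.processReport(1001)); // ACCEPTED: both edits tailed
  }
}
```

The middle call is the problematic case: the report arrives after the gen 
stamp bump but before the block is added, so the guard does not requeue it.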



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
