[jira] [Comment Edited] (HDFS-14941) Potential editlog race condition can cause corrupted file

2020-10-19 Thread Kihwal Lee (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140608#comment-17140608
 ] 

Kihwal Lee edited comment on HDFS-14941 at 10/19/20, 9:17 PM:
--

Filed HDFS-15421 with more details.


was (Author: kihwal):
Filed HDFS-1542 with more details.

> Potential editlog race condition can cause corrupted file
> -
>
> Key: HDFS-14941
> URL: https://issues.apache.org/jira/browse/HDFS-14941
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
>  Labels: ha
> Fix For: 3.3.0, 3.2.2, 2.10.1, 3.1.5
>
> Attachments: HDFS-14941.001.patch, HDFS-14941.002.patch, 
> HDFS-14941.003.patch, HDFS-14941.004.patch, HDFS-14941.005.patch, 
> HDFS-14941.006.patch
>
>
> Recently we encountered an issue that, after a failover, NameNode complains 
> corrupted file/missing blocks. The blocks did recover after full block 
> reports, so the blocks are not actually missing. After further investigation, 
> we believe this is what happened:
> First of all, it is possible for the SbN to receive block reports before the 
> corresponding edits have been tailed. In that case the SbN postpones processing 
> the DN block report, handled by the guarding logic below:
> {code:java}
>   if (shouldPostponeBlocksFromFuture &&
>   namesystem.isGenStampInFuture(iblk)) {
> queueReportedBlock(storageInfo, iblk, reportedState,
> QUEUE_REASON_FUTURE_GENSTAMP);
> continue;
>   }
> {code}
> Basically, if the reported block has a future generation stamp, the DN report 
> gets requeued.
> However, in {{FSNamesystem#storeAllocatedBlock}}, we have the following code:
> {code:java}
>   // allocate new block, record block locations in INode.
>   newBlock = createNewBlock();
>   INodesInPath inodesInPath = INodesInPath.fromINode(pendingFile);
>   saveAllocatedBlock(src, inodesInPath, newBlock, targets);
>   persistNewBlock(src, pendingFile);
>   offset = pendingFile.computeFileSize();
> {code}
> The line
>  {{newBlock = createNewBlock();}}
>  would log an edit entry {{OP_SET_GENSTAMP_V2}} to bump the generation stamp on 
> the Standby,
>  while the following line
>  {{persistNewBlock(src, pendingFile);}}
>  would log another edit entry {{OP_ADD_BLOCK}} to actually add the block on 
> the Standby.
> The race condition is this: imagine the Standby has just processed 
> {{OP_SET_GENSTAMP_V2}} but not yet {{OP_ADD_BLOCK}} (they may happen to be in 
> different segments). Now a block report with the new generation stamp comes 
> in.
> Since the genstamp bump has already been processed, the reported block may 
> not be considered a future block, so the guarding logic passes. But the block 
> has not actually been added to the block map, because the second edit is yet 
> to be tailed. The block then gets added to the invalidate list, and we saw 
> messages like:
> {code:java}
> BLOCK* addBlock: block XXX on node XXX size XXX does not belong to any file
> {code}
> Even worse, since this IBR is effectively lost, the NameNode has no 
> information about this block until the next full block report. So after a 
> failover, the NN marks it as corrupt.
> This issue won't happen, though, if both edit entries get tailed together, so 
> that no IBR processing can happen in between. But in our case we set the edit 
> tailing interval very low (to allow Standby reads), so under high workload 
> there is a much higher chance that the two entries are tailed separately, 
> causing the issue.
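The sequence above can be condensed into a small, self-contained simulation. All class and field names here are illustrative (a toy model, not the actual NameNode code): the standby tails {{OP_SET_GENSTAMP_V2}}, an IBR slips in before {{OP_ADD_BLOCK}} is tailed, the future-genstamp guard passes, and the replica is invalidated.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Toy simulation of the race described above. Names are illustrative;
// this is not the actual NameNode code.
public class GenStampRaceDemo {
    long globalGenStamp = 100;              // standby's global generation stamp
    Set<Long> blockMap = new HashSet<>();   // blocks added via tailed OP_ADD_BLOCK
    Queue<Long> postponed = new ArrayDeque<>();
    Set<Long> invalidated = new HashSet<>();

    // Models the guard quoted above: postpone reports carrying a "future"
    // genstamp; otherwise, a block missing from the block map is invalidated.
    void processReport(long blockGenStamp) {
        if (blockGenStamp > globalGenStamp) {
            postponed.add(blockGenStamp);   // QUEUE_REASON_FUTURE_GENSTAMP
        } else if (!blockMap.contains(blockGenStamp)) {
            invalidated.add(blockGenStamp); // "does not belong to any file"
        }
    }

    public static void main(String[] args) {
        GenStampRaceDemo sbn = new GenStampRaceDemo();
        // OP_SET_GENSTAMP_V2 is tailed first, bumping the global genstamp...
        sbn.globalGenStamp = 101;
        // ...and an IBR for the new block arrives before OP_ADD_BLOCK is tailed.
        sbn.processReport(101);
        // The guard passes (101 is no longer "future"), so the replica is lost.
        System.out.println("postponed=" + sbn.postponed
            + " invalidated=" + sbn.invalidated);
        // prints: postponed=[] invalidated=[101]
    }
}
```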



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14941) Potential editlog race condition can cause corrupted file

2019-11-05 Thread Konstantin Shvachko (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967888#comment-16967888
 ] 

Konstantin Shvachko edited comment on HDFS-14941 at 11/5/19 9:40 PM:
-

Small thing: there is still one "cached" genstamp remaining; it should be "impending".
{code}
   * Set the current genstamp to the impending genstamp.
{code}
We don't need a Jenkins build for that, since it is a comment-only change.


was (Author: shv):
Small thing. Still one "cached" genstamp remaining, should be "impending".
{code}
   * Set the current genstamp to the impending genstamp.
{code}

> Potential editlog race condition can cause corrupted file
> -
>
> Key: HDFS-14941
> URL: https://issues.apache.org/jira/browse/HDFS-14941
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
>  Labels: ha
> Attachments: HDFS-14941.001.patch, HDFS-14941.002.patch, 
> HDFS-14941.003.patch, HDFS-14941.004.patch
>
>






[jira] [Comment Edited] (HDFS-14941) Potential editlog race condition can cause corrupted file

2019-11-04 Thread Konstantin Shvachko (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967129#comment-16967129
 ] 

Konstantin Shvachko edited comment on HDFS-14941 at 11/5/19 1:14 AM:
-

Hey [~vagarychen], the fix looks good, great find. We should never allow reuse 
of gen stamps, even after failover.
I suggest we use {{impendingGenStamp}} instead of {{cachedGenStamp}}. You will 
also need to change the related method names. I propose the following version 
of your comment:
{code}
  /**
   * Most recent global generation stamp as seen on Active NameNode.
   * Used by StandbyNode only.
   * StandbyNode does not update its global {@link #generationStamp} during
   * edits tailing. The global generation stamp on StandbyNode is updated
   * when the block with the next generation stamp is actually received;
   * during fail-over it is bumped to the last value received from the
   * Active NN through edits and stored as {@link #impendingGenStamp}.
   * The former helps to avoid a race condition with IBRs during edits tailing.
   * The latter guarantees that generation stamps are never reused by a new Active
   * after fail-over.
   * See HDFS-14941 for more details.
   */
  private final GenerationStamp impendingGenStamp = new GenerationStamp();
{code}
We also should update this javadoc:
{code}
  /**
   * This operation does not update global generation stamp during edits
   * tailing. The global generation stamp on StandbyNode is updated
   * 1. when the block with the next generation stamp is actually received,
   * 2. during fail-over it is bumped to the last value seen from active.
   */
  static class SetGenstampV2Op extends FSEditLogOp {
{code}


was (Author: shv):
Hey [~vagarychen] the fix looks good, great find. We should never allow reuse 
of gen stamps even after failover.
I suggest we use {{impendingGenStamp}} instead of {{cachedGenStamp}}. You will 
also change the related method names. I propose the following edition of your 
comment
{code}
  /**
   * Most recent global generation stamp as seen on Active NameNode.
   * Used by StandbyNode only.
   * StandbyNode does not update its global {@link #generationStamp} during
   * edits tailing. The global generation stamp on StandbyNode is updated
   * when the block with the next generation stamp is actually
   * received
   * during fail-over it is bumped to the last value received from the
   * Active NN through edits and stored as {@link #impendingGenStamp}
   * The former helps to avoid a race condition with IBRs during edits tailing.
   * The latter guarantees that generation stamps are never reused by new Active
   * after fail-over.
   *  See HDFS-14941 for more details.
   */
  private final GenerationStamp impendingGenStamp = new GenerationStamp();
{code}
We also need to update this javadoc:
{code}
  /**
   * This operation does not update global generation stamp during edits
   * tailing. The global generation stamp on StandbyNode is updated
   * 1. when the block with the next generation stamp is actually received,
   * 2. during fail-over it is bumped to the last value seen from active.
   */
  static class SetGenstampV2Op extends FSEditLogOp {
{code}

> Potential editlog race condition can cause corrupted file
> -
>
> Key: HDFS-14941
> URL: https://issues.apache.org/jira/browse/HDFS-14941
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
>  Labels: ha
> Attachments: HDFS-14941.001.patch, HDFS-14941.002.patch
>
>

[jira] [Comment Edited] (HDFS-14941) Potential editlog race condition can cause corrupted file

2019-11-04 Thread Konstantin Shvachko (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967129#comment-16967129
 ] 

Konstantin Shvachko edited comment on HDFS-14941 at 11/5/19 1:14 AM:
-

Hey [~vagarychen], the fix looks good, great find. We should never allow reuse 
of gen stamps, even after failover.
I suggest we use {{impendingGenStamp}} instead of {{cachedGenStamp}}. You will 
also need to change the related method names. I propose the following version 
of your comment:
{code}
  /**
   * Most recent global generation stamp as seen on Active NameNode.
   * Used by StandbyNode only.
   * StandbyNode does not update its global {@link #generationStamp} during
   * edits tailing. The global generation stamp on StandbyNode is updated
   * when the block with the next generation stamp is actually
   * received
   * during fail-over it is bumped to the last value received from the
   * Active NN through edits and stored as {@link #impendingGenStamp}
   * The former helps to avoid a race condition with IBRs during edits tailing.
   * The latter guarantees that generation stamps are never reused by new Active
   * after fail-over.
   *  See HDFS-14941 for more details.
   */
  private final GenerationStamp impendingGenStamp = new GenerationStamp();
{code}
We also need to update this javadoc:
{code}
  /**
   * This operation does not update global generation stamp during edits
   * tailing. The global generation stamp on StandbyNode is updated
   * 1. when the block with the next generation stamp is actually received,
   * 2. during fail-over it is bumped to the last value seen from active.
   */
  static class SetGenstampV2Op extends FSEditLogOp {
{code}


was (Author: shv):
Hey [~vagarychen] the fix looks good, good find. We should never reuse gen 
stamps even after failover.
I suggest we use {{impendingGenStamp}} instead of {{cachedGenStamp}}. You will 
also change the related method names. I propose the following edition of your 
comment
{code}
  /**
   * Most recent global generation stamp as seen on Active NameNode.
   * Used by StandbyNode only.
   * StandbyNode does not update its global {@link #generationStamp} during
   * edits tailing. The global generation stamp on StandbyNode is updated
   * when the block with the next generation stamp is actually
   * received
   * during fail-over it is bumped to the last value received from the
   * Active NN through edits and stored as {@link #impendingGenStamp}
   * The former helps to avoid a race condition with IBRs during edits tailing.
   * The latter guarantees that generation stamps are never reused by new Active
   * after fail-over.
   *  See HDFS-14941 for more details.
   */
  private final GenerationStamp impendingGenStamp = new GenerationStamp();
{code}
We also need to update this javadoc:
{code}
  /**
   * This operation does not update global generation stamp during edits
   * tailing. The global generation stamp on StandbyNode is updated
   * 1. when the block with the next generation stamp is actually received,
   * 2. during fail-over it is bumped to the last value seen from active.
   */
  static class SetGenstampV2Op extends FSEditLogOp {
{code}

> Potential editlog race condition can cause corrupted file
> -
>
> Key: HDFS-14941
> URL: https://issues.apache.org/jira/browse/HDFS-14941
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
>  Labels: ha
> Attachments: HDFS-14941.001.patch, HDFS-14941.002.patch
>
>

[jira] [Comment Edited] (HDFS-14941) Potential editlog race condition can cause corrupted file

2019-11-03 Thread Konstantin Shvachko (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16966291#comment-16966291
 ] 

Konstantin Shvachko edited comment on HDFS-14941 at 11/3/19 11:08 PM:
--

Attached a patch which fixes the problem. It also provides a unit test to 
reproduce the race condition. This is based on [~vagarychen]'s original patch.
Also wanted to mention that {{updateBlockForPipeline()}} does not update the 
block's gen stamp on either the Active or the Standby. It is followed by 
{{updatePipeline()}}, which changes the gen stamp, so there is no race here.


was (Author: shv):
Attached a patch, which fixes the problem. Also provides a unit tests to 
reproduce the race condition. This is based on [~vagarychen]'s original patch.

> Potential editlog race condition can cause corrupted file
> -
>
> Key: HDFS-14941
> URL: https://issues.apache.org/jira/browse/HDFS-14941
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
>  Labels: ha
> Attachments: HDFS-14941.001.patch
>
>






[jira] [Comment Edited] (HDFS-14941) Potential editlog race condition can cause corrupted file

2019-10-30 Thread Konstantin Shvachko (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963518#comment-16963518
 ] 

Konstantin Shvachko edited comment on HDFS-14941 at 10/30/19 11:49 PM:
---

Great find, [~vagarychen].
 Yes, this is an _*existing problem in the current code*_, but it is exacerbated 
by fast journal tailing, since a small tail-edits.period increases the 
probability of the race condition. In fact, we occasionally see missing blocks 
after failing over to the SBN on our clusters with pre-CRS code.

I want to suggest another _*possible solution 4*_:
 4. Delay incrementing the global {{generationStamp}} on the standby until it 
actually sees the block with that stamp. That is, when {{OP_SET_GENSTAMP_V2}} 
comes to the SBN, it records the new value in a new variable 
{{lastReportedGenStamp}}. When an {{OP_ADD_BLOCK}} with a genStamp that equals 
{{lastReportedGenStamp}} comes, the global {{generationStamp}} is set to 
{{lastReportedGenStamp}}. This should also solve the race condition.
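A minimal sketch of this idea, with illustrative names only (a hypothetical model of the proposal, not the actual patch): the standby records the tailed genstamp separately and advances its global genstamp only once a block carrying that stamp is actually added.

```java
// Hypothetical sketch of "possible solution 4": record the tailed genstamp
// separately and advance the global one only when a block carrying it is seen.
// Names are illustrative, not the actual NameNode code.
public class DelayedGenStampDemo {
    long generationStamp = 100;      // global genstamp consulted by the IBR guard
    long lastReportedGenStamp = 100; // last value tailed via OP_SET_GENSTAMP_V2

    void onSetGenStampV2(long gs) {
        lastReportedGenStamp = gs;   // do NOT bump the global genstamp yet
    }

    void onAddBlock(long blockGs) {
        if (blockGs == lastReportedGenStamp) {
            generationStamp = lastReportedGenStamp; // block seen: now advance
        }
    }

    boolean isGenStampInFuture(long gs) {
        return gs > generationStamp; // IBRs between the two edits stay postponed
    }

    public static void main(String[] args) {
        DelayedGenStampDemo sbn = new DelayedGenStampDemo();
        sbn.onSetGenStampV2(101);
        System.out.println(sbn.isGenStampInFuture(101)); // true: IBR is queued
        sbn.onAddBlock(101);
        System.out.println(sbn.isGenStampInFuture(101)); // false: safe to process
    }
}
```

An IBR arriving between {{OP_SET_GENSTAMP_V2}} and {{OP_ADD_BLOCK}} would still see a "future" genstamp and be postponed, closing the window described in the issue.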

We were looking at {{updateBlockForPipeline()}} and found that it could be 
_*another source of missing blocks on the SBN*_, because it only increments the 
global {{generationStamp}} but does not update the generation stamp of the 
block itself. The new gen stamp will eventually be applied by a subsequent 
{{OP_ADD_BLOCK}}, {{OP_ADD}}, {{OP_CLOSE}}, or {{OP_UPDATE_BLOCKS}}, but the 
race condition with IBRs is still present. If an IBR comes in after the global 
genStamp is incremented, the replica will not be queued as a future one. 
Instead, since the block's genStamp has not yet been updated by the subsequent 
{{OP_ADD_BLOCK}} or such, the replica will be invalidated on the SBN.


was (Author: shv):
Great find [~vagarychen].
 Yes this is _*existing problem in current code*_, but it is exacerbated by 
fast journal tailing since  small tail-edits.period increases the probability 
of the race condition. In fact we occasionally see missing blocks after failing 
over to SBN on our clusters with pre-CRS code.

Want to suggest another _*possible solution 4*_.
 4. Delay incrementing the global {{generationStamp}} on standby until it 
actually sees the block with that stamp. That is, when {{OP_SET_GENSTAMP_V2}} 
comes to SBN it records the new value in a new variable 
{{lastReportedGenStamp}}. When {{OP_ADD_BLOCK}} with the genStamp that equals 
{{lastReportedGenStamp}} comes the global {{generationStamp}} is set to 
{{lastReportedGenStamp}}. This should also solve the race condition.

We were looking at {{updateBlockForPipeline()}} and found out that it could be 
_*another source of missing blocks on SBN*_, because it only increments the 
global {{generationStamp}}, but does not update the generation stamp of that 
block. The new gen stamp will be eventually updated by subsequent {{OP_ADD}}, 
or {{OP_CLOSE}}, or {{OP_UPDATE_BLOCKS}}. But the race condition with IBRs will 
be still present. If an IBR comes after incrementing the global genStamp the 
replica will not be queued as the future one. Instead since the block genStamp 
has not been yet updated by the subsequent {{OP_ADD}} or such, this replica 
will be invalidated on SBN.

> Potential editlog race condition can cause corrupted file
> -
>
> Key: HDFS-14941
> URL: https://issues.apache.org/jira/browse/HDFS-14941
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
>

[jira] [Comment Edited] (HDFS-14941) Potential editlog race condition can cause corrupted file

2019-10-30 Thread Konstantin Shvachko (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963518#comment-16963518
 ] 

Konstantin Shvachko edited comment on HDFS-14941 at 10/30/19 11:44 PM:
---

Great find, [~vagarychen].
 Yes, this is an _*existing problem in the current code*_, but it is exacerbated 
by fast journal tailing, since a small tail-edits.period increases the 
probability of the race condition. In fact, we occasionally see missing blocks 
after failing over to the SBN on our clusters with pre-CRS code.

I want to suggest another _*possible solution 4*_:
 4. Delay incrementing the global {{generationStamp}} on the standby until it 
actually sees the block with that stamp. That is, when {{OP_SET_GENSTAMP_V2}} 
comes to the SBN, it records the new value in a new variable 
{{lastReportedGenStamp}}. When an {{OP_ADD_BLOCK}} with a genStamp that equals 
{{lastReportedGenStamp}} comes, the global {{generationStamp}} is set to 
{{lastReportedGenStamp}}. This should also solve the race condition.

We were looking at {{updateBlockForPipeline()}} and found that it could be 
_*another source of missing blocks on the SBN*_, because it only increments the 
global {{generationStamp}} but does not update the generation stamp of the 
block itself. The new gen stamp will eventually be applied by a subsequent 
{{OP_ADD}}, {{OP_CLOSE}}, or {{OP_UPDATE_BLOCKS}}, but the race condition with 
IBRs is still present. If an IBR comes in after the global genStamp is 
incremented, the replica will not be queued as a future one. Instead, since the 
block's genStamp has not yet been updated by the subsequent {{OP_ADD}} or such, 
the replica will be invalidated on the SBN.


was (Author: shv):
Great find [~vagarychen].
 Yes this is _*existing problem in current code*_, but it is exacerbated by 
fast journal tailing since  small tail-edits.period increases the probability 
of the race condition. In fact we occasionally see missing blocks after failing 
over to SBN on our clusters with pre-CRS code.

Want to suggest another _*possible solution 4*_.
 4. Delay incrementing the global {{generationStamp}} on standby until it 
actually sees the block with that stamp. That is, when {{OP_SET_GENSTAMP_V2}} 
comes to SBN it records the new value in a new variable 
{{lastReportedGenStamp}}. When {{OP_ADD_BLOCK}} with the genStamp that equals 
{{lastReportedGenStamp}} comes the global {{generationStamp}} is set to 
{{lastReportedGenStamp}}. This should also solve the race condition.

We were looking at {{updateBlockForPipeline()}} and found out that it could be 
_*another source of missing blocks on SBN*_, because it only increments the 
global {{generationStamp}}, but does not update the generation stamp of that 
block. The new gen stamp will be eventually updated by subsequent {{OP_ADD}}, 
or {{OP_CLOSE}}, or {{OP_UPDATE_BLOCKS}}. But the race condition with IBRs will 
be still present. If an IBR comes after incrementing the global genStamp the 
replica will not be in the future, but since the block genStamp has not be yet 
updated by the subsequent {{OP_ADD}} or such, this replica will be invalidated 
on SBN.

> Potential editlog race condition can cause corrupted file
> -
>
> Key: HDFS-14941
> URL: https://issues.apache.org/jira/browse/HDFS-14941
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
>