[jira] [Comment Edited] (HDFS-15036) Active NameNode should not silently fail the image transfer

2019-12-12 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995038#comment-16995038
 ] 

Chen Liang edited comment on HDFS-15036 at 12/12/19 11:42 PM:
--

Thanks [~shv]! I've committed to trunk and branch-2, and will commit to 
branch-3.2 and branch-3.1 shortly as well.


was (Author: vagarychen):
Thanks [~shv]! I've committed to trunk and branch-2.

> Active NameNode should not silently fail the image transfer
> ---
>
> Key: HDFS-15036
> URL: https://issues.apache.org/jira/browse/HDFS-15036
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.10.0
> Reporter: Konstantin Shvachko
> Assignee: Chen Liang
> Priority: Major
> Fix For: 3.3.0, 2.10.1
>
> Attachments: HDFS-15036.001.patch, HDFS-15036.002.patch, 
> HDFS-15036.003.patch
>
>
> Image transfer from Standby NameNode to Active silently fails on Active, 
> without any logging and without notifying the receiver side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15036) Active NameNode should not silently fail the image transfer

2019-12-09 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991998#comment-16991998
 ] 

Chen Liang edited comment on HDFS-15036 at 12/9/19 10:36 PM:
-

I spent some time debugging this issue, and I think I have found the cause.

HDFS-12979 introduced logic that rejects an image upload request when the 
image being uploaded is not far enough ahead of the previous image. This 
prevents the scenario where there are multiple Standby NameNodes (SbNs) and 
all of them upload images to the Active NameNode (ANN) too frequently. The 
rejection is considered correct behavior, so nothing is logged to indicate an 
error (hence the "silent" part). Both ANN and SbN simply ignore it and proceed.

However, a side effect of this change is that during a rolling upgrade (RU), 
the rollback image also has to go through this check, so it can be rejected 
too. If that happens, the SbN proceeds as if the upload had completed, while 
the ANN proceeds without ever having received the rollback image. The upload 
silently fails in this case.

The check that rejects the upload is in {{ImageServlet}}. In my earlier test, 
I simply commented out the whole block below and the issue seemed to go away. 
I think the proper fix is probably just to add a condition so that the 
rejection applies only to regular image uploads, not to the rollback image, as 
in the newly added line in the following code snippet. I haven't actually 
tested changing it this way, though:
{code:java}
  if (checkRecentImageEnable &&
      NameNodeFile.IMAGE.equals(parsedParams.getNameNodeFile()) &&
      // ^--- this should fix the issue, as NameNodeFile.IMAGE_ROLLBACK
      //      should bypass this check
      timeDelta < checkpointPeriod &&
      txid - lastCheckpointTxid < checkpointTxnCount) {
    // only when at least one of two conditions are met we accept
    // a new fsImage
    // 1. most recent image's txid is too far behind
    // 2. last checkpoint time was too old
    response.sendError(HttpServletResponse.SC_CONFLICT,
        "Most recent checkpoint is neither too far behind in "
        + "txid, nor too old. New txnid cnt is "
        + (txid - lastCheckpointTxid)
        + ", expecting at least " + checkpointTxnCount
        + " unless too long since last upload.");
    return null;
  }
{code}
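
To make the intended behavior easier to reason about, here is a minimal standalone sketch of the same condition pulled out into a pure function. It is not the actual {{ImageServlet}} code: the class name {{RecentImageCheckSketch}}, the method {{shouldReject}}, the two-value enum standing in for the real {{NameNodeFile}}, and all numeric values are hypothetical and chosen only for illustration. It also assumes that {{timeDelta}} and {{checkpointPeriod}} are expressed in the same time unit.

```java
public class RecentImageCheckSketch {

  /** Hypothetical stand-in for the real NameNodeFile enum (illustration only). */
  public enum NameNodeFile { IMAGE, IMAGE_ROLLBACK }

  /**
   * Returns true when an upload should be rejected as "too recent".
   * Mirrors the condition in the snippet above, including the proposed
   * extra check that restricts the rejection to regular images.
   */
  public static boolean shouldReject(boolean checkRecentImageEnable,
      NameNodeFile nnf, long timeDelta, long checkpointPeriod,
      long txid, long lastCheckpointTxid, long checkpointTxnCount) {
    return checkRecentImageEnable
        // Proposed fix: only regular images are subject to the check,
        // so IMAGE_ROLLBACK always falls through and is accepted.
        && NameNodeFile.IMAGE.equals(nnf)
        // The last checkpoint is not too old ...
        && timeDelta < checkpointPeriod
        // ... and the new image is not far enough ahead in txid.
        && txid - lastCheckpointTxid < checkpointTxnCount;
  }

  public static void main(String[] args) {
    // Made-up thresholds, for illustration only.
    long checkpointPeriod = 3600;
    long checkpointTxnCount = 1_000_000;

    // Regular image, recent and only 500 txns ahead: rejected.
    System.out.println(shouldReject(true, NameNodeFile.IMAGE,
        60, checkpointPeriod, 1500, 1000, checkpointTxnCount));

    // Rollback image with the same numbers: accepted.
    System.out.println(shouldReject(true, NameNodeFile.IMAGE_ROLLBACK,
        60, checkpointPeriod, 1500, 1000, checkpointTxnCount));

    // Regular image, but the last checkpoint is too old: accepted.
    System.out.println(shouldReject(true, NameNodeFile.IMAGE,
        7200, checkpointPeriod, 1500, 1000, checkpointTxnCount));
  }
}
```

With the extra {{NameNodeFile.IMAGE}} condition, a rollback image is never rejected by this check regardless of how recent the last checkpoint is, which is the behavior the proposed fix is after.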


was (Author: vagarychen):
I spent some time debugging this issue, and I think I have found the cause.

HDFS-12979 introduced logic that rejects an image upload request when the 
image being uploaded is not far enough ahead of the previous image. This 
prevents the scenario where there are multiple Standby NameNodes (SbNs) and 
all of them upload images to the Active NameNode (ANN) too frequently. The 
rejection is considered correct behavior, so nothing is logged to indicate an 
error (hence the "silent" part). Both ANN and SbN simply ignore it and proceed.

However, a side effect of this change is that during a rolling upgrade (RU), 
the rollback image also has to go through this check, so it can be rejected 
too. If that happens, the SbN proceeds as if the upload had completed, while 
the ANN proceeds without ever having received the rollback image. The upload 
silently fails in this case.

The check that rejects the upload is in {{ImageServlet}}. In my earlier test, 
I simply commented out the whole block below and the issue seemed to go away. 
I think the proper fix is probably just to add a condition so that the 
rejection applies only to regular image uploads, as in the newly added line in 
the following code snippet. I haven't actually tested changing it this way, 
though:
{code}
  if (checkRecentImageEnable &&
      NameNodeFile.IMAGE.equals(parsedParams.getNameNodeFile()) &&
      // ^--- this should fix the issue
      timeDelta < checkpointPeriod &&
      txid - lastCheckpointTxid < checkpointTxnCount) {
    // only when at least one of two conditions are met we accept
    // a new fsImage
    // 1. most recent image's txid is too far behind
    // 2. last checkpoint time was too old
    response.sendError(HttpServletResponse.SC_CONFLICT,
        "Most recent checkpoint is neither too far behind in "
        + "txid, nor too old. New txnid cnt is "
        + (txid - lastCheckpointTxid)
        + ", expecting at least " + checkpointTxnCount
        + " unless too long since last upload.");
    return null;
  }
{code}


> Active NameNode should not silently fail the image transfer
> ---
>
> Key: HDFS-15036
> URL: