[jira] [Commented] (HDFS-15036) Active NameNode should not silently fail the image transfer

Chen Liang (Jira) Mon, 09 Dec 2019 14:34:33 -0800


    [ https://issues.apache.org/jira/browse/HDFS-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16991998#comment-16991998 ]


Chen Liang commented on HDFS-15036:
-----------------------------------

I spent some time debugging this issue, and I think I have found the cause.

In HDFS-12979, we introduced logic that rejects an image upload request if the 
image being uploaded is not far enough ahead of the previous image. This 
prevents the scenario where, with multiple SbNs, all of them upload images to 
the ANN too frequently. The rejection is considered correct behavior, so 
nothing is logged to indicate an error (this is the "silent" part). 
Both the ANN and the SbN simply ignore it and proceed.

But it now appears that a side effect of this change is that, during a rolling 
upgrade (RU), the rollback image also has to go through this check and can be 
rejected as well. If this happens, the SbN proceeds assuming the upload is 
done, while the ANN proceeds without having received the rollback image. The 
upload silently fails in this case.

The check that rejects the upload is in {{ImageServlet}}. In my earlier test, 
I simply commented out the whole block below and the issue seemed to go away. 
But I think the proper fix is probably to add one more condition so that the 
rejection only applies to regular image uploads, like the marked line in the 
following code snippet. I have not actually tested changing it this way:
{code}
              if (checkRecentImageEnable &&
                  NameNodeFile.IMAGE.equals(parsedParams.getNameNodeFile()) && // <--- this should fix the issue
                  timeDelta < checkpointPeriod &&
                  txid - lastCheckpointTxid < checkpointTxnCount) {
                // only when at least one of two conditions are met we accept
                // a new fsImage
                // 1. most recent image's txid is too far behind
                // 2. last checkpoint time was too old
                response.sendError(HttpServletResponse.SC_CONFLICT,
                    "Most recent checkpoint is neither too far behind in "
                        + "txid, nor too old. New txnid cnt is "
                        + (txid - lastCheckpointTxid)
                        + ", expecting at least " + checkpointTxnCount
                        + " unless too long since last upload.");
                return null;
              }
{code}
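
To make the intended behavior concrete, here is a minimal standalone sketch of 
the rejection condition with the extra file-type check. It is not the actual 
{{ImageServlet}} code: the class {{ImageUploadCheck}}, the enum {{ImageKind}} 
and the method {{shouldReject}} are hypothetical names used only for 
illustration, and the parameters simply mirror the variables in the snippet 
above.

```java
// Illustrative sketch only. All names here are hypothetical stand-ins,
// not the real ImageServlet / NameNodeFile types from Hadoop.
final class ImageUploadCheck {

    /** Stand-in for the kind of image file carried by an upload request. */
    enum ImageKind { IMAGE, IMAGE_ROLLBACK }

    private ImageUploadCheck() {
    }

    /**
     * Returns true when the upload should be rejected as "too recent".
     * Only regular images are subject to the check, so a rollback image
     * is never rejected here.
     */
    static boolean shouldReject(boolean checkRecentImageEnable,
                                ImageKind kind,
                                long timeDelta,
                                long checkpointPeriod,
                                long txid,
                                long lastCheckpointTxid,
                                long checkpointTxnCount) {
        return checkRecentImageEnable
            && kind == ImageKind.IMAGE                      // the proposed extra condition
            && timeDelta < checkpointPeriod                 // last checkpoint is not too old
            && txid - lastCheckpointTxid < checkpointTxnCount; // not far enough ahead in txid
    }
}
```

Under these assumptions, a regular image that is both recent in time and close 
in txid is still rejected, while a rollback image with the same numbers is 
accepted.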


> Active NameNode should not silently fail the image transfer
> -----------------------------------------------------------
>
>                 Key: HDFS-15036
>                 URL: https://issues.apache.org/jira/browse/HDFS-15036
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.10.0
>            Reporter: Konstantin Shvachko
>            Assignee: Chao Sun
>            Priority: Major
>
> Image transfer from Standby NameNode to Active silently fails on Active, 
> without any logging and not notifying the receiver side.



