[ https://issues.apache.org/jira/browse/HDFS-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16991998#comment-16991998 ]
Chen Liang commented on HDFS-15036:
-----------------------------------

I spent some time debugging this issue, and I think I found the cause.

In HDFS-12979, we introduced logic that rejects an image upload request if the image being uploaded is not far enough ahead of the previous image. This prevents the scenario where, with multiple SbNs, all SbNs upload images to the ANN too frequently. This is considered correct behavior, so nothing is logged as an error here (the "silent" part). Both ANN and SbN simply ignore the rejection and proceed.

But it now appears that a side effect of this change is that, during a rolling upgrade (RU), the rollback image also has to go through this check, and it can also be rejected. If this happens, the SbN proceeds assuming the upload is done, while the ANN proceeds without having received the rollback image. The upload silently fails in this case.

The check logic that rejects the upload is in {{ImageServlet}}. In my earlier test, I just commented out the whole block below and the issue seemed to go away. But I think the fix is probably just adding a new check to ensure this rejection only applies to regular image uploads, like the newly added line in the following code snippet. I haven't actually tested changing it this way:

{code}
    if (checkRecentImageEnable &&
        NameNodeFile.IMAGE.equals(parsedParams.getNameNodeFile()) && // <--- this should fix the issue
        timeDelta < checkpointPeriod &&
        txid - lastCheckpointTxid < checkpointTxnCount) {
      // only when at least one of two conditions are met we accept
      // a new fsImage
      // 1. most recent image's txid is too far behind
      // 2. last checkpoint time was too old
      response.sendError(HttpServletResponse.SC_CONFLICT,
          "Most recent checkpoint is neither too far behind in "
          + "txid, nor too old. New txnid cnt is "
          + (txid - lastCheckpointTxid)
          + ", expecting at least " + checkpointTxnCount
          + " unless too long since last upload.");
      return null;
    }
{code}

> Active NameNode should not silently fail the image transfer
> -----------------------------------------------------------
>
>                 Key: HDFS-15036
>                 URL: https://issues.apache.org/jira/browse/HDFS-15036
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.10.0
>            Reporter: Konstantin Shvachko
>            Assignee: Chao Sun
>            Priority: Major
>
> Image transfer from Standby NameNode to Active silently fails on Active,
> without any logging and not notifying the receiver side.
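
To make the effect of the proposed extra condition in the comment's snippet easier to see, the rejection predicate can be modeled on its own. This is only a minimal sketch, not the actual {{ImageServlet}} code: the class {{RecentImageCheckSketch}} and the method {{shouldReject}} are hypothetical names, the two-value enum is a stand-in for the real {{NameNodeFile}} type, and the variables from the snippet are passed in as plain parameters.

```java
// Hypothetical, self-contained model of the upload check discussed in the
// comment. It is NOT the real ImageServlet code; names are illustrative.
final class RecentImageCheckSketch {

  // Stand-in for the NameNodeFile enum used in the snippet; only the two
  // values needed for this illustration are modeled here.
  enum NameNodeFile { IMAGE, IMAGE_ROLLBACK }

  /**
   * Returns true when the upload would be rejected (the branch that sends
   * SC_CONFLICT in the snippet), false when the image would be accepted.
   */
  static boolean shouldReject(boolean checkRecentImageEnable,
                              NameNodeFile file,
                              long timeDelta,
                              long checkpointPeriod,
                              long txid,
                              long lastCheckpointTxid,
                              long checkpointTxnCount) {
    return checkRecentImageEnable
        // Proposed extra condition: only regular images are subject to
        // the "too recent" rejection; rollback images skip it.
        && file == NameNodeFile.IMAGE
        // Last checkpoint is not too old ...
        && timeDelta < checkpointPeriod
        // ... and the new image is not far enough ahead in txid.
        && txid - lastCheckpointTxid < checkpointTxnCount;
  }
}
```

With identical timing and txid values, a regular image is rejected while a rollback image passes, which is the behavior the proposed change is meant to produce. Whether this is sufficient in the real servlet is untested, as noted in the comment.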