[ https://issues.apache.org/jira/browse/HDFS-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14247283#comment-14247283 ]
Lei (Eddy) Xu commented on HDFS-6440:
-------------------------------------

Hey, [~jesse_yates]

Thanks for your answers! I have a few further questions regarding the patch:

1. I did not see where {{isPrimaryCheckPointer}} is set to {{false}}.
{code:title=StandbyCheckpointer.java}
private boolean isPrimaryCheckPointer = true;
...
if (upload.get() == TransferFsImage.TransferResult.SUCCESS) {
  this.isPrimaryCheckPointer = true;
  //avoid getting the rest of the results - we don't care since we had a successful upload
  break;
}
{code}
I guess the default value of {{isPrimaryCheckPointer}} might be a typo and should be {{false}}. Moreover, is there a case where an SNN switches from primary checkpointer back to non-primary checkpointer? (A small standalone sketch of one possible reset path is appended at the end of this comment.)

2. Is the following condition correct? I think only {{sendRequest}} is needed.
{code:title=StandbyCheckpointer.java}
if (needCheckpoint && sendRequest) {
{code}
Also, in the old code:
{code}
} else if (secsSinceLast >= checkpointConf.getPeriod()) {
  LOG.info("Triggering checkpoint because it has been " + secsSinceLast +
      " seconds since the last checkpoint, which " +
      "exceeds the configured interval " + checkpointConf.getPeriod());
  needCheckpoint = true;
}
{code}
Does it imply that if {{secsSinceLast >= checkpointConf.getPeriod()}} is {{true}}, then {{secsSinceLast >= checkpointConf.getQuietPeriod()}} is always {{true}} for the default {{quiet multiplier}} value? If that is the case, are these duplicated conditions?

It looks like it might be easier to let the ANN calculate the above conditions, since it has the actual system-wide knowledge of the last upload and the last txnid. That could be a nice optimization later.

3. When it uploads the fsimage, are {{SC_CONFLICT}} and {{SC_EXPECTATION_FAILED}} not handled on the SNN side in the current patch? Do you plan to handle them in a follow-up patch? (See the sketch appended below for one hypothetical mapping.)

4. Could you make {{EditLogTailer#maxRetries}} {{private final}}? Do we need to enforce an acceptable value range for {{maxRetries}}? For instance, in the following code, it would not try every NN when {{nextNN = nns.size() - 1}} and {{maxRetries = 1}} (a standalone sketch of this case is appended below).
{code}
// if we have reached the max loop count, quit by returning null
if (nextNN / nns.size() >= maxRetries) {
  return null;
}
{code}

5. There are a few formatting-only changes, e.g., in {{doCheckpointing()}}. Could you remove them to reduce the size of the patch? Also, the following code is indented incorrectly.
{code}
int i = 0;
for (; i < uploads.size(); i++) {
  Future<TransferFsImage.TransferResult> upload = uploads.get(i);
  try {
    // TODO should there be some smarts here about retries nodes that are not the active NN?
    if (upload.get() == TransferFsImage.TransferResult.SUCCESS) {
      this.isPrimaryCheckPointer = true;
      //avoid getting the rest of the results - we don't care since we had a successful upload
      break;
    }
  } catch (ExecutionException e) {
    ioe = new IOException("Exception during image upload: " + e.getMessage(),
        e.getCause());
    break;
  } catch (InterruptedException e) {
    ie = null;
    break;
  }
}
{code}

Other parts LGTM. Thanks again for working on this, [~jesse_yates]!
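Appended below are a few standalone sketches for the points above; all of them are illustrations and assumptions on my side, not code taken from the patch.

For point 1, this is one shape the flag handling could take if {{isPrimaryCheckPointer}} is meant to record whether this SNN performed the most recent successful upload. The {{wonUpload}} local, the stand-in {{TransferResult}} enum, and the unconditional assignment after the loop are my assumptions:
{code:title=PrimaryCheckpointerFlagSketch.java}
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class PrimaryCheckpointerFlagSketch {
  // Stand-in for TransferFsImage.TransferResult; values here are illustrative.
  enum TransferResult { SUCCESS, NOT_ACTIVE }

  // Default of false, per the question above.
  private boolean isPrimaryCheckPointer = false;

  // One possible shape for the result-collection loop: the flag is assigned
  // unconditionally after the loop, so it drops back to false whenever no
  // upload from this SNN succeeded (another checkpointer won, or all failed).
  void recordUploadResults(List<? extends Future<TransferResult>> uploads)
      throws InterruptedException, ExecutionException {
    boolean wonUpload = false;
    for (Future<TransferResult> upload : uploads) {
      if (upload.get() == TransferResult.SUCCESS) {
        wonUpload = true; // a single success is enough
        break;
      }
    }
    isPrimaryCheckPointer = wonUpload;
  }

  public static void main(String[] args) throws Exception {
    PrimaryCheckpointerFlagSketch sketch = new PrimaryCheckpointerFlagSketch();
    sketch.recordUploadResults(Arrays.asList(
        CompletableFuture.completedFuture(TransferResult.NOT_ACTIVE)));
    // Prints false: the flag is reset because this SNN did not win the upload.
    System.out.println(sketch.isPrimaryCheckPointer);
  }
}
{code}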
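For point 3, just to make the question concrete, a hypothetical classification on the SNN side might look like the sketch below. The outcome names and the meaning I attach to each status code are guesses for discussion, not the patch's actual {{TransferResult}} semantics; only the numeric values of the two servlet constants are factual.
{code:title=UploadStatusSketch.java}
public class UploadStatusSketch {
  // Numeric values behind the servlet constants named above.
  static final int SC_OK = 200;                 // HttpServletResponse.SC_OK
  static final int SC_CONFLICT = 409;           // HttpServletResponse.SC_CONFLICT
  static final int SC_EXPECTATION_FAILED = 417; // HttpServletResponse.SC_EXPECTATION_FAILED

  // Outcome names and the meaning attached to each status are assumptions
  // made for this discussion, not the patch's TransferResult semantics.
  enum UploadOutcome { SUCCESS, TARGET_CANNOT_ACCEPT, ANOTHER_UPLOAD_WON, UNEXPECTED }

  static UploadOutcome classify(int httpStatus) {
    switch (httpStatus) {
      case SC_OK:                 return UploadOutcome.SUCCESS;
      case SC_EXPECTATION_FAILED: return UploadOutcome.TARGET_CANNOT_ACCEPT;
      case SC_CONFLICT:           return UploadOutcome.ANOTHER_UPLOAD_WON;
      default:                    return UploadOutcome.UNEXPECTED;
    }
  }

  public static void main(String[] args) {
    // Example: an SNN deciding whether a non-200 response needs a retry or
    // can be ignored because some other checkpointer already succeeded.
    System.out.println(classify(SC_CONFLICT));           // ANOTHER_UPLOAD_WON
    System.out.println(classify(SC_EXPECTATION_FAILED)); // TARGET_CANNOT_ACCEPT
  }
}
{code}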
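For point 4, here is the case I mean, as a standalone sketch with made-up numbers (three NNs, {{maxRetries = 1}}, starting position just before a wrap-around); only the guard expression is taken from the snippet above.
{code:title=MaxRetriesSketch.java}
public class MaxRetriesSketch {
  public static void main(String[] args) {
    int nnCount = 3;          // hypothetical number of configured NNs
    int maxRetries = 1;       // smallest value a range check would presumably allow
    int nextNN = nnCount - 1; // start just before a wrap-around

    int attempts = 0;
    while (nextNN / nnCount < maxRetries) { // same guard as the snippet above
      int target = nextNN % nnCount;        // index of the NN that would be contacted
      System.out.println("trying NN #" + target);
      attempts++;
      nextNN++;
    }
    System.out.println("attempted " + attempts + " of " + nnCount + " NNs");
  }
}
{code}
With these inputs it reports that only one of the three NNs was attempted before the guard returns {{null}}, which is the situation I was worried about.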
> Support more than 2 NameNodes
> -----------------------------
>
>                 Key: HDFS-6440
>                 URL: https://issues.apache.org/jira/browse/HDFS-6440
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: auto-failover, ha, namenode
>    Affects Versions: 2.4.0
>            Reporter: Jesse Yates
>            Assignee: Jesse Yates
>         Attachments: Multiple-Standby-NameNodes_V1.pdf, hdfs-6440-cdh-4.5-full.patch, hdfs-multiple-snn-trunk-v0.patch
>
> Most of the work is already done to support more than 2 NameNodes (one active, one standby). This would be the last bit to support running multiple _standby_ NameNodes; one of the standbys should be available for fail-over.
> Mostly, this is a matter of updating how we parse configurations, some complexity around managing the checkpointing, and updating a whole lot of tests.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)