[ 
https://issues.apache.org/jira/browse/HDFS-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14247283#comment-14247283
 ] 

Lei (Eddy) Xu commented on HDFS-6440:
-------------------------------------

Hey, [~jesse_yates] Thanks for your answers!

I have a few further questions regarding the patch:

1. I did not see where {{isPrimearyCheckPointer}} is set to {{false}}. 

{code:title=StandbyCheckpointer.java}
private boolean isPrimaryCheckPointer = true;
...
if (upload.get() == TransferFsImage.TransferResult.SUCCESS) {
   this.isPrimaryCheckPointer = true;
   //avoid getting the rest of the results - we don't care since we had a 
successful upload
   break;
}
{code}

I guess the default value of {{isPrimaryCheckPointer}} might be a typo, which 
should be {{false}}. Moreover, is there a case that    SNN switches from 
primary check pointer to non-primary check pointer?


2. Is the following condition correct? I think only {{sendRequest}} is needed. 

{code:title=StandbyCheckpointer.java}
if (needCheckpoint && sendRequest) {
{code}
 
Also in the old code,

{code}
} else if (secsSinceLast >= checkpointConf.getPeriod()) {
            LOG.info("Triggering checkpoint because it has been " +
                secsSinceLast + " seconds since the last checkpoint, which " +
                "exceeds the configured interval " + 
checkpointConf.getPeriod());
            needCheckpoint = true;
          }
{code}

Does it implies that if {{secsSinceLast >= checkpointConf.getPeriod()}} is 
{{true}} then {{secsSinceLast >= checkpointConf.getQuietPeriod()}} is always 
{{true}}, for default {{quite multiplier}} value? If it is the case, are these 
duplicated conditions?

It looks like that it might be easier to let ANN calculate the above 
conditions, as it has the actual system-wide knowledge of last upload and last 
txnid. It could be a nice optimization later.

3. When it uploads fsimage, are {{SC_CONFLICT}} and {{SC_EXPECTATION_FAILED}} 
not handled in the SNN in the current patch? Do you plan to handle them in a 
following patch?

4. Could you set {{EditLogTailer#maxRetries}} to {{private final}}? Do we need 
to enforce an acceptable value range for {{maxRetries}}? For instance, in the 
following code, it would not try every NN when {{nextNN = nns.size() - 1}} and 
{{maxRetries = 1}}

{code}
// if we have reached the max loop count, quit by returning null
  if (nextNN / nns.size() >= maxRetries) {
      return null;
  }
{code}

5. There are a few changes due to format, e.g., in {{doCheckpointing()}}. Could 
you remove them to reduce the size of the patch?

Also the following code is indented incorrectly.

{code}
int i = 0;
      for (; i < uploads.size(); i++) {
        Future<TransferFsImage.TransferResult> upload = uploads.get(i);
        try {
          // TODO should there be some smarts here about retries nodes that are 
not the active NN?
          if (upload.get() == TransferFsImage.TransferResult.SUCCESS) {
            this.isPrimaryCheckPointer = true;
            //avoid getting the rest of the results - we don't care since we 
had a successful upload
            break;
          }
        } catch (ExecutionException e) {
          ioe = new IOException("Exception during image upload: " + 
e.getMessage(),
              e.getCause());
          break;
        } catch (InterruptedException e) {
          ie = null;
          break;
        }
      }
{code}

Other parts LGTM. Thanks again for working on this, [~jesse_yates]!

> Support more than 2 NameNodes
> -----------------------------
>
>                 Key: HDFS-6440
>                 URL: https://issues.apache.org/jira/browse/HDFS-6440
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: auto-failover, ha, namenode
>    Affects Versions: 2.4.0
>            Reporter: Jesse Yates
>            Assignee: Jesse Yates
>         Attachments: Multiple-Standby-NameNodes_V1.pdf, 
> hdfs-6440-cdh-4.5-full.patch, hdfs-multiple-snn-trunk-v0.patch
>
>
> Most of the work is already done to support more than 2 NameNodes (one 
> active, one standby). This would be the last bit to support running multiple 
> _standby_ NameNodes; one of the standbys should be available for fail-over.
> Mostly, this is a matter of updating how we parse configurations, some 
> complexity around managing the checkpointing, and updating a whole lot of 
> tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to