[ 
https://issues.apache.org/jira/browse/HDFS-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249207#comment-14249207
 ] 

Jesse Yates commented on HDFS-6440:
-----------------------------------

I'll post the updated patch somewhere, if you like. In the meantime, 
responses!

I think some stuff got a little messed up with the trunk port... these are all 
great catches!

bq. I guess the default value of isPrimaryCheckPointer might be a typo, which 
should be false. 
Yup and

bq. is there a case that SNN switches from primary check pointer to non-primary 
check pointer

Not that I can find either :) It should be that we track success in the 
transfer result from the upload and then update the primary checkpointer 
status based on that (so if no upload succeeds, this node is no longer the 
primary).
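
A minimal sketch of that idea, assuming one boolean per attempted upload. The class and method names here are hypothetical and not taken from the patch:

```java
import java.util.List;

// Hypothetical sketch: names are illustrative, not the patch's actual code.
final class PrimaryCheckpointerTracker {
  // Defaults to false: a standby is not the primary checkpointer
  // until at least one of its uploads has succeeded.
  private boolean isPrimaryCheckPointer = false;

  /** One entry per NameNode we tried to upload to; true means it was accepted. */
  void updateFromUploadResults(List<Boolean> uploadSucceeded) {
    boolean anySuccess = false;
    for (boolean ok : uploadSucceeded) {
      anySuccess |= ok;
    }
    // If no upload succeeded, this node stops being the primary.
    isPrimaryCheckPointer = anySuccess;
  }

  boolean isPrimaryCheckPointer() {
    return isPrimaryCheckPointer;
  }
}
```

Recomputing the flag after every round of uploads is what lets a node switch from primary back to non-primary.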

bq. 2. Is the following condition correct? I think only sendRequest is needed.

Kinda. I think it should actually be:
{code}
          if (needCheckpoint) {
            doCheckpoint(sendRequest);
          }
{code}

and then make and save the checkpoint, but only send it if we need to 
(sendRequest == true).
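
Roughly, the split could look like the sketch below. The counters are stand-ins for saving the image and uploading it, and everything other than needCheckpoint, sendRequest and doCheckpoint is a hypothetical name:

```java
// Hypothetical sketch: counters stand in for the real save and upload work.
final class CheckpointStep {
  int checkpointsSaved = 0;
  int uploadsSent = 0;

  void maybeCheckpoint(boolean needCheckpoint, boolean sendRequest) {
    if (needCheckpoint) {
      doCheckpoint(sendRequest);
    }
  }

  private void doCheckpoint(boolean sendRequest) {
    // Always make and save the checkpoint locally.
    checkpointsSaved++;
    // Only ship it to the other NameNodes when asked to.
    if (sendRequest) {
      uploadsSent++;
    }
  }
}
```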

bq. If it is the case, are these duplicated conditions?

The quiet period should be larger than the usual checking period (multiplier is 
1.5), so it's the separation of sending the request vs. taking the 
checkpoint that comes into conflict here. I think this logic makes more sense 
with the above change for separating the use of needCheckpoint and 
sendRequest.
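
To make the timing concrete, here is a sketch of how the two conditions might be computed independently. It assumes the quiet period is simply the period times 1.5 and that a primary checkpointer always sends; the class, its methods and those rules are assumptions for illustration, not the patch's code:

```java
// Hypothetical sketch: the formulas are assumptions, not the patch's code.
final class QuietPeriodPolicy {
  static final double QUIET_MULTIPLIER = 1.5;

  private final long periodSecs;

  QuietPeriodPolicy(long periodSecs) {
    this.periodSecs = periodSecs;
  }

  /** Quiet period is longer than the usual period by the multiplier. */
  long quietPeriodSecs() {
    return (long) (periodSecs * QUIET_MULTIPLIER);
  }

  /** Take a checkpoint once the usual period has elapsed. */
  boolean needCheckpoint(long secsSinceLastCheckpoint) {
    return secsSinceLastCheckpoint >= periodSecs;
  }

  /** Send if we are the primary, or if nobody has uploaded for a full quiet period. */
  boolean sendRequest(boolean isPrimaryCheckPointer, long secsSinceLastUpload) {
    return isPrimaryCheckPointer || secsSinceLastUpload >= quietPeriodSecs();
  }
}
```

With a 60 second period, a non-primary standby would take a checkpoint at 60 seconds but hold off sending until 90 seconds have passed without an upload.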

bq. might be easier to let ANN calculate the above conditions... It could be a 
nice optimization later.

Definitely! Was trying to keep the change footprint down.

bq. When it uploads fsimage, are SC_CONFLICT and SC_EXPECTATION_FAILED not 
handled in the SNN in the current patch

They somewhat are - they don't throw an exception back out, but are marked as 
'failures'. Either way, in the new version of the patch (coming), in keeping 
with the changes for setting isPrimaryCheckpointer described above, the 
primaryCheckpointStatus is set to the correct value. 

Either it got a NOT_ACTIVE_NAMENODE_FAILURE on the other SNN, or it tried to 
upload an old transaction to the ANN (OLD_TRANSACTION_ID_FAILURE). If it's the 
first, the other NN could succeed (making this the pSNN); if it's an older 
transaction, it shouldn't be the pSNN. This comes with the caveat you mentioned 
in your last comment about both SNNs thinking they are the pSNN.
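
As a sketch of "marked as failures without throwing": the enum below maps response codes to results, and only an unexpected code raises an exception. The pairing of 409 (SC_CONFLICT) with OLD_TRANSACTION_ID_FAILURE and 417 (SC_EXPECTATION_FAILED) with NOT_ACTIVE_NAMENODE_FAILURE is my assumption here, and UNEXPECTED_FAILURE and UploadOutcome are hypothetical names:

```java
// Hypothetical sketch: the status-to-result pairing is an assumption.
enum TransferResult {
  SUCCESS(200, false),
  NOT_ACTIVE_NAMENODE_FAILURE(417, false), // 417 = SC_EXPECTATION_FAILED
  OLD_TRANSACTION_ID_FAILURE(409, false),  // 409 = SC_CONFLICT
  UNEXPECTED_FAILURE(-1, true);

  final int status;
  final boolean shouldThrow;

  TransferResult(int status, boolean shouldThrow) {
    this.status = status;
    this.shouldThrow = shouldThrow;
  }

  static TransferResult fromStatus(int status) {
    for (TransferResult r : values()) {
      if (r.status == status) {
        return r;
      }
    }
    return UNEXPECTED_FAILURE;
  }
}

final class UploadOutcome {
  /** Returns true only on success; known failures are reported, not thrown. */
  static boolean accepted(int httpStatus) {
    TransferResult result = TransferResult.fromStatus(httpStatus);
    if (result.shouldThrow) {
      throw new IllegalStateException("unexpected upload response: " + httpStatus);
    }
    return result == TransferResult.SUCCESS;
  }
}
```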

bq.  Could you set EditLogTailer#maxRetries to private final?

That wasn't part of my change set - the code was already there. It looks like 
it's used to set the edit log in testing.

bq. Do we need to enforce an acceptable value range for maxRetries

An interesting idea! I didn't want to spin forever there; instead, I wanted to 
surface the issue to the user by bringing down the NN. My question back is, is 
there another process that will bring down the NN if it cannot reach the other 
NNs? Otherwise, it can get hopelessly out of date and look like a valid standby 
when it really isn't.

bq. NN when nextNN = nns.size() - 1 and maxRetries = 1

Oh, yeah - that's a problem, regardless of the above. Pending patch should fix 
that.
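
For illustration only, one way to cycle through the NNs so that the index wraps safely and the number of attempts is bounded. This is not the pending patch; the class, its validation rules and the exception used as a stand-in for bringing down the NN are all hypothetical:

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of bounded round-robin selection, not the pending patch.
final class RoundRobinNameNodes<T> {
  private final List<T> nns;
  private final int maxRetries;
  private int nextNN = 0;

  RoundRobinNameNodes(List<T> nns, int maxRetries) {
    if (nns.isEmpty()) {
      throw new IllegalArgumentException("need at least one NameNode");
    }
    if (maxRetries < 1) {
      throw new IllegalArgumentException("maxRetries must be >= 1");
    }
    this.nns = nns;
    this.maxRetries = maxRetries;
  }

  /** Tries at most maxRetries candidates, wrapping around the list. */
  T firstReachable(Predicate<T> reachable) {
    for (int attempt = 0; attempt < maxRetries; attempt++) {
      T candidate = nns.get(nextNN);
      // Modulo keeps the index valid even when nextNN == nns.size() - 1.
      nextNN = (nextNN + 1) % nns.size();
      if (reachable.test(candidate)) {
        return candidate;
      }
    }
    // Stand-in for surfacing the problem instead of spinning forever.
    throw new IllegalStateException("no NameNode reachable after " + maxRetries + " tries");
  }
}
```

Rejecting maxRetries below 1 in the constructor is one simple form of the value-range check raised above.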

Coming patch should also fix the remainder of the formatting issues.

> Support more than 2 NameNodes
> -----------------------------
>
>                 Key: HDFS-6440
>                 URL: https://issues.apache.org/jira/browse/HDFS-6440
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: auto-failover, ha, namenode
>    Affects Versions: 2.4.0
>            Reporter: Jesse Yates
>            Assignee: Jesse Yates
>         Attachments: Multiple-Standby-NameNodes_V1.pdf, 
> hdfs-6440-cdh-4.5-full.patch, hdfs-multiple-snn-trunk-v0.patch
>
>
> Most of the work is already done to support more than 2 NameNodes (one 
> active, one standby). This would be the last bit to support running multiple 
> _standby_ NameNodes; one of the standbys should be available for fail-over.
> Mostly, this is a matter of updating how we parse configurations, some 
> complexity around managing the checkpointing, and updating a whole lot of 
> tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
