[ 
https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940229#comment-13940229
 ] 

Jing Zhao commented on HDFS-6089:
---------------------------------

Hi Andrew, thanks for the explanation. I think I understand your concern now: 
rolling on the ANN based only on the number of edits may cause issues in some 
scenarios. If there are no further operations, the SBN may wait a long time to 
tail the edits that remain in an in-progress segment.

bq. Checkpointing combines the edit log with the fsimage, and we purge 
unnecessary log segments afterwards.
But I'm still a little confused about this part. I fail to see the difference 
between time-based rolling triggered from the SBN and from the ANN. In the 
current code, the SBN still triggers rolling through an RPC to the ANN. Also, 
this does not affect checkpointing and purging: when the SBN does a checkpoint, 
both the SBN and the ANN purge old edits in their own storage (the SBN does the 
purging before uploading the checkpoint, and the ANN does it after receiving 
the new fsimage).

So a possible solution may be: just let the ANN roll its edit log every 2 
minutes on its own. I think this achieves almost the same effect as the current 
mechanism, without delaying the failover. Or do you see any counterexamples 
with this change?
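To illustrate the idea, here is a minimal sketch (not the actual NameNode code; 
the `shouldRoll` helper and the 2-minute interval are only illustrative) of 
purely time-based rolling on the ANN, with no dependence on how many edits 
have accumulated:

```java
import java.util.concurrent.TimeUnit;

public class Main {
    // Illustrative interval; in HDFS this would come from configuration.
    static final long ROLL_INTERVAL_MS = TimeUnit.MINUTES.toMillis(2);

    // Decide whether the ANN should roll its edit log, based purely on
    // elapsed time since the last roll, independent of the edits count.
    static boolean shouldRoll(long lastRollTimeMs, long nowMs) {
        return nowMs - lastRollTimeMs >= ROLL_INTERVAL_MS;
    }

    public static void main(String[] args) {
        long lastRoll = 0;
        // 90s after the last roll: no roll yet.
        System.out.println(shouldRoll(lastRoll, TimeUnit.SECONDS.toMillis(90)));
        // 120s after the last roll: roll, even with zero new edits,
        // so the SBN never waits long on an in-progress segment.
        System.out.println(shouldRoll(lastRoll, TimeUnit.SECONDS.toMillis(120)));
    }
}
```

Since the roll is driven locally by a timer on the ANN, no SBN-to-ANN RPC is 
involved, which is the simplification argued for above.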

Back to the solution of changing the RPC timeout. It looks like we currently 
set no timeout for this NN-->NN RPC (correct me if I'm wrong). Setting a 
timeout (e.g., 20s, just like the default timeout from client to NN) can of 
course improve the failover time in our test case, but I still prefer the 
solution above because it makes the rolling behavior simpler and more 
predictable (in particular, it removes the RPC call from the SBN to the ANN).
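For reference, the timeout alternative boils down to bounding how long the SBN 
will block on the roll RPC. This is a hypothetical sketch (the `rollEditLogRpc` 
stand-in and the timeout values are made up, not Hadoop's actual RPC layer) of 
that behavior, using a `Future` with a bounded `get`:

```java
import java.util.concurrent.*;

public class Main {
    // Hypothetical stand-in for the SBN-->ANN rollEditLog RPC. The sleep
    // simulates a suspended ANN (kill -19) that never answers.
    static String rollEditLogRpc() throws InterruptedException {
        Thread.sleep(500);
        return "rolled";
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> call = pool.submit(Main::rollEditLogRpc);
        try {
            // Bounded wait: illustrative 100 ms here, 20s in the discussion.
            System.out.println(call.get(100, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Give up on the hung ANN so the transition can proceed.
            call.cancel(true);
            System.out.println("rollEditLog RPC timed out; proceeding with failover");
        }
        pool.shutdownNow();
    }
}
```

This bounds the failover delay but keeps the SBN-to-ANN call in the picture, 
which is why the time-based rolling on the ANN is argued to be simpler.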

> Standby NN while transitioning to active throws a connection refused error 
> when the prior active NN process is suspended
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-6089
>                 URL: https://issues.apache.org/jira/browse/HDFS-6089
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.0
>            Reporter: Arpit Gupta
>            Assignee: Jing Zhao
>         Attachments: HDFS-6089.000.patch, HDFS-6089.001.patch
>
>
> The following scenario was tested:
> * Determine Active NN and suspend the process (kill -19)
> * Wait about 60s to let the standby transition to active
> * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to 
> active.
> What was noticed was that sometimes the call to get the service state of nn2 
> got a socket timeout exception.



--
This message was sent by Atlassian JIRA
(v6.2#6252)