[ 
https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938931#comment-13938931
 ] 

Jing Zhao commented on HDFS-6089:
---------------------------------

Thanks for the comments, Andrew and Todd!

bq. In EditLogTailer#doTailEdits, I believe that rolling the edit log right 
before is intended to freshen up the edit log for consumption by the SbNN.
But in the current code, the auto trigger still runs periodically, 
which means we cannot guarantee that we roll the edit log before we call 
doTailEdits. During failover, we call editLog.recoverUnclosedStreams() and 
EditLogTailer#catchupDuringFailover in FSNamesystem#startActiveServices to 
guarantee that the SBN tails all of the edit log. Before failover, if we make 
the auto roller on the active NN more aggressive (as you suggested), we can 
still guarantee that the SBN will not have to replay much on a failover. What 
do you think?

bq. we'll need to update its check period and thresholds to be more aggressive.
Yes, agreed. We should assign a smaller value to the sleep interval (maybe 
2 min, just like the SBN).
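
To make the idea concrete, here is a rough sketch of a threshold-based auto 
roller with a configurable check interval. This is not the actual 
FSNamesystem code: the class name, fields, and the transaction-count threshold 
are all hypothetical, and the real roll is only indicated by a comment.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch only; names and structure are assumptions, not HDFS code.
class AutoRollerSketch implements Runnable {
    private final AtomicLong txnsSinceLastRoll = new AtomicLong();
    private final long rollThresholdTxns;  // assumed: roll once this many txns are pending
    private final long checkIntervalMs;    // assumed: sleep between checks
    private volatile boolean running = true;

    AutoRollerSketch(long rollThresholdTxns, long checkIntervalMs) {
        this.rollThresholdTxns = rollThresholdTxns;
        this.checkIntervalMs = checkIntervalMs;
    }

    // Called for every edit written to the current log segment.
    void logEdit() {
        txnsSinceLastRoll.incrementAndGet();
    }

    // Returns true if the threshold was reached and a roll was triggered.
    boolean checkAndRoll() {
        long pending = txnsSinceLastRoll.get();
        if (pending < rollThresholdTxns) {
            return false;
        }
        // A real implementation would roll the edit log here.
        // Subtract only what we observed so concurrent edits are not lost.
        txnsSinceLastRoll.addAndGet(-pending);
        return true;
    }

    @Override
    public void run() {
        while (running) {
            checkAndRoll();
            try {
                Thread.sleep(checkIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    void stop() {
        running = false;
    }
}
```

With a shorter checkIntervalMs (and a lower threshold), the open segment 
stays small, so the SBN has less left to replay when a failover happens.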

bq. Maybe we should just have a shorter timeout on the rollEditLog call. Or 
somehow..
We can also do this. But having two auto rollers running on the two NNs at the 
same time still does not seem necessary to me.
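
For reference, bounding a blocking call with a timeout could look roughly like 
the following. This is a generic java.util.concurrent sketch, not how the 
rollEditLog RPC is actually invoked; in HDFS the timeout would more likely be 
set at the RPC layer. TimedRollSketch and rollWithTimeout are hypothetical 
names, and the Callable stands in for the remote call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch only; not the HDFS implementation.
class TimedRollSketch {
    // Returns true only if rollCall completed successfully within timeoutMs.
    static boolean rollWithTimeout(Callable<Void> rollCall, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "roll-edit-log");
            t.setDaemon(true);  // do not keep the JVM alive for a hung call
            return t;
        });
        Future<Void> result = executor.submit(rollCall);
        try {
            result.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            result.cancel(true);  // interrupt the hung call and give up
            return false;
        } catch (ExecutionException e) {
            return false;         // the call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With a short timeout, a suspended active NN would make the call return false 
quickly instead of blocking the caller.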

> Standby NN while transitioning to active throws a connection refused error 
> when the prior active NN process is suspended
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-6089
>                 URL: https://issues.apache.org/jira/browse/HDFS-6089
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.0
>            Reporter: Arpit Gupta
>            Assignee: Jing Zhao
>         Attachments: HDFS-6089.000.patch, HDFS-6089.001.patch
>
>
> The following scenario was tested:
> * Determine Active NN and suspend the process (kill -19)
> * Wait about 60s to let the standby transition to active
> * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to 
> active.
> What was noticed is that sometimes the call to get the service state of nn2 
> got a socket timeout exception.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
