[ 
https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13931045#comment-13931045
 ] 

Jing Zhao commented on HDFS-6089:
---------------------------------

Checked the log with Arpit. The sequence of events appears to be:
1. After NN1 got suspended, NN2 started the transition. It first tried to stop 
the editlog tailer thread.
2. The editlog tailer thread happened to trigger NN1 to roll its editlog right 
before the transition, and this rpc call got stuck since NN1 was suspended.
3. It took a relatively long time (>1min) for the rollEditlog rpc call to 
receive the connection reset exception.
4. During this time, NN2 waited for the tailer thread to die, and the 
fsnamesystem lock was held by the stopStandbyService call.
5. haadmin's getServiceState request could not get a response (since the lock 
was held by the transition thread in NN2) and timed out (its default socket 
timeout is 20s).

In summary, the rollEditlog rpc call that the editlog tailer thread on the 
standby NN makes to the active NN can delay the NN failover.
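
To make the interaction in steps 1-5 concrete, here is a minimal, self-contained 
sketch in Python. It is not HDFS code: the function simulate_failover, the lock 
fs_lock and the two helper threads are hypothetical stand-ins for the 
fsnamesystem lock, the editlog tailer thread and the transition thread, and the 
durations are scaled down from the >1min stall and 20s timeout seen in the log. 
It only assumes, as described above, that the status request needs the same 
lock the transition thread is holding.

```python
import threading
import time


def simulate_failover(rpc_stall_s, client_timeout_s):
    """Return True if the simulated getServiceState call is answered in time.

    rpc_stall_s      -- how long the simulated rollEditlog rpc stays stuck
    client_timeout_s -- how long the simulated haadmin client is willing to wait
    """
    fs_lock = threading.Lock()        # stand-in for the fsnamesystem lock
    lock_taken = threading.Event()    # signals that the transition holds the lock

    def tailer():
        # Stand-in for the rollEditlog rpc stuck on the suspended NN1.
        time.sleep(rpc_stall_s)

    def transition():
        # Stand-in for stopStandbyService: hold the lock while waiting
        # for the tailer thread to die.
        with fs_lock:
            lock_taken.set()
            tailer_thread.join()

    tailer_thread = threading.Thread(target=tailer)
    transition_thread = threading.Thread(target=transition)
    tailer_thread.start()
    transition_thread.start()
    lock_taken.wait()

    # Stand-in for getServiceState: needs the same lock, gives up on timeout.
    answered = fs_lock.acquire(timeout=client_timeout_s)
    if answered:
        fs_lock.release()
    transition_thread.join()
    return answered


if __name__ == "__main__":
    # rpc stalls longer than the client timeout: the status call times out.
    print(simulate_failover(rpc_stall_s=0.5, client_timeout_s=0.1))
    # rpc returns quickly: the status call is answered.
    print(simulate_failover(rpc_stall_s=0.05, client_timeout_s=1.0))
```

In this sketch the status call fails whenever the stalled rpc outlasts the 
client timeout, which mirrors the >1min stall versus the 20s socket timeout 
above.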


> Standby NN while transitioning to active throws a connection refused error 
> when the prior active NN process is suspended
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-6089
>                 URL: https://issues.apache.org/jira/browse/HDFS-6089
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.0
>            Reporter: Arpit Gupta
>            Assignee: Jing Zhao
>
> The following scenario was tested:
> * Determine Active NN and suspend the process (kill -19)
> * Wait about 60s to let the standby transition to active
> * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to 
> active.
> What was noticed is that sometimes the call to get the service state of nn2 
> got a socket timeout exception.



