[jira] [Commented] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended

Jing Zhao (JIRA) Wed, 19 Mar 2014 15:58:25 -0700

    [ https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941117#comment-13941117 ]


Jing Zhao commented on HDFS-6089:
---------------------------------

bq. This is because if we don't have further operations it is possible that SBN 
will wait a long time to tail that part of edits which is in an in-progress 
segment.
bq. In this scenario, the ANN will keep rolling every 2mins, generating a lot 
of edit log segments that aren't being cleared out.
Hmm, actually what I thought yesterday was not correct. Yes, we cannot do auto 
rolling based simply on time, for exactly the reason [~andrew.wang] 
pointed out.

Hopefully this is my last question; I just want to make sure: the current SBN 
auto roller can cause the same issue ("a lot of edit log segments aren't 
being cleared out") if checkpoints are broken (but the SBN is not 
down), right?
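
To make the accumulation concern concrete, here is a toy model, not NameNode code. It assumes that every roll finalizes exactly one segment and that a successful checkpoint lets all finalized segments be purged; the real retention rules are more involved. The function name and parameters are made up for illustration.

```python
def retained_segments(duration_s, roll_interval_s, checkpoint_interval_s=None):
    """Count finalized edit log segments still on disk after duration_s seconds.

    Toy assumptions (not taken from the HDFS code):
      * a roll happens every roll_interval_s and finalizes one segment
      * a successful checkpoint happens every checkpoint_interval_s and
        allows every finalized segment to be purged
      * checkpoint_interval_s=None models broken checkpoints
    """
    retained = 0
    for t in range(1, duration_s + 1):
        if t % roll_interval_s == 0:
            retained += 1  # time-based roll finalizes a segment
        if checkpoint_interval_s and t % checkpoint_interval_s == 0:
            retained = 0   # checkpoint succeeded, old segments can go
    return retained


# One hour of rolling every 2 minutes with no working checkpoint
print(retained_segments(3600, 120))
# Same roll interval with a checkpoint every 10 minutes
print(retained_segments(3500, 120, 600))
```

In this model the count grows without bound whenever checkpoints stop, regardless of which NN triggers the roll, which is the case the question above is asking about.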

Anyway, I will post a patch to add an RPC timeout later.
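
As a generic illustration of why a timeout matters here (this is plain Python sockets, not the Hadoop RPC client and not the patch): a listener that never accepts stands in for the suspended NN, on the assumption that the kernel still completes the TCP handshake while the process itself cannot reply. The payload string and the `probe`/`demo` names are placeholders.

```python
import socket


def probe(host, port, timeout_s):
    """Return 'ok' if recv returns within timeout_s, else 'timed out'."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            # Placeholder payload, not the real Hadoop RPC wire format.
            conn.sendall(b"getServiceState\n")
            conn.recv(1)  # without a timeout this call could block forever
            return "ok"
    except socket.timeout:
        return "timed out"


def demo():
    # A listening socket whose owner never calls accept() or send():
    # the connection is established, but no reply ever arrives.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    try:
        return probe("127.0.0.1", server.getsockname()[1], 0.5)
    finally:
        server.close()


print(demo())
```

With the timeout in place the caller gets control back and can give up on the unresponsive peer instead of waiting indefinitely.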

> Standby NN while transitioning to active throws a connection refused error 
> when the prior active NN process is suspended
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-6089
>                 URL: https://issues.apache.org/jira/browse/HDFS-6089
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.0
>            Reporter: Arpit Gupta
>            Assignee: Jing Zhao
>         Attachments: HDFS-6089.000.patch, HDFS-6089.001.patch
>
>
> The following scenario was tested:
> * Determine Active NN and suspend the process (kill -19)
> * Wait about 60s to let the standby transition to active
> * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to 
> active.
> What was noticed was that sometimes the call to get the service state of nn2 got 
> a socket timeout exception.
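
The scenario above could be scripted roughly as follows. This is a sketch, assuming a Linux host where signal 19 is SIGSTOP and where the `hdfs haadmin -getServiceState` command is on the PATH; `run_scenario` and its injectable parameters are hypothetical helpers, not part of any test suite attached to this issue.

```python
import os
import signal
import subprocess
import time


def get_service_state(nn_id):
    """Ask haadmin for the HA state of one NameNode, e.g. 'active' or 'standby'."""
    result = subprocess.run(
        ["hdfs", "haadmin", "-getServiceState", nn_id],
        capture_output=True, text=True, timeout=30,  # do not hang on a stuck NN
    )
    return result.stdout.strip()


def run_scenario(active_pid, standby_id, wait_s=60,
                 suspend=lambda pid: os.kill(pid, signal.SIGSTOP),
                 state_of=get_service_state, sleep=time.sleep):
    """Suspend the active NN, wait, then report whether the standby took over.

    suspend, state_of and sleep are injectable so the flow can be
    exercised without a real cluster.
    """
    suspend(active_pid)   # same effect as `kill -19 <pid>` on Linux
    sleep(wait_s)         # give the standby time to transition
    return state_of(standby_id) == "active"


# Dry run with stand-ins instead of a real cluster
suspended = []
ok = run_scenario(1234, "nn2", suspend=suspended.append,
                  state_of=lambda nn: "active", sleep=lambda s: None)
print(ok, suspended)
```

Against a real cluster, the call would be `run_scenario(<pid of active NN>, "nn2")`, with the 30 second subprocess timeout guarding the state check.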



--
This message was sent by Atlassian JIRA
(v6.2#6252)
