[ 
https://issues.apache.org/jira/browse/HDFS-15738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17252422#comment-17252422
 ] 

Janus Chow commented on HDFS-15738:
-----------------------------------

[~ayushtkn] Thanks for the reply. We first tried to solve this issue by throwing 
_ObserverRetryOnActiveException_, but then we came up with the following 
thoughts:
 * This issue won't happen under normal circumstances, since it results from a 
misoperation during cluster maintenance: the NameNode is transitioned to 
Observer state before its block reports are done. It is like manually 
transitioning a NameNode to Active state. The difference is that with an Active 
NameNode the operator is more careful and waits for the NameNode to leave safe 
mode before doing other operations. Either way, this kind of operation should 
not be encouraged.
 * If we solve this issue with solution 1, that is, throwing 
ObserverRetryOnActiveException, the NameNode is still transitioned to Observer 
state and the requests are redirected to the Active NameNode. The operator would 
not learn that the operation is discouraged, so he may keep doing it and spread 
the wrong idea that it is OK to transition a NameNode to Observer state before 
it leaves safe mode, since nothing appears to be wrong with the cluster.
 * Solution 1 is a good solution for end-users, since it solves the issue on 
the server side without them noticing. For operators, though, I think we should 
hold a higher standard: they should know that a NameNode in startup safe mode 
should not be transitioned to either Active or Observer state.

These are our thoughts, based on some internal considerations and our 
expectations of our operators. Hope to get your advice.
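
To make the contrast between the two solutions concrete, here is a minimal toy 
sketch in Java. It is not the code from the attached patch and does not use 
Hadoop's real classes: ToyNameNode, HaState, RetryOnActiveException (a stand-in 
for ObserverRetryOnActiveException) and the use of IllegalStateException are 
all assumptions made for illustration.

```java
// Toy model only: every class here is hypothetical, not Hadoop's NameNode code.
class ObserverTransitionSketch {

    enum HaState { STANDBY, OBSERVER, ACTIVE }

    /** Stand-in for ObserverRetryOnActiveException (solution 1). */
    static class RetryOnActiveException extends RuntimeException {
        RetryOnActiveException(String msg) { super(msg); }
    }

    static class ToyNameNode {
        private HaState state = HaState.STANDBY;
        private boolean inStartupSafeMode;
        // true models solution 2, false models solution 1
        private final boolean forbidTransitionInSafeMode;

        ToyNameNode(boolean inStartupSafeMode, boolean forbidTransitionInSafeMode) {
            this.inStartupSafeMode = inStartupSafeMode;
            this.forbidTransitionInSafeMode = forbidTransitionInSafeMode;
        }

        HaState getState() { return state; }

        void leaveSafeMode() { inStartupSafeMode = false; }

        /** Solution 2: fail at transition time, so the operator sees the error. */
        void transitionToObserver() {
            if (forbidTransitionInSafeMode && inStartupSafeMode) {
                // Placeholder exception type; the real patch may use another one.
                throw new IllegalStateException(
                    "NameNode is in startup safe mode; refusing transition to Observer");
            }
            state = HaState.OBSERVER;
        }

        /** Solution 1: the transition succeeded, so reads are bounced to Active. */
        String getBlockLocations(String path) {
            if (state == HaState.OBSERVER && inStartupSafeMode) {
                throw new RetryOnActiveException("Observer in safe mode, retry on Active");
            }
            return "locations-of:" + path;
        }
    }

    public static void main(String[] args) {
        // Solution 1: the operator's transition succeeds; only clients see a redirect.
        ToyNameNode silent = new ToyNameNode(true, false);
        silent.transitionToObserver();
        try {
            silent.getBlockLocations("/data/file");
        } catch (RetryOnActiveException e) {
            System.out.println("solution 1, client side: " + e.getMessage());
        }

        // Solution 2: the operator's transition itself is rejected.
        ToyNameNode strict = new ToyNameNode(true, true);
        try {
            strict.transitionToObserver();
        } catch (IllegalStateException e) {
            System.out.println("solution 2, operator side: " + e.getMessage());
        }
    }
}
```

In this sketch the only difference is who observes the failure: with solution 1 
the operator's command succeeds and the redirect is invisible to him, while 
with solution 2 the operator gets the error directly and has to wait for safe 
mode to end before retrying.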

> Forbid the transition to Observer state when NameNode is in StartupSafeMode
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-15738
>                 URL: https://issues.apache.org/jira/browse/HDFS-15738
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Janus Chow
>            Assignee: Janus Chow
>            Priority: Major
>         Attachments: HDFS-15738.001.patch
>
>
> Currently, when a _getBlockLocation_ request comes to an Observer NameNode 
> that is in safe mode, the NameNode checks whether the result is empty and, if 
> so, replies to the client with a _RetriableException_, telling the client to 
> retry the request later.
> If the Observer NameNode is in startup safe mode, the client has to wait for 
> the Observer NameNode to leave safe mode. For a big cluster, this may mean a 
> long wait for the client. We met this problem in our cluster, where the client 
> had to wait about 30 minutes before the service was back to normal.
> The reason for this situation is that the NameNode becomes an Observer while 
> it is still in safe mode receiving the DataNodes' block reports. Here are two 
> solutions for this issue:
>  # Throw _ObserverRetryOnActiveException_ when the Observer NameNode is in 
> startup safe mode, redirecting the user's requests to the Active NN.
>  # Forbid the transition to Observer state when the cluster maintainer tries 
> to perform it while the NameNode is in startup safe mode.
> We chose the second solution because the first one would abet the bad 
> practice of transitioning an NN to Observer while it's not ready for real 
> service.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

