[jira] [Comment Edited] (HDFS-13442) Ozone: Handle Datanode Registration failure

Hanisha Koneru (JIRA) Tue, 17 Apr 2018 14:35:23 -0700

    [ https://issues.apache.org/jira/browse/HDFS-13442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441526#comment-16441526 ]


Hanisha Koneru edited comment on HDFS-13442 at 4/17/18 9:34 PM:
----------------------------------------------------------------

Thanks for the review [~anu].

This patch only modifies the case when we get _errorNodeNotPermitted_. This 
happens when the node is able to contact the SCM but SCM does not register the 
node. 
{quote}if the data nodes boot up earlier than SCM we would not want the data 
nodes to go silent after 10 tries
{quote}
In this case, the datanode keeps retrying because the EndPointTask state 
remains {{REGISTER}}. In the code snippet below, if the datanode does not get a 
response from SCM, it catches the IOException and logs it, if needed.
{code:java}
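    try {
      // Try to register this datanode with SCM.
      SCMRegisteredCmdResponseProto response = rpcEndPoint.getEndPoint()
          .register(datanodeDetails.getProtoBufMessage(),
              conf.getStrings(ScmConfigKeys.OZONE_SCM_NAMES));
      ...
      ...
      processResponse(response);
    } catch (IOException ex) {
      // No response from SCM: log if needed. The state is not changed here,
      // so it remains REGISTER and the datanode retries.
      rpcEndPoint.logIfNeeded(ex);
    }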
{code}
{quote}Also, in the case where we get the error errorNodeNotPermitted, should 
we shut down the data node and create some kind of error record on SCM so we 
can get that info back from SCM? I am also ok with the current approach where 
we will let the system slowly time out.
{quote}
I think we should let the DN make a few retries before shutting it down.
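
To make the two cases above concrete, here is a minimal, self-contained sketch. It is not the code from the patch: the class {{RegistrationRetrySketch}}, the {{Scm}} interface, the {{maxRejections}} limit and the {{run}} helper are all hypothetical names used only for illustration, and the transition to {{HEARTBEAT}} on success is an assumption. It only shows the intended behaviour: an IOException leaves the state at {{REGISTER}} so the datanode keeps retrying, while repeated _errorNodeNotPermitted_ responses are counted and stop the datanode after a configurable number of attempts.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;

public class RegistrationRetrySketch {
  /** Simplified endpoint states (assumed subset, for illustration only). */
  public enum State { REGISTER, HEARTBEAT, SHUTDOWN }

  /** Simplified registration outcomes returned by SCM. */
  public enum Result { SUCCESS, ERROR_NODE_NOT_PERMITTED }

  /** Hypothetical stand-in for the register RPC to SCM. */
  public interface Scm {
    Result register() throws IOException;
  }

  private final Scm scm;
  private final int maxRejections; // hypothetical configurable limit
  private int rejections = 0;
  private State state = State.REGISTER;

  public RegistrationRetrySketch(Scm scm, int maxRejections) {
    this.scm = scm;
    this.maxRejections = maxRejections;
  }

  public State getState() {
    return state;
  }

  /** Makes one registration attempt if the task is still in REGISTER. */
  public void attempt() {
    if (state != State.REGISTER) {
      return;
    }
    try {
      Result result = scm.register();
      if (result == Result.SUCCESS) {
        state = State.HEARTBEAT;
      } else if (++rejections >= maxRejections) {
        // SCM was reachable but refused the node too many times.
        state = State.SHUTDOWN;
      }
    } catch (IOException ex) {
      // SCM unreachable: state stays REGISTER, so the next attempt retries.
      System.err.println("register failed: " + ex.getMessage());
    }
  }

  /** Replays scripted outcomes; a null entry simulates an unreachable SCM. */
  public static State run(int maxRejections, Result... outcomes) {
    Iterator<Result> it = Arrays.asList(outcomes).iterator();
    RegistrationRetrySketch task = new RegistrationRetrySketch(() -> {
      Result r = it.next();
      if (r == null) {
        throw new IOException("SCM unreachable");
      }
      return r;
    }, maxRejections);
    for (int i = 0; i < outcomes.length; i++) {
      task.attempt();
    }
    return task.getState();
  }

  public static void main(String[] args) {
    // SCM comes up late: two failed calls, then success.
    System.out.println(run(3, null, null, Result.SUCCESS));
    // SCM rejects the node three times with a limit of three.
    System.out.println(run(3, Result.ERROR_NODE_NOT_PERMITTED,
        Result.ERROR_NODE_NOT_PERMITTED, Result.ERROR_NODE_NOT_PERMITTED));
  }
}
```

In this sketch the rejection counter is only incremented when SCM actually answers, so a datanode that boots before SCM never goes silent; only explicit rejections count toward the limit.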


was (Author: hanishakoneru):
Thanks for the review [~anu].

This patch only modifies the case when we get _errorNodeNotPermitted_. This 
happens when the node is able to contact the SCM but SCM does not register the 
node. 
{quote}if the data nodes boot up earlier than SCM we would not want the data 
nodes to do silent after 10 tries
{quote}
In this case, the datanode keeps retrying as the EndPointTask state remains as 
{{HEARTBEAT}}. In the code snippet below, if the datanode does not get a 
response from SCM, it catches the exception and logs it, if needed.
{code:java}
    try {
      SCMRegisteredCmdResponseProto response = rpcEndPoint.getEndPoint()
          .register(datanodeDetails.getProtoBufMessage(),
              conf.getStrings(ScmConfigKeys.OZONE_SCM_NAMES));
      ...
      ...
      processResponse(response);
    } catch (IOException ex) {
      rpcEndPoint.logIfNeeded(ex);
    }
{code}
{quote}also in the case, we get the error, errorNodeNotPermitted, should we 
shut down the data node and create some kind of error record on SCM so we can 
get that info back from SCM? I am also ok with the current approach where we 
will let the system slowly go time out.
{quote}
I think we should let the DN make a few retries before shutting it down.

> Ozone: Handle Datanode Registration failure
> -------------------------------------------
>
>                 Key: HDFS-13442
>                 URL: https://issues.apache.org/jira/browse/HDFS-13442
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ozone
>    Affects Versions: HDFS-7240
>            Reporter: Hanisha Koneru
>            Assignee: Hanisha Koneru
>            Priority: Major
>         Attachments: HDFS-13442-HDFS-7240.001.patch
>
>
> If a datanode is not able to register itself, we need to handle that 
> correctly. 
> If the number of unsuccessful attempts to register with the SCM exceeds a 
> configurable max number, the datanode should not make any more attempts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
