[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083334#comment-16083334
 ] 

Weiwei Yang commented on HDFS-12098:
------------------------------------

Hi [~anu]

The difference I noticed is that in the mini cluster the RPC seems to time out 
directly without retrying; I am not sure why the retry policy was not applied. 
On my setup I saw the following retries in the getVersion call:

{noformat}
17/07/11 19:27:05 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/07/11 19:27:06 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/07/11 19:27:07 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
{noformat}

These retries keep the thread alive even after the task execution is done. I 
will try to reproduce this in a test case.
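
A minimal sketch of this effect, in plain Java rather than the actual Hadoop 
ipc.Client code. The class and method names below are hypothetical, and the 
always-failing tryConnect() is an assumption standing in for a server that is 
not up yet. It only shows how a fixed-sleep retry loop can keep a worker 
thread busy after the caller has already stopped waiting for the task:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; this is not Hadoop's retry implementation.
class FixedSleepRetrySketch {

    // Assumption: the server is down, so every connect attempt fails.
    static boolean tryConnect() {
        return false;
    }

    // Lower bound on the time spent sleeping between retries
    // (connect attempt time itself is not counted).
    static long minSleepWindowMillis(int maxRetries, long sleepMillis) {
        return maxRetries * sleepMillis;
    }

    // Retry up to maxRetries times with a fixed sleep between attempts.
    // Returns the number of retries that were performed.
    static int connectWithFixedSleep(int maxRetries, long sleepMillis)
            throws InterruptedException {
        int retries = 0;
        while (!tryConnect()) {
            if (retries >= maxRetries) {
                break; // retry budget exhausted, give up
            }
            Thread.sleep(sleepMillis);
            retries++;
        }
        return retries;
    }

    // Returns true if the worker thread was still retrying when the
    // caller's own timeout expired.
    static boolean workerOutlivesCaller(int maxRetries, long sleepMillis,
                                        long callerTimeoutMillis) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Integer> task = () -> connectWithFixedSleep(maxRetries, sleepMillis);
            Future<Integer> result = pool.submit(task);
            try {
                result.get(callerTimeoutMillis, TimeUnit.MILLISECONDS);
                return false; // worker finished before the caller gave up
            } catch (TimeoutException e) {
                return !result.isDone(); // caller gave up, worker still in its retry loop
            }
        } finally {
            pool.shutdownNow(); // interrupt the sleeping worker so the sketch exits
        }
    }

    public static void main(String[] args) throws Exception {
        // Values from the log above: maxRetries=10, sleepTime=1000 ms.
        System.out.println("min sleep window (ms): " + minSleepWindowMillis(10, 1000));
        // Scaled-down timings so the demo finishes quickly.
        System.out.println("worker outlives caller: " + workerOutlivesCaller(10, 50, 100));
    }
}
```

With the values from the log (maxRetries=10, sleepTime=1000 ms), the sleeps 
alone add up to at least 10 seconds, so any caller that waits less than that 
would return while the retrying thread is still alive.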

Thank you for looking at this.

> Ozone: Datanode is unable to register with scm if scm starts later
> ------------------------------------------------------------------
>
>                 Key: HDFS-12098
>                 URL: https://issues.apache.org/jira/browse/HDFS-12098
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, ozone, scm
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>            Priority: Critical
>         Attachments: disabled-scm-test.patch, HDFS-12098-HDFS-7240.001.patch, 
> HDFS-12098-HDFS-7240.002.patch, Screen Shot 2017-07-11 at 4.58.08 PM.png, 
> thread_dump.log
>
>
> Steps to reproduce
> # Start the datanode
> # Wait and check the datanode state; it has connection issues, which is expected
> # Start SCM, expecting that the datanode can connect to SCM and the state 
> machine transitions to RUNNING. However, the state actually transitions to 
> SHUTDOWN and the datanode enters chill mode.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

