[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16082656#comment-16082656
 ] 

Anu Engineer commented on HDFS-12098:
-------------------------------------

[~cheersyang] Thanks for reporting this and posting a patch. Before commenting 
on it, I would like to simulate this in our unit tests and then test with and 
without your patch. I am going to modify MiniOzoneCluster and build it with 
flags called *disableSCM* and *disableKSM*, so we can simulate SCM or KSM being 
down. That will let me explore the behavior in greater detail.

Some thoughts on this patch: if my understanding is correct, isn't the root 
issue that we time out but forget to tell the running thread that we have 
already timed out? I was wondering whether we could add an AtomicBoolean to 
each task that indicates whether it has timed out. Then, when the thread 
returns, it can see that the caller has timed out and exit. Do you think that 
would address this issue?
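
A minimal sketch of the AtomicBoolean idea, to make the suggestion concrete. 
Everything here is hypothetical and not taken from the Ozone code base: the 
class name TimedOutFlagSketch, the RegisterTask wrapper, and the 
"REGISTERED" / "ABANDONED" results are placeholders, and Thread.sleep stands in 
for a slow call to SCM. The caller sets the flag when its wait times out, and 
the worker checks the flag once its slow call returns.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

public class TimedOutFlagSketch {

    /** Hypothetical task wrapper that carries its own timed-out flag. */
    static final class RegisterTask implements Callable<String> {
        private final AtomicBoolean timedOut = new AtomicBoolean(false);
        private final long workMillis;

        RegisterTask(long workMillis) {
            this.workMillis = workMillis;
        }

        /** Called by the caller once it has given up waiting. */
        void markTimedOut() {
            timedOut.set(true);
        }

        @Override
        public String call() throws InterruptedException {
            Thread.sleep(workMillis); // stand-in for a slow RPC to SCM
            if (timedOut.get()) {
                // The caller has already moved on, so do not touch shared state.
                return "ABANDONED";
            }
            return "REGISTERED";
        }
    }

    public static String runWithTimeout(long workMillis, long timeoutMillis)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            RegisterTask task = new RegisterTask(workMillis);
            Future<String> future = pool.submit(task);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                task.markTimedOut();
                // Waiting again is only for the demo, to observe what the
                // worker decided after seeing the flag.
                return future.get();
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runWithTimeout(10, 2000)); // finishes before the timeout
        System.out.println(runWithTimeout(500, 50));  // caller times out first
    }
}
```

In real code the caller would not wait on the future a second time; it would 
simply schedule the next attempt, and the late worker would drop its result on 
its own. Whether this is enough for the datanode state machine still needs to 
be checked against the actual task code.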

The reason I am asking is that if we pursue the single-thread approach, we 
will have to create many state machines for the various tasks, such as 
handling many SCMs or running complex SCM commands. 

I am fine with that approach too, but this is something that I wanted us to 
consider.


> Ozone: Datanode is unable to register with scm if scm starts later
> ------------------------------------------------------------------
>
>                 Key: HDFS-12098
>                 URL: https://issues.apache.org/jira/browse/HDFS-12098
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, ozone, scm
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>            Priority: Critical
>         Attachments: HDFS-12098-HDFS-7240.001.patch, 
> HDFS-12098-HDFS-7240.002.patch, thread_dump.log
>
>
> Reproducing steps
> # Start datanode
> # Wait and check the datanode state; it has connection issues, which is expected
> # Start SCM. The datanode is expected to connect to the SCM and its state 
> machine to transition to RUNNING. In practice, its state transitions to 
> SHUTDOWN and the datanode enters chill mode.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
