[jira] [Updated] (HDDS-4386) Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon

Glen Geng (Jira) Thu, 22 Oct 2020 21:52:29 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Glen Geng updated HDDS-4386:
----------------------------
    Description: 
In the Tencent production environment, some time after Recon was started, we 
got warnings that all DNs had become stale/dead on the SCM side. After Recon 
was killed, all DNs became healthy again within a very short time.

 

*The root cause is:*

1) The EndpointStateMachine for SCM and the one for Recon share the thread 
pool created by DatanodeStateMachine, which is a fixed-size thread pool:
{code:java}
executorService = Executors.newFixedThreadPool(
    getEndPointTaskThreadPoolSize(),
    new ThreadFactoryBuilder()
        .setNameFormat("Datanode State Machine Task Thread - %d").build());
{code}
Meanwhile, the current Recon has a performance issue: after running for hours, 
it became slower and slower, and eventually crashed due to OOM. 

2) The communication between the DN and Recon soon exhausts all the threads in 
DatanodeStateMachine.executorService, leaving no threads available for the DN 
to talk to SCM. 

3) All DNs become stale/dead on the SCM side.
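
The starvation described in steps 1) to 3) can be reproduced in isolation. The 
following is only a minimal sketch, not the Ozone code: the class 
SharedPoolStarvationDemo, the method scmTaskRuns, and the pool size of 2 are 
all hypothetical, and a latch stands in for a Recon that never answers.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; it only illustrates the failure mode.
final class SharedPoolStarvationDemo {

  /**
   * Returns true if a simulated SCM task completes within the timeout
   * while simulated Recon tasks are blocked.
   */
  static boolean scmTaskRuns(boolean sharedPool, long timeoutMillis) {
    int poolSize = 2; // assumed size, chosen only for the demo
    ExecutorService reconPool = Executors.newFixedThreadPool(poolSize);
    ExecutorService scmPool =
        sharedPool ? reconPool : Executors.newFixedThreadPool(poolSize);
    CountDownLatch reconStuck = new CountDownLatch(1);
    try {
      // A slow Recon: every task blocks, so all threads of its pool are held.
      for (int i = 0; i < poolSize; i++) {
        reconPool.submit(() -> {
          reconStuck.await();
          return null;
        });
      }
      // The SCM task can run only if some thread in its pool is free.
      Future<String> heartbeat = scmPool.submit(() -> "heartbeat sent");
      try {
        heartbeat.get(timeoutMillis, TimeUnit.MILLISECONDS);
        return true;
      } catch (TimeoutException e) {
        return false; // the task is still waiting in the queue
      } catch (InterruptedException | ExecutionException e) {
        throw new IllegalStateException(e);
      }
    } finally {
      reconStuck.countDown();
      reconPool.shutdownNow();
      scmPool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("shared pool:    " + scmTaskRuns(true, 200));
    System.out.println("separate pools: " + scmTaskRuns(false, 2000));
  }
}
```

With a shared pool the SCM task stays queued behind the blocked Recon tasks, 
which matches the missed heartbeats seen on the SCM side. With separate pools 
it completes immediately.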

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.
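
A rough sketch of that idea is shown below. It is not the actual patch: the 
class EndpointSketch, its constructor arguments, and the thread name format 
are assumptions made for illustration, and a plain ThreadFactory is used 
instead of ThreadFactoryBuilder to keep the example self-contained.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an endpoint that owns its own thread pool.
final class EndpointSketch implements AutoCloseable {
  private final ExecutorService executor;

  EndpointSketch(String endpointName, int threads) {
    AtomicInteger counter = new AtomicInteger();
    // Name the threads after the endpoint so thread dumps show which
    // endpoint is holding them.
    ThreadFactory factory = runnable -> {
      Thread thread = new Thread(runnable,
          "Endpoint Task Thread - " + endpointName + " - "
              + counter.getAndIncrement());
      thread.setDaemon(true);
      return thread;
    };
    this.executor = Executors.newFixedThreadPool(threads, factory);
  }

  /** Runs a trivial task on this endpoint's pool and returns the thread name. */
  String taskThreadName() {
    Callable<String> task = () -> Thread.currentThread().getName();
    try {
      return executor.submit(task).get();
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    }
  }

  @Override
  public void close() {
    executor.shutdownNow();
  }

  public static void main(String[] args) {
    try (EndpointSketch scm = new EndpointSketch("SCM", 1);
         EndpointSketch recon = new EndpointSketch("Recon", 1)) {
      System.out.println(scm.taskThreadName());
      System.out.println(recon.taskThreadName());
    }
  }
}
```

Because each endpoint submits only to its own executor, a stuck Recon can 
block at most the threads of the Recon pool, and the SCM endpoint keeps its 
threads.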

 

*BTW:*

The first edition for DatanodeStateMachine.executorService is 

  was:
In the Tencent production environment, some time after Recon was started, we 
got warnings that all DNs had become stale/dead on the SCM side. After Recon 
was killed, all DNs became healthy again within a very short time.

 

*The root cause is:*

1) The EndpointStateMachine for SCM and the one for Recon share the thread 
pool created by DatanodeStateMachine, which is a fixed-size thread pool:
{code:java}
executorService = Executors.newFixedThreadPool(
    getEndPointTaskThreadPoolSize(),
    new ThreadFactoryBuilder()
        .setNameFormat("Datanode State Machine Task Thread - %d").build());
{code}
Meanwhile, the current Recon has a performance issue: after running for hours, 
it got stuck and crashed due to OOM. 

2) The communication between the DN and Recon soon exhausts all the threads in 
DatanodeStateMachine.executorService, leaving no threads available for the DN 
to talk to SCM. 

3) All DNs become stale/dead on the SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.

 

*BTW:*

The first edition for DatanodeStateMachine.executorService is 


> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon
> -------------------------------------------------------------------------
>
>                 Key: HDDS-4386
>                 URL: https://issues.apache.org/jira/browse/HDDS-4386
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>            Reporter: Glen Geng
>            Assignee: Glen Geng
>            Priority: Blocker



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

