[ https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Glen Geng updated HDDS-4386:
----------------------------
    Description: 
In Tencent's production environment, a while after Recon was started, we got 
warnings that all DNs had become stale/dead on the SCM side. After Recon was 
killed, all DNs became healthy again within a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:
{code:java}
executorService = Executors.newFixedThreadPool(
    getEndPointTaskThreadPoolSize(),
    new ThreadFactoryBuilder()
        .setNameFormat("Datanode State Machine Task Thread - %d").build());

private int getEndPointTaskThreadPoolSize() {
  // TODO(runzhiwang): current only support one recon, if support multiple
  //  recon in future reconServerCount should be the real number of recon
  int reconServerCount = 1;
  int totalServerCount = reconServerCount;

  try {
    totalServerCount += HddsUtils.getSCMAddresses(conf).size();
  } catch (Exception e) {
    LOG.error("Fail to get scm addresses", e);
  }

  return totalServerCount;
}
{code}
Meanwhile, the current Recon has performance issues: after running for hours, 
it becomes slower and slower and eventually crashes with an OOM. 

2) Communication between the DN and Recon soon exhausts all the threads in 
DatanodeStateMachine.executorService, leaving no threads available for the DN 
to talk to SCM. 

3) All DNs become stale/dead on the SCM side.
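
The starvation in steps 1) to 3) can be reproduced in isolation. The following 
is a minimal sketch, not Ozone code: the class and method names are 
hypothetical, a latch that is never released stands in for a Recon that does 
not answer, and the pool size of 2 mirrors one Recon plus one SCM address.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; it only models the shared fixed-size pool.
public class SharedPoolStarvationDemo {

  /**
   * Returns true if a simulated SCM heartbeat starts within waitMillis while
   * stuckReconCalls simulated Recon calls block threads of one shared pool.
   */
  public static boolean scmHeartbeatRuns(int poolSize, int stuckReconCalls,
      long waitMillis) {
    ExecutorService shared = Executors.newFixedThreadPool(poolSize);
    // Stays closed during the check: models a Recon that never answers.
    CountDownLatch reconReleased = new CountDownLatch(1);
    CountDownLatch scmHeartbeatSent = new CountDownLatch(1);
    try {
      for (int i = 0; i < stuckReconCalls; i++) {
        shared.execute(() -> awaitQuietly(reconReleased));
      }
      // The SCM task is queued behind the Recon tasks in the same pool.
      shared.execute(scmHeartbeatSent::countDown);
      return scmHeartbeatSent.await(waitMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      reconReleased.countDown();
      shared.shutdownNow();
    }
  }

  private static void awaitQuietly(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    // Both threads are held by Recon calls, so the SCM heartbeat never starts.
    System.out.println(scmHeartbeatRuns(2, 2, 300));   // prints "false"
    // Only one thread is held, so the SCM heartbeat gets the other one.
    System.out.println(scmHeartbeatRuns(2, 1, 2000));  // prints "true"
  }
}
```

With both pool threads blocked, the SCM task sits in the queue indefinitely, 
which from the SCM side looks like missing heartbeats.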

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon, so 
a slow Recon won't interfere with communication between the DN and SCM, and 
vice versa.
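
A minimal sketch of the isolation this gives, under stated assumptions: the 
Endpoint class below is a hypothetical stand-in for EndpointStateMachine, and 
the single-thread executor and thread name are illustrative choices, not 
necessarily what the patch uses.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; it only models one executor per endpoint.
public class PerEndpointPoolDemo {

  /** Hypothetical stand-in for an EndpointStateMachine owning its executor. */
  static final class Endpoint implements AutoCloseable {
    private final ExecutorService executor;

    Endpoint(String name) {
      // Assumption: one thread per endpoint; the thread name is illustrative.
      executor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "Endpoint Task Thread - " + name);
        t.setDaemon(true);
        return t;
      });
    }

    void submit(Runnable task) {
      executor.execute(task);
    }

    @Override
    public void close() {
      executor.shutdownNow();
    }
  }

  /** Returns true if the SCM heartbeat starts while Recon calls are stuck. */
  public static boolean scmHeartbeatRunsWhileReconIsStuck(int stuckReconCalls,
      long waitMillis) {
    CountDownLatch reconReleased = new CountDownLatch(1);
    CountDownLatch scmHeartbeatSent = new CountDownLatch(1);
    try (Endpoint recon = new Endpoint("recon");
         Endpoint scm = new Endpoint("scm")) {
      for (int i = 0; i < stuckReconCalls; i++) {
        recon.submit(() -> awaitQuietly(reconReleased));
      }
      // The SCM task runs on the SCM endpoint's own thread.
      scm.submit(scmHeartbeatSent::countDown);
      return scmHeartbeatSent.await(waitMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      reconReleased.countDown();
    }
  }

  private static void awaitQuietly(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    System.out.println(scmHeartbeatRunsWhileReconIsStuck(5, 2000));  // prints "true"
  }
}
```

However many Recon calls pile up, they queue only on the Recon endpoint's 
executor, so the SCM endpoint always has a thread to run on.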

 

*P.S.*

The first version of DatanodeStateMachine.executorService was a cached thread 
pool. With a slow SCM/Recon, more and more threads would be created, and the 
DN would eventually hit an OOM once tens of thousands of threads existed.
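
The difference between the two pool types can be seen with plain 
java.util.concurrent executors. This is a small sketch with hypothetical class 
and method names; blocked tasks stand in for calls to a slow SCM/Recon, and 50 
tasks are used instead of tens of thousands.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

// Hypothetical demo class comparing thread growth of two pool types.
public class CachedPoolGrowthDemo {

  /** Submits stuckCalls blocking tasks and returns the pool's thread count. */
  public static int threadsCreated(ThreadPoolExecutor pool, int stuckCalls) {
    CountDownLatch released = new CountDownLatch(1);
    try {
      for (int i = 0; i < stuckCalls; i++) {
        pool.execute(() -> {
          try {
            released.await();  // models a call that never returns
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      return pool.getPoolSize();
    } finally {
      released.countDown();
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    // A cached pool starts a new thread for every task that finds no idle one.
    System.out.println(threadsCreated(
        (ThreadPoolExecutor) Executors.newCachedThreadPool(), 50));  // prints "50"
    // A fixed pool stops at its size and queues the remaining tasks.
    System.out.println(threadsCreated(
        (ThreadPoolExecutor) Executors.newFixedThreadPool(2), 50));  // prints "2"
  }
}
```

The cached pool trades starvation for unbounded thread growth, while the fixed 
pool does the opposite; neither isolates one endpoint from another.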

  was:
In Tencent's production environment, a while after Recon was started, we got 
warnings that all DNs had become stale/dead on the SCM side. After Recon was 
killed, all DNs became healthy again within a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:
{code:java}
executorService = Executors.newFixedThreadPool(
    getEndPointTaskThreadPoolSize(),
    new ThreadFactoryBuilder()
        .setNameFormat("Datanode State Machine Task Thread - %d").build());

private int getEndPointTaskThreadPoolSize() {
  // TODO(runzhiwang): current only support one recon, if support multiple
  //  recon in future reconServerCount should be the real number of recon
  int reconServerCount = 1;
  int totalServerCount = reconServerCount;

  try {
    totalServerCount += HddsUtils.getSCMAddresses(conf).size();
  } catch (Exception e) {
    LOG.error("Fail to get scm addresses", e);
  }

  return totalServerCount;
}
{code}
Meanwhile, the current Recon has performance issues: after running for hours, 
it becomes slower and slower and eventually crashes with an OOM. 

2) Communication between the DN and Recon soon exhausts all the threads in 
DatanodeStateMachine.executorService, leaving no threads available for the DN 
to talk to SCM. 

3) All DNs become stale/dead on the SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.

 

*P.S.*

The first version of DatanodeStateMachine.executorService was a cached thread 
pool. With a slow SCM/Recon, more and more threads would be created, and the 
DN would eventually hit an OOM once tens of thousands of threads existed.


> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon
> -------------------------------------------------------------------------
>
>                 Key: HDDS-4386
>                 URL: https://issues.apache.org/jira/browse/HDDS-4386
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>            Reporter: Glen Geng
>            Assignee: Glen Geng
>            Priority: Blocker



