[ 
https://issues.apache.org/jira/browse/HDDS-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

runzhiwang updated HDDS-3481:
-----------------------------
    Description: 
*What's the problem?*
As the screenshots show, starting at 2020-04-17 23:38:51, SCM asked 31 
datanodes to replicate container 2037, at 10-minute intervals. At 
2020-04-18 08:58:52, SCM found that container 2037 had 12 replicas, and it 
then asked 11 datanodes to delete container 2037. 
 !screenshot-1.png! 
 !screenshot-2.png! 
*What's the reason?*

SCM checks whether (container replica count + 
inflightReplication.get(containerId).size() - 
inflightDeletion.get(containerId).size()) is less than 3. If it is, SCM asks 
some datanode to replicate the container and adds the action to 
inflightReplication.get(containerId). A replication action times out after 10 
minutes. When an action times out, SCM removes it from 
inflightReplication.get(containerId), as the image shows. The value of 
(container replica count + inflightReplication.get(containerId).size() - 
inflightDeletion.get(containerId).size()) is then less than 3 again, so SCM 
asks another datanode to replicate the container.
Replicating a container takes a long time and sometimes cannot finish within 
10 minutes, so 31 datanodes ended up being asked to replicate the container, 
at 10-minute intervals. 19 of the 31 datanodes replicated the container from 
the same source datanode, which put heavy load on that source datanode and 
made replication even slower. The first replication actually took 4 hours to 
finish. 
 !screenshot-4.png! 
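
The loop described above can be reproduced with a small simulation. This is only a sketch under stated assumptions, not the actual SCM code: the class name ReplicationTimeoutSketch, the simulate method, and the Result type are hypothetical, the container is assumed to start with 2 replicas, every copy is assumed to take a fixed 240 minutes, SCM is assumed to run its check every 10 minutes, and inflightDeletion is assumed to stay empty. The numbers it produces therefore do not match the 31 datanodes seen in the cluster.

```java
import java.util.ArrayList;
import java.util.List;

public class ReplicationTimeoutSketch {
    static final int REPLICATION_FACTOR = 3;

    /** Outcome of one simulated run (hypothetical type, for illustration only). */
    static final class Result {
        final int commandsIssued;
        final int finalReplicas;

        Result(int commandsIssued, int finalReplicas) {
            this.commandsIssued = commandsIssued;
            this.finalReplicas = finalReplicas;
        }
    }

    static Result simulate(int initialReplicas, int copyMinutes,
                           int timeoutMinutes, int checkIntervalMinutes) {
        int replicas = initialReplicas;
        int issued = 0;
        int inflightDeletion = 0; // assumption: no deletions in flight
        // Start times of the actions SCM still tracks.
        List<Integer> inflightReplication = new ArrayList<>();
        // Start times of the copies datanodes are still performing.
        List<Integer> copiesRunning = new ArrayList<>();
        int forgetAfter = Math.min(copyMinutes, timeoutMinutes);

        for (int now = 0; ; now += checkIntervalMinutes) {
            final int t = now;
            // A finished copy becomes a replica even if SCM stopped tracking it.
            int before = copiesRunning.size();
            copiesRunning.removeIf(start -> t - start >= copyMinutes);
            replicas += before - copiesRunning.size();
            // SCM drops actions that have completed or timed out.
            inflightReplication.removeIf(start -> t - start >= forgetAfter);

            // The check from the description above.
            int deficit = REPLICATION_FACTOR
                - (replicas + inflightReplication.size() - inflightDeletion);
            for (int i = 0; i < deficit; i++) {
                inflightReplication.add(t);
                copiesRunning.add(t);
                issued++;
            }
            if (copiesRunning.isEmpty()) {
                return new Result(issued, replicas);
            }
        }
    }

    public static void main(String[] args) {
        Result shortTimeout = simulate(2, 240, 10, 10);
        Result longTimeout = simulate(2, 240, 300, 10);
        System.out.println("timeout 10 min:  commands=" + shortTimeout.commandsIssued
            + " replicas=" + shortTimeout.finalReplicas);
        System.out.println("timeout 300 min: commands=" + longTimeout.commandsIssued
            + " replicas=" + longTimeout.finalReplicas);
    }
}
```

Under these assumptions, a 10-minute timeout makes SCM issue 24 commands and leaves 26 replicas, while a timeout longer than the copy time issues a single command and ends with exactly 3 replicas. The sketch only illustrates why expired actions that are still running on the datanodes lead to over-replication. It does not model the extra slowdown caused by many datanodes reading from the same source.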

  was:
*What's the problem?*
As the screenshots show, starting at 2020-04-17 23:38:51, SCM asked 31 
datanodes to replicate container 2037, at 10-minute intervals. At 
2020-04-18 08:58:52, SCM found that container 2037 had 12 replicas, and it 
then asked 11 datanodes to delete container 2037. 
 !screenshot-1.png! 
 !screenshot-2.png! 
*What's the reason?*

SCM checks whether (container replica count + 
inflightReplication.get(containerId).size() - 
inflightDeletion.get(containerId).size()) is less than 3. If it is, SCM asks 
some datanode to replicate the container and adds the action to 
inflightReplication.get(containerId). A replication action times out after 10 
minutes. When an action times out, SCM removes it from 
inflightReplication.get(containerId), as the image shows. The value of 
(container replica count + inflightReplication.get(containerId).size() - 
inflightDeletion.get(containerId).size()) is then less than 3 again, so SCM 
asks another datanode to replicate the container.
Replicating a container takes a long time and sometimes cannot finish within 
10 minutes, so 31 datanodes ended up being asked to replicate the container, 
at 10-minute intervals. 19 of the 31 datanodes replicated the container from 
the same source datanode, which put heavy load on that source datanode and 
made replication even slower. The first replication actually took 4 hours to 
finish. 
 !screenshot-3.png! 


> SCM ask 31 datanodes to replicate the same container
> ----------------------------------------------------
>
>                 Key: HDDS-3481
>                 URL: https://issues.apache.org/jira/browse/HDDS-3481
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>            Reporter: runzhiwang
>            Assignee: runzhiwang
>            Priority: Major
>         Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, 
> screenshot-4.png
>
>



