[ https://issues.apache.org/jira/browse/HDDS-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nanda kumar updated HDDS-3481:
------------------------------
    Labels: Triaged pull-request-available  (was: TriagePending pull-request-available)

> SCM ask too many datanodes to replicate the same container
> ----------------------------------------------------------
>
>                 Key: HDDS-3481
>                 URL: https://issues.apache.org/jira/browse/HDDS-3481
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: SCM
>            Reporter: runzhiwang
>            Assignee: runzhiwang
>            Priority: Blocker
>              Labels: Triaged, pull-request-available
>         Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png,
> screenshot-4.png
>
>
> *What's the problem?*
> As the image shows, SCM asked 31 datanodes to replicate container 2037,
> every 10 minutes starting from 2020-04-17 23:38:51. At 2020-04-18 08:58:52,
> SCM found that container 2037 had 12 replicas, and then asked 11 datanodes
> to delete container 2037.
> !screenshot-1.png!
> !screenshot-2.png!
> *What's the reason?*
> SCM checks whether (container replica count +
> inflightReplication.get(containerId).size() -
> inflightDeletion.get(containerId).size()) is less than 3. If it is, SCM
> asks some datanode to replicate the container and adds the action to
> inflightReplication.get(containerId). The replication action times out
> after 10 minutes; when an action times out, SCM deletes it from
> inflightReplication.get(containerId), as the image shows. Then (container
> replica count + inflightReplication.get(containerId).size() -
> inflightDeletion.get(containerId).size()) is less than 3 again, and SCM
> asks another datanode to replicate the container.
> Because replicating a container takes a long time, it sometimes cannot
> finish within 10 minutes, so 31 datanodes ended up being asked to replicate
> the container, every 10 minutes. 19 of the 31 datanodes replicated the
> container from the same source datanode, which also put heavy pressure on
> that source datanode and made replication even slower.
> In fact, the first replication took 4 hours to finish.
> !screenshot-4.png!
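
The check and timeout described under "What's the reason" can be sketched as a small Java simulation. This is an illustrative model only, not the actual SCM code: the class ReplicationTimeoutSketch, its method names, the starting replica count of 1, the fixed 10-minute check interval, and the constant copy time are all assumptions made for the example. Only the formula (replicas + inflight replications - inflight deletions < 3) and the 10-minute action timeout come from the issue text.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical model of the behaviour described in the issue; not Ozone code.
final class ReplicationTimeoutSketch {
    static final int REQUIRED_REPLICAS = 3;

    // How many new replication commands the check from the issue would issue.
    static int missingReplicas(int replicas, int inflightReplication, int inflightDeletion) {
        int effective = replicas + inflightReplication - inflightDeletion;
        return Math.max(0, REQUIRED_REPLICAS - effective);
    }

    // Counts the datanodes asked to copy one container over totalMinutes,
    // assuming (hypothetically) 1 healthy replica at the start, a check every
    // 10 minutes, and a constant copy time for every datanode.
    static int datanodesAsked(long copyMinutes, long timeoutMinutes, long totalMinutes) {
        final long checkIntervalMinutes = 10;
        int replicas = 1;
        List<Long> running = new ArrayList<>();  // start times of copies still running on datanodes
        List<Long> inflight = new ArrayList<>(); // start times of actions SCM still tracks
        int asked = 0;

        for (long now = 0; now <= totalMinutes; now += checkIntervalMinutes) {
            // A finished copy adds a replica, even if SCM already forgot the action.
            for (Iterator<Long> it = running.iterator(); it.hasNext();) {
                long start = it.next();
                if (now - start >= copyMinutes) {
                    it.remove();
                    inflight.remove(Long.valueOf(start)); // no-op if it already timed out
                    replicas++;
                }
            }
            // Timed-out actions are dropped from the in-flight list, but the copy keeps running.
            final long t = now;
            inflight.removeIf(start -> t - start >= timeoutMinutes);

            int missing = missingReplicas(replicas, inflight.size(), 0);
            for (int i = 0; i < missing; i++) {
                running.add(now);
                inflight.add(now);
                asked++;
            }
        }
        return asked;
    }

    public static void main(String[] args) {
        // Copy takes 4 hours, action timeout is 10 minutes, observed for 560 minutes.
        System.out.println(datanodesAsked(240, 10, 560));
        // Same copy time, but a timeout longer than the copy (for comparison only).
        System.out.println(datanodesAsked(240, 300, 560));
    }
}
```

With these made-up inputs, the first call asks 48 datanodes and the second asks 2. The absolute numbers do not match the 31 datanodes in the report, since the model ignores the slowdown on the source datanode, but it shows the same pattern: every time an action times out before the copy finishes, the check sees fewer than 3 effective replicas again and issues new commands.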