[ https://issues.apache.org/jira/browse/HDDS-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130640#comment-17130640 ]

Stephen O'Donnell commented on HDDS-3481:
-----------------------------------------

I spotted this problem, in theory, a long time back when reading the code.

I don't think it is a good idea for SCM to hand out all the replication work 
immediately. Once SCM has passed out the commands, it loses the ability to adjust 
the work later. It effectively floods downstream workers, which have no way to 
provide back pressure and indicate that they are overloaded.

E.g., suppose it needs to replicate 1000 containers and gives 500 to node 1 and 
500 to node 2. If node 1 completes its work more quickly (maybe it's under less 
read load, has faster disks, is on the same rack as the target ...), we cannot 
take some of the containers allocated to node 2 and give them to node 1 to finish 
the replication faster, as the commands are fired with no easy way to see their 
progress or cancel them.

It is better for the supervisor (SCM) to hand out the work incrementally as the 
workers have capacity for it. Even with a longer timeout, I reckon this bad 
feedback loop will still happen.

This is roughly how HDFS does it - there is a replication queue in the namenode, 
and each datanode has a limit on how many in-flight replications it can have. On 
each heartbeat, a datanode is given more work, up to its maximum. The namenode 
holds the work back until the workers have capacity to receive it. There isn't a 
feedback loop for the commands in HDFS, but the per-node work limit plus a 
relatively short deadline to complete that work results in it working well.
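
To illustrate, here is a minimal sketch of that pull model (hypothetical class 
and method names, not the actual SCM or HDFS code): the supervisor keeps a 
pending queue and, on each heartbeat, hands a datanode only as much work as its 
spare capacity allows.

{code:java}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class ReplicationDispatcher {
  // Per-datanode in-flight limit; a real system would make this configurable.
  private static final int MAX_INFLIGHT_PER_NODE = 2;

  private final Queue<String> pendingContainers = new ArrayDeque<>();
  private final Map<String, Integer> inflightPerNode = new HashMap<>();

  // Under-replicated containers are queued, not dispatched immediately.
  void enqueue(String containerId) {
    pendingContainers.add(containerId);
  }

  // Called when a datanode heartbeats: hand out work only up to its spare capacity.
  List<String> onHeartbeat(String datanodeId) {
    int inflight = inflightPerNode.getOrDefault(datanodeId, 0);
    List<String> commands = new ArrayList<>();
    while (inflight < MAX_INFLIGHT_PER_NODE && !pendingContainers.isEmpty()) {
      commands.add(pendingContainers.poll());
      inflight++;
    }
    inflightPerNode.put(datanodeId, inflight);
    return commands;  // everything else stays queued at the supervisor
  }

  // Called when the datanode reports a command finished, failed, or timed out.
  void onCommandComplete(String datanodeId) {
    inflightPerNode.merge(datanodeId, -1, Integer::sum);
  }
}
{code}

Because the unissued work stays in the supervisor's queue, it can be reassigned 
to whichever node heartbeats with free capacity first.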

> SCM ask too many datanodes to replicate the same container
> ----------------------------------------------------------
>
>                 Key: HDDS-3481
>                 URL: https://issues.apache.org/jira/browse/HDDS-3481
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: SCM
>            Reporter: runzhiwang
>            Assignee: runzhiwang
>            Priority: Blocker
>              Labels: TriagePending, pull-request-available
>         Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, 
> screenshot-4.png
>
>
> *What's the problem ?*
> As the image shows, SCM asks 31 datanodes to replicate container 2037 every 
> 10 minutes starting from 2020-04-17 23:38:51. And at 2020-04-18 08:58:52 SCM 
> finds that container 2037 has 12 replicas, so it asks 11 datanodes to delete 
> container 2037. 
>  !screenshot-1.png! 
>  !screenshot-2.png! 
> *What's the reason ?*
> SCM checks whether (container replica count + 
> inflightReplication.get(containerId).size() - 
> inflightDeletion.get(containerId).size()) is less than 3. If it is less than 3, 
> SCM asks some datanode to replicate the container and adds the action to 
> inflightReplication.get(containerId). The replication action times out after 
> 10 minutes; when an action times out, SCM removes it from 
> inflightReplication.get(containerId), as the image shows. Then (container 
> replica count + inflightReplication.get(containerId).size() - 
> inflightDeletion.get(containerId).size()) is less than 3 again, and SCM asks 
> yet another datanode to replicate the container (a simplified sketch of this 
> loop follows the quoted description below).
> Because replicating a container takes a long time, it sometimes cannot finish 
> within 10 minutes, so SCM keeps asking new datanodes every 10 minutes until 31 
> datanodes are replicating the container. 19 of the 31 datanodes replicate the 
> container from the same source datanode, which also puts heavy pressure on the 
> source datanode and makes replication even slower. In fact, it took 4 hours to 
> finish the first replication. 
>  !screenshot-4.png! 
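
A simplified, hypothetical sketch of the loop described in the quoted report 
(not the real ReplicationManager code): the under-replication check counts 
in-flight actions, but once an action times out it is simply forgotten while the 
datanode keeps copying, so SCM schedules yet another replication.

{code:java}
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReplicationCheckSketch {
  static final int REPLICATION_FACTOR = 3;
  static final Duration ACTION_TIMEOUT = Duration.ofMinutes(10);

  static final class InflightAction {
    final String datanodeId;
    final Instant scheduledAt;
    InflightAction(String datanodeId, Instant scheduledAt) {
      this.datanodeId = datanodeId;
      this.scheduledAt = scheduledAt;
    }
  }

  final Map<Long, List<InflightAction>> inflightReplication = new HashMap<>();
  final Map<Long, List<InflightAction>> inflightDeletion = new HashMap<>();

  void checkContainer(long containerId, int reportedReplicas, String candidateDatanode) {
    // Purge timed-out actions. The command may still be running on the datanode,
    // but SCM no longer counts it -- this is what re-triggers replication every
    // 10 minutes.
    inflightReplication.computeIfAbsent(containerId, k -> new ArrayList<>())
        .removeIf(a -> a.scheduledAt.plus(ACTION_TIMEOUT).isBefore(Instant.now()));

    int inflightAdds = inflightReplication.get(containerId).size();
    int inflightDeletes =
        inflightDeletion.getOrDefault(containerId, Collections.emptyList()).size();

    if (reportedReplicas + inflightAdds - inflightDeletes < REPLICATION_FACTOR) {
      // Fire another replicate command and remember it as in-flight.
      inflightReplication.get(containerId)
          .add(new InflightAction(candidateDatanode, Instant.now()));
      // sendReplicateCommand(candidateDatanode, containerId);  // hypothetical call
    }
  }
}
{code}

Since nothing cancels the still-running command on the datanode, each timeout 
adds one more replicator, which is how 31 datanodes ended up copying the same 
container.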


