[ https://issues.apache.org/jira/browse/HDFS-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582869#comment-14582869 ]
Chengbing Liu commented on HDFS-7048:
-------------------------------------

Here is a brief explanation of the patch.

On our production cluster, the balancer worked slowly. For an iteration planned to move ~500GB of data, the amount actually moved would be ~5GB. After some digging, I found that {{Source#dispatchBlocks()}} always exits prematurely at the following code, where I added logging to inform the user of the anomaly.

{code}
// jump out of while-loop after 5 iterations.
if (noPendingMoveIteration >= MAX_NO_PENDING_MOVE_ITERATIONS) {
  resetScheduledSize();
}
{code}

This is because we use a global {{Dispatcher.this}} for wait and notify, which wakes up all the unrelated {{Source}}s, even if none of their {{PendingMove}}s have finished. Each such wake-up counts as an iteration with no pending move, so the limit is reached quickly. The correct way is to wait and notify on the {{StorageGroup}}, both source and target, since the DataXceiver shares its threads between sending and receiving.

As for the wait timeout, I think we might increase it a little to avoid timing out too often. We are actually using 60 seconds in our production cluster without problems. However, as I increase the timeout, some test cases fail slowly or even time out. These test cases include some obviously unmovable cases, which in my opinion should exit immediately. But we can fix that later.

> Incorrect Balancer#Source wait/notify leads to early termination
> ----------------------------------------------------------------
>
>                 Key: HDFS-7048
>                 URL: https://issues.apache.org/jira/browse/HDFS-7048
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: balancer & mover
>    Affects Versions: 2.6.0
>            Reporter: Andrew Wang
>            Assignee: Chengbing Liu
>         Attachments: HDFS-7048.01.patch
>
>
> Split off from HDFS-6621. The Balancer attempts to wake up scheduler threads
> early as sources finish, but the synchronization with wait and notify is
> incorrect. This ticks the failure count, which can lead to early termination.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
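
To illustrate the per-group wait/notify idea from the comment, here is a minimal Java sketch. It is not the actual {{Dispatcher}} code or the attached patch: {{GroupMonitor}}, {{moveFinished()}} and {{awaitMove()}} are hypothetical names, and the sketch tracks a single group, whereas the comment suggests notifying both source and target. The point it shows is that a waiter on one monitor is not woken by a move that finished on an unrelated one.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a storage group; NOT the real Dispatcher.StorageGroup. */
final class GroupMonitor {
    // Number of finished moves on this group that no waiter has consumed yet.
    private int finishedMoves = 0;

    /** Called (in this sketch) by a mover thread when a move on this group completes. */
    synchronized void moveFinished() {
        finishedMoves++;
        notifyAll(); // wakes only threads waiting on this group's monitor
    }

    /**
     * Waits up to timeoutMillis for a move on this group to finish.
     * Returns true if a finished move was consumed, false on timeout or interrupt.
     */
    synchronized boolean awaitMove(long timeoutMillis) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (finishedMoves == 0) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false; // timed out without any finished move on this group
            }
            try {
                // wait(0) means "wait forever", so round up to at least 1 ms.
                wait(Math.max(1L, TimeUnit.NANOSECONDS.toMillis(remainingNanos)));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                return false;
            }
        }
        finishedMoves--;
        return true;
    }
}
```

With a single shared monitor, every finished move would wake every waiting source, and each source with nothing to do would count another idle iteration. Keeping a counter next to the monitor also means a notification that arrives before the waiter starts waiting is not lost.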