[
https://issues.apache.org/jira/browse/HDFS-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860046#action_12860046
]
Hairong Kuang commented on HDFS-1105:
-------------------------------------
Dmytro, I really like the improvements you proposed. We observed similar issues
with the balancer in our clusters and are considering a similar idea: limiting
the elapsed time of each iteration. I took a quick look at your patch. One
comment is that making the number of blocks to move in parallel to a given node
configurable may not be useful, because each datanode is also configured to
move 5 blocks in parallel.
> 3) it can hit namenode and the network pretty hard
This is probably caused by the NamenodeProtocol#getBlocks call. The number of
returned blocks is currently limited only by their total size. We should also
limit the total number of blocks returned, so that the response size can be
bounded, ideally within 1M bytes.
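
A minimal sketch of such a double limit, for illustration only. The class
BoundedBlockSelector, its Block type and the select method are hypothetical
names and are not part of NamenodeProtocol; the sketch only shows stopping on
whichever limit (total bytes or block count) is reached first.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the actual NamenodeProtocol#getBlocks code. */
final class BoundedBlockSelector {

    /** Minimal stand-in for a block: only an id and a length in bytes. */
    static final class Block {
        final long id;
        final long numBytes;

        Block(long id, long numBytes) {
            this.id = id;
            this.numBytes = numBytes;
        }
    }

    /**
     * Returns a prefix of candidates, stopping as soon as either the byte
     * budget or the block-count limit has been reached. In this sketch the
     * byte budget may be overshot by the last block that was added.
     */
    static List<Block> select(List<Block> candidates, long maxBytes, int maxBlocks) {
        List<Block> result = new ArrayList<>();
        long totalBytes = 0;
        for (Block b : candidates) {
            if (result.size() >= maxBlocks || totalBytes >= maxBytes) {
                break; // one of the two limits has been hit
            }
            result.add(b);
            totalBytes += b.numBytes;
        }
        return result;
    }
}
```

With a count limit in place, the number of entries in the response no longer
depends on how small the individual blocks are.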
> Balancer improvement
> --------------------
>
> Key: HDFS-1105
> URL: https://issues.apache.org/jira/browse/HDFS-1105
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Dmytro Molkov
> Attachments: HDFS-1105.patch
>
>
> We were seeing some weird issues with the balancer in our cluster:
> 1) it can get stuck during an iteration and only restarting it helps
> 2) the iterations are highly inefficient. With a 20-minute iteration it moves
> 7K blocks a minute for the first 6 minutes and hundreds of blocks in the next
> 14 minutes
> 3) it can hit namenode and the network pretty hard
> A few improvements we came up with as a result:
> Making the balancer more deterministic in terms of the running time of an
> iteration, improving the efficiency and making the load configurable:
> Make many of the constants configurable command-line parameters: iteration
> length, number of blocks to move in parallel to a given node and in the
> cluster overall.
> Terminate transfers that are still in progress after iteration is over.
> Previously iteration time was the time window in which the balancer was
> scheduling the moves and then it would wait for the moves to finish
> indefinitely. Each scheduling task can run up to iteration time or even
> longer. This means that if you have too many of them and they are long, your
> actual iterations are longer than 20 minutes. Now each scheduling task is
> given the start time of the iteration and schedules moves only if it has not
> run out of time. So the tasks that start after the iteration is over
> will not schedule any moves.
> The number of move threads and dispatch threads is configurable so that
> depending on the load of the cluster you can run it slower.
> I will attach a patch, please let me know what you think and what can be done
> better.
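
The deadline check described in the issue could be sketched roughly as follows.
This is an illustration under assumptions, not the code in HDFS-1105.patch:
IterationDeadline, isExpired and scheduleMoves are hypothetical names, and the
clock is passed in so that the behaviour is easy to test.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical sketch of deadline-bounded scheduling; not the actual patch. */
final class IterationDeadline {
    private final long deadlineNanos;

    /** Derives the deadline from the iteration start time and its length. */
    IterationDeadline(long iterationStartNanos, long iterationLength, TimeUnit unit) {
        this.deadlineNanos = iterationStartNanos + unit.toNanos(iterationLength);
    }

    /** True once the iteration's time budget has been used up. */
    boolean isExpired(long nowNanos) {
        // Subtraction keeps the comparison safe for System.nanoTime()-style values.
        return nowNanos - deadlineNanos >= 0;
    }

    /**
     * Schedules up to pendingMoves moves, checking the deadline before each
     * one. A task that starts after the deadline schedules nothing.
     * Returns the number of moves actually scheduled.
     */
    int scheduleMoves(int pendingMoves, LongSupplier clock, Runnable scheduleOne) {
        int scheduled = 0;
        while (scheduled < pendingMoves && !isExpired(clock.getAsLong())) {
            scheduleOne.run();
            scheduled++;
        }
        return scheduled;
    }
}
```

In real use the clock would typically be System::nanoTime and every scheduling
task of one iteration would share the same IterationDeadline instance.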