[ 
https://issues.apache.org/jira/browse/HDFS-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998534#comment-15998534
 ] 

Kihwal Lee commented on HDFS-11742:
-----------------------------------

Instead of reverting, I am making a simple change to make it more usable.  This 
will prevent users from hitting the same issues we had.  The changes from 
HDFS-8188 do allow running the balancer at a higher throughput, but getting 
there requires tuning multiple knobs.  And when it runs slower than in the 
previous release, users will have no clue why.  The default config 
values may result in degraded performance for users running a cluster with more 
than 20 nodes.

The main problem with HDFS-8188 is the way a thread pool is created per target.  
If the total reaches the limit (max mover threads), the remaining pending moves 
are simply dropped (or, even worse, the balancer hangs without HDFS-11377), 
leading to degraded performance as demonstrated above with graphs.  The 
suggested workaround of "set the mover thread limit to 10,000 or 30,000" simply 
means removing the limit, i.e., the design cannot work with the limit in place.
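
To make the failure mode concrete, here is a minimal sketch of fixed-size per-target pools drawing from one global thread budget. It is not the balancer's code: the class, method, and parameter names are hypothetical, and the values 1,000 and 50 used in the note below are illustrative assumptions, not numbers taken from this comment.

```java
// Hypothetical model, not Hadoop code: each target asks for a pool of the
// same fixed size until the global mover-thread budget is used up.
final class FixedPoolExhaustion {
    private FixedPoolExhaustion() {}

    /** Returns how many targets end up with no mover threads at all. */
    static int targetsLeftWithoutThreads(int maxMoverThreads,
                                         int fixedPoolSize,
                                         int numTargets) {
        if (maxMoverThreads < 0 || fixedPoolSize <= 0 || numTargets < 0) {
            throw new IllegalArgumentException("invalid sizing parameters");
        }
        int remaining = maxMoverThreads;
        int starved = 0;
        for (int i = 0; i < numTargets; i++) {
            // A target gets its full pool if possible, else whatever is left.
            int granted = Math.min(fixedPoolSize, remaining);
            remaining -= granted;
            if (granted == 0) {
                starved++; // moves scheduled for this target cannot run
            }
        }
        return starved;
    }
}
```

Under these assumed values, `targetsLeftWithoutThreads(1000, 50, 280)` reports 260 starved targets, while any cluster of 20 or fewer targets is fully served. Raising the budget to 10,000 or more only moves that threshold; it does not change the behavior.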

The suggested improvement calculates the size of each mover thread pool, 
instead of using the configured fixed value.  The total thread count limit is 
honored without causing the degradation seen with the original design. 
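
The idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the attached patch: the class, method, and parameter names are hypothetical, and the exact rounding and fallback rules in the real change may differ. The pool size per target is derived from the global budget and the number of targets, capped by a per-node maximum, and a small allocator keeps the total within the limit.

```java
// Hypothetical sketch, not the HDFS-11742 patch itself.
final class MoverPoolSizing {
    private MoverPoolSizing() {}

    /**
     * Size of each target's mover pool: an equal share of the global budget,
     * capped at the per-node maximum and never below one thread.
     */
    static int perTargetPoolSize(int maxMoverThreads,
                                 int maxConcurrentMovesPerNode,
                                 int numTargets) {
        if (maxMoverThreads <= 0 || maxConcurrentMovesPerNode <= 0
                || numTargets <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        int share = maxMoverThreads / numTargets; // integer division
        return Math.max(1, Math.min(share, maxConcurrentMovesPerNode));
    }

    /** Hands out threads from a fixed budget so the total is never exceeded. */
    static final class Allocator {
        private final int max;
        private int allocated = 0;

        Allocator(int max) {
            this.max = max;
        }

        /** Grants up to n threads; returns 0 once the budget is spent. */
        synchronized int allocate(int n) {
            int granted = Math.max(0, Math.min(n, max - allocated));
            allocated += granted;
            return granted;
        }

        /** Releases everything, e.g. at the start of a new iteration. */
        synchronized void reset() {
            allocated = 0;
        }
    }
}
```

With an assumed budget of 1,000 threads and a per-node cap of 50, 10 targets would each get 50 threads, 280 targets would each get 3, and 2,400 targets would each request 1, with the allocator refusing requests once the budget is exhausted. Throughput then degrades gradually as the cluster grows instead of dropping off once the limit is hit.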


> Improve balancer usability after HDFS-8188
> ------------------------------------------
>
>                 Key: HDFS-11742
>                 URL: https://issues.apache.org/jira/browse/HDFS-11742
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Blocker
>         Attachments: balancer2.8.png, HDFS-11742.branch-2.8.patch, 
> HDFS-11742.branch-2.patch, HDFS-11742.trunk.patch
>
>
> We ran the 2.8 balancer with HDFS-8818 on a 280-node and a 2,400-node cluster. 
> In both cases, it would hang forever after two iterations. Those two 
> iterations were also moving blocks at a significantly lower rate. The hang 
> itself is fixed by HDFS-11377, but the design limitation remains, so the 
> balancer throughput actually ends up lower.
> Instead of reverting HDFS-8188 as originally suggested, I am making a small 
> change to make it less error-prone and more usable.



