[ https://issues.apache.org/jira/browse/HDFS-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998534#comment-15998534 ]

Kihwal Lee commented on HDFS-11742:
-----------------------------------

Instead of reverting, I am making a simple change to make it more usable. This will prevent users from hitting the same issues we had.

The changes from HDFS-8188 do allow running the balancer at a higher throughput, but getting there requires turning multiple knobs. And when it runs slower than the previous release, users will have no clue why. The default config values may result in degraded performance for users running a cluster with more than 20 nodes.

The main problem with HDFS-8188 is the way a thread pool is created per target. If it reaches the limit (max mover threads), the remaining pending moves are simply dropped (or, even worse, it hangs without HDFS-11377), leading to degraded performance as demonstrated above with graphs. The suggested workaround of "set the mover thread limit to 10,000 or 30,000" simply means removing the limit, i.e. it cannot work with the limit.

The suggested improvement calculates the size of each mover thread pool instead of using the configured fixed value. The total thread count limit is honored without causing the degradation seen with the original design.

> Improve balancer usability after HDFS-8188
> ------------------------------------------
>
>                 Key: HDFS-11742
>                 URL: https://issues.apache.org/jira/browse/HDFS-11742
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Blocker
>         Attachments: balancer2.8.png, HDFS-11742.branch-2.8.patch, HDFS-11742.branch-2.patch, HDFS-11742.trunk.patch
>
>
> We ran 2.8 balancer with HDFS-8818 on a 280-node and a 2,400-node cluster. In both cases, it would hang forever after two iterations. The two iterations were also moving things at a significantly lower rate. The hang itself is fixed by HDFS-11377, but the design limitation remains, so the balancer throughput ends up actually lower.
>
> Instead of reverting HDFS-8188 as originally suggested, I am making a small change to make it less error prone and more usable.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
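
For readers who want a concrete picture of the idea in the comment above (sizing each per-target mover pool from a shared budget rather than from a fixed configured value), here is a minimal sketch. It is not the code from the attached patches: the class name `MoverThreadBudget` and its methods are hypothetical, and the real balancer may compute pool sizes differently. The only point illustrated is that each target receives at most what remains of the total limit, so the limit is honored without unbounded growth in thread count.

```java
/**
 * Hypothetical illustration only; not taken from the HDFS-11742 patches.
 * Tracks a global thread budget and grants each target a pool size
 * that never pushes the total past the configured limit.
 */
class MoverThreadBudget {
  private final int totalLimit; // overall cap on mover threads (assumed config value)
  private int allocated = 0;    // threads granted so far in this iteration

  MoverThreadBudget(int totalLimit) {
    this.totalLimit = totalLimit;
  }

  /**
   * Grants up to `requested` threads for one target's pool.
   * Returns 0 when the budget is exhausted, so the caller can
   * decide how to handle that target instead of overshooting the limit.
   */
  synchronized int allocate(int requested) {
    int remaining = totalLimit - allocated;
    int granted = Math.max(0, Math.min(requested, remaining));
    allocated += granted;
    return granted;
  }

  /** Returns the whole budget, for example at the start of a new iteration. */
  synchronized void reset() {
    allocated = 0;
  }
}
```

With a budget of 10 and targets each asking for 4 threads, successive calls to `allocate(4)` would grant 4, 4, 2 and then 0. A caller could pass a positive result to `java.util.concurrent.Executors.newFixedThreadPool` to create that target's pool.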