[ https://issues.apache.org/jira/browse/HDFS-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128740#comment-14128740 ]
Yongjun Zhang commented on HDFS-6621:
-------------------------------------

Hi [~ravwojdyla],

Thanks for your quick response. Would you please rebase to the latest trunk and submit a new rev? It doesn't apply currently.

Hi [~szetszwo],

Since the fix is a quite intuitive one-liner, and it's pretty challenging to write a test case, do you think we can just commit the fix after Jenkins?

Thanks.

> Hadoop Balancer prematurely exits iterations
> --------------------------------------------
>
>                 Key: HDFS-6621
>                 URL: https://issues.apache.org/jira/browse/HDFS-6621
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: balancer
>    Affects Versions: 2.2.0, 2.4.0
>         Environment: Red Hat Enterprise Linux Server release 5.8 with Hadoop 2.4.0
>            Reporter: Benjamin Bowman
>              Labels: balancer
>         Attachments: HDFS-6621.patch, HDFS-6621.patch_2, HDFS-6621.patch_3, HDFS-6621.patch_4
>
>
> I have been having an issue with the balancing being too slow. The issue was
> not with the speed with which blocks were moved, but rather the balancer
> would prematurely exit out of its balancing iterations. It would move ~10
> blocks or 100 MB then exit the current iteration (in which it said it was
> planning on moving about 10 GB).
>
> I looked in the Balancer.java code and believe I found and solved the issue.
> In the dispatchBlocks() function there is a variable,
> "noPendingBlockIteration", which counts the number of iterations in which a
> pending block to move cannot be found. Once this number gets to 5, the
> balancer exits the overall balancing iteration. I believe the desired
> functionality is 5 consecutive no pending block iterations - however this
> variable is never reset to 0 upon block moves. So once this number reaches 5
> - even if there have been thousands of blocks moved in between these no
> pending block iterations - the overall balancing iteration will prematurely
> end.
>
> The fix I applied was to set noPendingBlockIteration = 0 when a pending block
> is found and scheduled. In this way, my iterations do not prematurely exit
> unless there are 5 consecutive no pending block iterations. Below is a copy
> of my dispatchBlocks() function with the change I made.
>
> {code}
>     private void dispatchBlocks() {
>       long startTime = Time.now();
>       long scheduledSize = getScheduledSize();
>       this.blocksToReceive = 2*scheduledSize;
>       boolean isTimeUp = false;
>       int noPendingBlockIteration = 0;
>       while(!isTimeUp && getScheduledSize()>0 &&
>           (!srcBlockList.isEmpty() || blocksToReceive>0)) {
>         PendingBlockMove pendingBlock = chooseNextBlockToMove();
>         if (pendingBlock != null) {
>           noPendingBlockIteration = 0;
>           // move the block
>           pendingBlock.scheduleBlockMove();
>           continue;
>         }
>         /* Since we can not schedule any block to move,
>          * filter any moved blocks from the source block list and
>          * check if we should fetch more blocks from the namenode
>          */
>         filterMovedBlocks(); // filter already moved blocks
>         if (shouldFetchMoreBlocks()) {
>           // fetch new blocks
>           try {
>             blocksToReceive -= getBlockList();
>             continue;
>           } catch (IOException e) {
>             LOG.warn("Exception while getting block list", e);
>             return;
>           }
>         } else {
>           // source node cannot find a pendingBlockToMove, iteration +1
>           noPendingBlockIteration++;
>           // in case no blocks can be moved for source node's task,
>           // jump out of while-loop after 5 iterations.
>           if (noPendingBlockIteration >= MAX_NO_PENDING_BLOCK_ITERATIONS) {
>             setScheduledSize(0);
>           }
>         }
>         // check if time is up or not
>         if (Time.now()-startTime > MAX_ITERATION_TIME) {
>           isTimeUp = true;
>           continue;
>         }
>         /* Now we can not schedule any block to move and there are
>          * no new blocks added to the source block list, so we wait.
>          */
>         try {
>           synchronized(Balancer.this) {
>             Balancer.this.wait(1000); // wait for targets/sources to be idle
>           }
>         } catch (InterruptedException ignored) {
>         }
>       }
>     }
>   }
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
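
The difference between counting total misses and counting consecutive misses, as described in the report, can be seen in isolation with a small sketch. This is not the Balancer code: the class name NoPendingCounterDemo, the method attemptsBeforeExit, and the boolean-array model of "a block was scheduled / no block was found" are all hypothetical, chosen only to illustrate the effect of resetting the counter.

```java
/** Hypothetical illustration only; this is not part of Balancer.java. */
class NoPendingCounterDemo {

    /**
     * Walks a sequence of attempts (true = a block was found and scheduled,
     * false = no pending block could be found) and returns how many attempts
     * are processed before the loop gives up.
     *
     * @param resetOnSuccess if true, the miss counter is cleared whenever a
     *                       block is scheduled, so only consecutive misses count
     */
    static int attemptsBeforeExit(boolean[] attempts, int maxMisses, boolean resetOnSuccess) {
        int misses = 0;
        int processed = 0;
        for (boolean scheduled : attempts) {
            processed++;
            if (scheduled) {
                if (resetOnSuccess) {
                    misses = 0; // the one-line change discussed in the report
                }
                continue;
            }
            misses++;
            if (misses >= maxMisses) {
                break; // give up, analogous to ending the balancing iteration
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        // Alternating miss/hit: there are never 5 misses in a row.
        boolean[] attempts = new boolean[20];
        for (int i = 0; i < attempts.length; i++) {
            attempts[i] = (i % 2 == 1);
        }
        System.out.println(attemptsBeforeExit(attempts, 5, false)); // prints 9
        System.out.println(attemptsBeforeExit(attempts, 5, true));  // prints 20
    }
}
```

Without the reset, the loop stops at the fifth miss overall even though blocks were scheduled in between; with the reset, it runs through all 20 attempts because no 5 misses occur back to back.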