Derek Dagit created HDFS-4366:
---------------------------------

             Summary: Block Replication Policy Implementation May Skip 
Higher-Priority Blocks for Lower-Priority Blocks
                 Key: HDFS-4366
                 URL: https://issues.apache.org/jira/browse/HDFS-4366
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 0.23.5, 1.1.1, 3.0.0
            Reporter: Derek Dagit
            Assignee: Derek Dagit


In certain cases, higher-priority under-replicated blocks can be skipped by the 
replication policy implementation.  The current implementation maintains, for 
each priority level, an index into a list of blocks that are under-replicated.  
Together, the lists compose a priority queue (see note later about 
branch-0.23).  In some cases when blocks are removed from a list, the caller 
(BlockManager) properly handles the index into the list from which it removed a 
block.  In some other cases, the index remains stationary while the list 
changes.  Whenever this happens, and the removed block happened to be at or 
before the index, the implementation will skip over a block when selecting 
blocks for replication work.

In situations when entire racks are decommissioned, leading to many 
under-replicated blocks, loss of blocks can occur.


Background: HDFS-1765

This patch to trunk greatly improved the state of the replication policy 
implementation.  Prior to the patch, the following details were true:
        * The block "priority queue" was no such thing: It was really set of 
trees that held blocks in natural ordering, that being by the blocks ID, which 
resulted in iterator walks over the blocks in pseudo-random order.
        * There was only a single index into an iteration over all of the 
blocks...
        * ... meaning the implementation was only successful in respecting 
priority levels on the first pass.  Overall, the behavior was a 
round-robin-type scheduling of blocks.

After the patch
        * A proper priority queue is implemented, preserving log(n) operations 
while iterating over blocks in the order added.
        * A separate index for each priority is key is kept...
        * ... allowing for processing of the highest priority blocks first 
regardless of which priority had last been processed.

The change was suggested for branch-0.23 as well as trunk, but it does not 
appear to have been pulled in.


The problem:

Although the indices are now tracked in a better way, there is a 
synchronization issue since the indices are managed outside of methods to 
modify the contents of the queue.

Removal of a block from a priority level without adjusting the index can mean 
that the index then points to the block after the block it originally pointed 
to.  In the next round of scheduling for that priority level, the block 
originally pointed to by the index is skipped.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to