Eric Sirianni created HDFS-5380:
-----------------------------------

             Summary: NameNode returns stale block locations to clients during excess replica pruning
                 Key: HDFS-5380
                 URL: https://issues.apache.org/jira/browse/HDFS-5380
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 1.2.1, 2.0.0-alpha
            Reporter: Eric Sirianni
            Priority: Minor


Consider the following contrived example:
{code}
// Step 1: Create file with replication factor = 2
Path path = ...;
short replication = 2;
OutputStream os = fs.create(path, ..., replication, ...);

// Step 2: Write to file
os.write(...);

// Step 3: Reduce replication factor to 1
fs.setReplication(path, 1);
// Wait for the NameNode to prune the excess replica

// Step 4: Read from file
InputStream is = fs.open(path);
is.read(...);
{code}
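
For reference, a runnable version of the same sequence might look like the following (the path, payload size, and fixed wait are illustrative choices, not part of the original report):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StaleLocationRepro {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Steps 1-2: create a file with replication factor 2 and write to it
    Path path = new Path("/tmp/stale-location-repro");      // illustrative path
    short replication = 2;
    FSDataOutputStream os = fs.create(path, replication);
    os.write(new byte[8 * 1024 * 1024]);                    // illustrative payload
    os.close();

    // Step 3: reduce the replication factor to 1 and give the NameNode time
    // to schedule the excess replica for invalidation
    fs.setReplication(path, (short) 1);
    Thread.sleep(30000L);                                    // illustrative wait

    // Step 4: read the file back; the client can still be handed the location
    // of the replica the NameNode has already asked a DataNode to delete
    FSDataInputStream is = fs.open(path);
    byte[] buf = new byte[4096];
    while (is.read(buf) != -1) {
      // discard
    }
    is.close();
  }
}
{code}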

During the read in _Step 4_, the {{DFSInputStream}} client receives "stale" 
block locations from the NameNode.  Specifically, it receives block locations 
that the NameNode has already pruned/invalidated (and the DataNodes have 
already deleted).
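
One way to observe this from the client side, using only the public {{FileSystem}} API (the class and method names below are mine, purely for illustration), is to dump the hosts the NameNode currently advertises for each block between _Step 3_ and _Step 4_:
{code}
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class LocationDumper {
  /**
   * Prints the DataNode hosts the NameNode currently advertises for each
   * block of the file.  Called between Step 3 and Step 4 of the example,
   * a host that has already deleted its replica can still show up here.
   */
  static void dumpAdvertisedLocations(FileSystem fs, Path path) throws IOException {
    FileStatus stat = fs.getFileStatus(path);
    BlockLocation[] locations = fs.getFileBlockLocations(stat, 0, stat.getLen());
    for (BlockLocation loc : locations) {
      System.out.println(loc.getOffset() + " -> " + Arrays.toString(loc.getHosts()));
    }
  }
}
{code}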

The net effect of this is unnecessary churn in the {{DFSClient}} (timeouts, 
retries, extra RPCs, etc.).  In particular:
{noformat}
WARN  hdfs.DFSClient - Failed to connect to datanode-1 for block, add to 
deadNodes and continue.
{noformat}

The blacklisting of DataNodes that are, in fact, functioning properly can lead 
to poor read locality.  Since the blacklist is _cumulative_ across all blocks 
in the file, this can have a noticeable impact for large files.

A pathological case can occur when *all* block locations are in the blacklist.  
In this case, the {{DFSInputStream}} will clear the blacklist, sleep, and 
refetch locations from the NameNode, adding unnecessary RPCs and a client-side 
delay:
{noformat}
INFO  hdfs.DFSClient - Could not obtain blk_1073741826_1002 from any node: 
java.io.IOException: No live nodes contain current block. Will get new block 
locations from namenode and retry...
{noformat}

This pathological case can occur in the following example (for a read of file 
{{foo}}); a simplified model of the client-side behavior is sketched after the 
list:
# {{DFSInputStream}} attempts to read block 1 of {{foo}}.
# Gets locations: {{( dn1(stale), dn2 )}}
# Attempts read from {{dn1}}.  Fails.  Adds {{dn1}} to blacklist.
# {{DFSInputStream}} attempts to read block 2 of {{foo}}.
# Gets locations: {{( dn1, dn2(stale) )}}
# Attempts read from {{dn2}} ({{dn1}} already blacklisted).  Fails.  Adds 
{{dn2}} to blacklist.
# All locations for block 2 are now in the blacklist.
# Clears the blacklist
# Sleeps up to 3 seconds
# Refetches locations from the NameNode
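
The following is a simplified model of that control flow (it is *not* the 
actual {{DFSInputStream}} code; the class and method names are invented to 
illustrate the cumulative-blacklist behavior):
{code}
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model of the behavior described above -- NOT the actual
// DFSInputStream code.  The key point is that deadNodes is shared across all
// blocks of the file, so a node blacklisted while reading block 1 is also
// skipped for block 2, even if its replica of block 2 is perfectly healthy.
class BlacklistModel {
  private final Set<String> deadNodes = new HashSet<String>();

  String chooseDataNode(List<String> locationsForBlock) throws InterruptedException {
    while (true) {
      for (String dn : locationsForBlock) {
        if (!deadNodes.contains(dn)) {
          return dn;                     // first non-blacklisted location wins
        }
      }
      // Every location is blacklisted: clear the blacklist, back off, and
      // refetch locations from the NameNode (extra RPC + client-side sleep).
      deadNodes.clear();
      Thread.sleep(3000);                // "sleeps up to 3 seconds"
      locationsForBlock = refetchLocationsFromNameNode();
    }
  }

  void markDead(String dn) {
    deadNodes.add(dn);                   // cumulative across the whole file
  }

  private List<String> refetchLocationsFromNameNode() {
    // stand-in for the extra getBlockLocations() RPC to the NameNode
    return Arrays.asList("dn1", "dn2");
  }
}
{code}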

A solution would be to change the NameNode so that it does not return stale 
block locations to clients for replicas that it knows it has asked DataNodes 
to invalidate.
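
Concretely, when building the located-block list for a client, the NameNode 
would skip any (block, DataNode) pair it has already queued for invalidation.  
A minimal sketch of the idea (the class and field names are hypothetical, not 
existing {{BlockManager}} members):
{code}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the real BlockManager: pendingInvalidation stands
// in for the NameNode's record of (block, datanode) pairs it has already
// asked to be deleted.
class StaleLocationFilter {
  private final Set<String> pendingInvalidation = new HashSet<String>(); // "blockId@datanode"

  List<String> locationsToReturn(String blockId, List<String> knownLocations) {
    List<String> live = new ArrayList<String>();
    for (String datanode : knownLocations) {
      if (!pendingInvalidation.contains(blockId + "@" + datanode)) {
        live.add(datanode);   // only advertise replicas not scheduled for deletion
      }
    }
    return live;
  }
}
{code}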

A quick look at the {{BlockManager.chooseExcessReplicates()}} code path seems 
to indicate that the NameNode does not actually remove the pruned replica from 
the {{BlocksMap}} until the subsequent blockReport is received.  This can leave 
a substantial window where the NameNode can return stale replica locations to 
clients.

If the NameNode were to proactively update the {{BlocksMap}} upon excess 
replica pruning, this situation could be avoided.  If the DataNode did not in 
fact invalidate the replica as asked, the NameNode would simply re-add the 
replica to the {{BlocksMap}} upon next blockReport and go through the pruning 
exercise again.
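
To illustrate the proposed ordering, here is a toy model of the bookkeeping 
(again, this is not the actual {{BlockManager}}/{{BlocksMap}} code; the names 
are stand-ins):
{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the proposed bookkeeping -- not the actual BlockManager code.
// blockMap stands in for the BlocksMap (block -> datanodes believed to hold it).
class ProactivePruningModel {
  private final Map<String, Set<String>> blockMap = new HashMap<String, Set<String>>();

  /** Proposed behavior when an excess replica is chosen for pruning. */
  void pruneExcessReplica(String blockId, String datanode) {
    scheduleInvalidation(blockId, datanode);     // existing behavior
    Set<String> holders = blockMap.get(blockId); // proposed: forget the location now,
    if (holders != null) {                       // so it is never returned to clients
      holders.remove(datanode);
    }
  }

  /**
   * If the DataNode never actually deleted the replica, its next block report
   * simply re-adds the location and excess-replica pruning runs again.
   */
  void onBlockReport(String datanode, Set<String> reportedBlocks) {
    for (String blockId : reportedBlocks) {
      Set<String> holders = blockMap.get(blockId);
      if (holders == null) {
        holders = new HashSet<String>();
        blockMap.put(blockId, holders);
      }
      holders.add(datanode);
    }
  }

  private void scheduleInvalidation(String blockId, String datanode) {
    // stand-in for queueing the delete/invalidate work for the DataNode
  }
}
{code}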


