Eric Sirianni created HDFS-5380:
-----------------------------------

             Summary: NameNode returns stale block locations to clients during excess replica pruning
                 Key: HDFS-5380
                 URL: https://issues.apache.org/jira/browse/HDFS-5380
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 1.2.1, 2.0.0-alpha
            Reporter: Eric Sirianni
            Priority: Minor
Consider the following contrived example:
{code}
// Step 1: Create file with replication factor = 2
Path path = ...;
short replication = 2;
OutputStream os = fs.create(path, ..., replication, ...);

// Step 2: Write to file
os.write(...);

// Step 3: Reduce replication factor to 1
fs.setReplication(path, (short) 1);
// Wait for namenode to prune excess replicas

// Step 4: Read from file
InputStream is = fs.open(path);
is.read(...);
{code}

During the read in _Step 4_, the {{DFSInputStream}} client receives "stale" block locations from the NameNode. Specifically, it receives block locations that the NameNode has already pruned/invalidated (and that the DataNodes have already deleted).

The net effect is unnecessary churn in the {{DFSClient}} (timeouts, retries, extra RPCs, etc.). In particular:
{noformat}
WARN hdfs.DFSClient - Failed to connect to datanode-1 for block, add to deadNodes and continue.
{noformat}
This blacklisting of DataNodes that are, in fact, functioning properly can lead to poor read locality. Since the blacklist is _cumulative_ across all blocks in the file, it can have a noticeable impact for large files.

A pathological case occurs when *all* locations for a block end up in the blacklist. In this case, the {{DFSInputStream}} sleeps and refetches locations from the NameNode, causing unnecessary RPCs and client-side delay:
{noformat}
INFO hdfs.DFSClient - Could not obtain blk_1073741826_1002 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
{noformat}

This pathological case can arise as follows (for a read of file {{foo}}):
# {{DFSInputStream}} attempts to read block 1 of {{foo}}.
# Gets locations: {{( dn1(stale), dn2 )}}
# Attempts a read from {{dn1}}. Fails. Adds {{dn1}} to the blacklist.
# {{DFSInputStream}} attempts to read block 2 of {{foo}}.
# Gets locations: {{( dn1, dn2(stale) )}}
# Attempts a read from {{dn2}} ({{dn1}} is already blacklisted). Fails. Adds {{dn2}} to the blacklist.
# All locations for block 2 are now blacklisted.
# Clears the blacklist.
# Sleeps for up to 3 seconds.
# Refetches locations from the NameNode.

A solution would be to change the NameNode so that it does not return block locations to clients for replicas that it knows it has asked DataNodes to invalidate.

A quick look at the {{BlockManager.chooseExcessReplicates()}} code path seems to indicate that the NameNode does not actually remove the pruned replica from the {{BlocksMap}} until the subsequent blockReport is received. This leaves a substantial window during which the NameNode can return stale replica locations to clients.

If the NameNode were to proactively update the {{BlocksMap}} upon excess replica pruning, this situation could be avoided. If a DataNode did not in fact invalidate the replica as asked, the NameNode would simply re-add the replica to the {{BlocksMap}} upon the next blockReport and go through the pruning exercise again.
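To make the proposed fix concrete, here is a minimal, self-contained toy model of the intended NameNode-side behavior. This is *not* {{BlockManager}} code; the class and method names ({{ExcessReplicaPruningSketch}}, {{pruneExcessReplicas}}, {{processBlockReport}}) are hypothetical stand-ins, and the maps below merely stand in for the real {{BlocksMap}} and invalidation queues. The sketch only shows the intended effect: once a replica is chosen for invalidation, its location is dropped immediately from the map that serves client location lookups, and a later block report that still lists the replica just causes it to be re-added and pruned again.
{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Toy model (not actual BlockManager code) of the proposed behavior: when an
 * excess replica is chosen for invalidation, its location is dropped from the
 * block map immediately, so clients never receive it. If a later block report
 * shows the replica still exists, it is simply re-added and pruned again.
 */
public class ExcessReplicaPruningSketch {

  // blockId -> DataNodes currently considered valid locations (stands in for the BlocksMap)
  private final Map<Long, Set<String>> blockLocations = new HashMap<>();

  // blockId -> DataNodes that have been asked to delete their replica
  private final Map<Long, Set<String>> pendingInvalidations = new HashMap<>();

  void addLocation(long blockId, String datanode) {
    blockLocations.computeIfAbsent(blockId, k -> new LinkedHashSet<>()).add(datanode);
  }

  /** Prune replicas beyond the target replication factor. */
  void pruneExcessReplicas(long blockId, int targetReplication) {
    Set<String> locations = blockLocations.getOrDefault(blockId, Collections.emptySet());
    Set<String> pending = pendingInvalidations.computeIfAbsent(blockId, k -> new HashSet<>());

    // Prune replicas we have already asked a DataNode to delete first (e.g. a stale
    // replica re-reported in a block report), then any other excess replica.
    List<String> candidates = new ArrayList<>();
    for (String dn : locations) {
      if (pending.contains(dn)) {
        candidates.add(0, dn);
      } else {
        candidates.add(dn);
      }
    }

    for (String excess : candidates) {
      if (locations.size() <= targetReplication) {
        break;
      }
      pending.add(excess);       // ask the owning DataNode to delete the replica
      locations.remove(excess);  // proactively drop the location right away
    }
  }

  /** Locations handed to clients; replicas chosen for invalidation are never included. */
  Set<String> getLocations(long blockId) {
    return Collections.unmodifiableSet(
        blockLocations.getOrDefault(blockId, Collections.emptySet()));
  }

  /** Block report handling: a still-present stale replica is re-added and pruned again. */
  void processBlockReport(String datanode, long blockId, int targetReplication) {
    addLocation(blockId, datanode);
    pruneExcessReplicas(blockId, targetReplication);
  }

  public static void main(String[] args) {
    ExcessReplicaPruningSketch nn = new ExcessReplicaPruningSketch();
    nn.addLocation(1L, "dn1");
    nn.addLocation(1L, "dn2");

    // Replication factor reduced from 2 to 1: one replica (dn1) is chosen for deletion
    // and its location is dropped immediately, so a reader never sees it.
    nn.pruneExcessReplicas(1L, 1);
    System.out.println("after pruning:      " + nn.getLocations(1L));

    // A later block report still lists the pruned replica on dn1: it is re-added
    // and immediately pruned again rather than being handed out to clients.
    nn.processBlockReport("dn1", 1L, 1);
    System.out.println("after block report: " + nn.getLocations(1L));
  }
}
{code}
The point of the sketch is that the pruning decision and the location-map update happen together, so there is no window in which a client can be handed a location that the NameNode has already asked a DataNode to delete.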