[ https://issues.apache.org/jira/browse/HDFS-5380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eric Sirianni updated HDFS-5380:
--------------------------------

    Attachment: ExcessReplicaPruningTest.java

JUnit test that demonstrates this issue using {{MiniDFSCluster}}
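
For anyone who wants to see the shape of such a reproduction without downloading the attachment, here is a minimal sketch (this is *not* the attached {{ExcessReplicaPruningTest.java}}; the class name, file size, and sleep-based wait for pruning are assumptions, and it targets the 2.x {{MiniDFSCluster}} Builder API):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.junit.Test;

public class ExcessReplicaPruningSketchTest {

  @Test
  public void readAfterReplicationReduction() throws Exception {
    Configuration conf = new Configuration();
    // Two DataNodes so that reducing the replication factor creates an excess replica
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(2).build();
    try {
      cluster.waitActive();
      DistributedFileSystem fs = cluster.getFileSystem();
      Path path = new Path("/excessReplicaPruning/foo");

      // Steps 1+2: create the file with replication factor 2 and write some data
      FSDataOutputStream os = fs.create(path, (short) 2);
      os.write(new byte[4 * 1024]);
      os.close();

      // Step 3: reduce the replication factor to 1
      fs.setReplication(path, (short) 1);

      // Crude wait for the NameNode to schedule pruning and the DataNode to
      // delete its excess replica (the timing here is an assumption)
      Thread.sleep(30 * 1000);

      // Step 4: read the file back; with the bug present, the DFSClient logs
      // "Failed to connect to ... add to deadNodes and continue" for the
      // DataNode whose replica was already invalidated
      FSDataInputStream is = fs.open(path);
      byte[] buf = new byte[4 * 1024];
      while (is.read(buf) != -1) {
        // drain the stream
      }
      is.close();
    } finally {
      cluster.shutdown();
    }
  }
}
{code}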

> NameNode returns stale block locations to clients during excess replica pruning
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-5380
>                 URL: https://issues.apache.org/jira/browse/HDFS-5380
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.0.0-alpha, 1.2.1
>            Reporter: Eric Sirianni
>            Priority: Minor
>         Attachments: ExcessReplicaPruningTest.java
>
>
> Consider the following contrived example:
> {code}
> // Step 1: Create file with replication factor = 2
> Path path = ...;
> short replication = 2;
> OutputStream os = fs.create(path, ..., replication, ...);
> // Step 2: Write to file
> os.write(...);
> // Step 3: Reduce replication factor to 1
> fs.setReplication(path, (short) 1);
> // Wait for namenode to prune excess replicas
> // Step 4: Read from file
> InputStream is = fs.open(path);
> is.read(...);
> {code}
> During the read in _Step 4_, the {{DFSInputStream}} client receives "stale" block locations from the NameNode. Specifically, it receives block locations that the NameNode has already pruned/invalidated (and that the DataNodes have already deleted).
> The net effect is unnecessary churn in the {{DFSClient}} (timeouts, retries, extra RPCs, etc.). In particular:
> {noformat}
> WARN hdfs.DFSClient - Failed to connect to datanode-1 for block, add to deadNodes and continue.
> {noformat}
> The blacklisting of DataNodes that are, in fact, functioning properly can lead to poor read locality. Since the blacklist is _cumulative_ across all blocks in the file, this can have a noticeable impact for large files.
> A pathological case occurs when *all* of a block's locations are in the blacklist. In this case, the {{DFSInputStream}} will sleep and refetch locations from the NameNode, causing unnecessary RPCs and a client-side sleep:
> {noformat}
> INFO hdfs.DFSClient - Could not obtain blk_1073741826_1002 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
> {noformat}
> This pathological case can occur in the following sequence (for a read of file {{foo}}):
> # {{DFSInputStream}} attempts to read block 1 of {{foo}}.
> # Gets locations: {{( dn1(stale), dn2 )}}
> # Attempts read from {{dn1}}. Fails. Adds {{dn1}} to the blacklist.
> # {{DFSInputStream}} attempts to read block 2 of {{foo}}.
> # Gets locations: {{( dn1, dn2(stale) )}}
> # Attempts read from {{dn2}} ({{dn1}} is already blacklisted). Fails. Adds {{dn2}} to the blacklist.
> # All locations for block 2 are now in the blacklist.
> # Clears the blacklist.
> # Sleeps up to 3 seconds.
> # Refetches locations from the NameNode.
> A solution would be to change the NameNode so that it does not return block locations to clients for replicas that it knows it has asked DataNodes to invalidate.
> A quick look at the {{BlockManager.chooseExcessReplicates()}} code path seems to indicate that the NameNode does not actually remove the pruned replica from the {{BlocksMap}} until the subsequent block report is received. This leaves a substantial window during which the NameNode can return stale replica locations to clients.
> If the NameNode were to proactively update the {{BlocksMap}} upon excess replica pruning, this situation could be avoided. If the DataNode did not in fact invalidate the replica as asked, the NameNode would simply re-add the replica to the {{BlocksMap}} upon the next block report and go through the pruning exercise again.
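
To make the proposed behavior concrete, here is a purely illustrative sketch of the bookkeeping it implies; this is *not* {{BlockManager}} code, and the class, method names, and use of block IDs / DataNode UUIDs as keys are all made up for the example. It tracks replicas chosen for invalidation, filters them out of client-visible locations, and drops the tracking entry on the next block report so a not-actually-deleted replica simply goes through pruning again, which matches the intent described above even though the real fix might instead update the {{BlocksMap}} directly:

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only -- not NameNode code. */
public class PendingInvalidationTracker {

  // Block ID -> DataNode UUIDs whose replica has been chosen for pruning
  private final Map<Long, Set<String>> pendingInvalidations =
      new HashMap<Long, Set<String>>();

  /** Called when an excess replica on 'datanodeUuid' is chosen for pruning. */
  public synchronized void markPendingInvalidation(long blockId, String datanodeUuid) {
    Set<String> nodes = pendingInvalidations.get(blockId);
    if (nodes == null) {
      nodes = new HashSet<String>();
      pendingInvalidations.put(blockId, nodes);
    }
    nodes.add(datanodeUuid);
  }

  /** Filters the locations handed to clients so pruned replicas are not returned. */
  public synchronized List<String> filterLocations(long blockId, List<String> locations) {
    Set<String> pruned = pendingInvalidations.get(blockId);
    if (pruned == null || pruned.isEmpty()) {
      return locations;
    }
    List<String> live = new ArrayList<String>();
    for (String datanodeUuid : locations) {
      if (!pruned.contains(datanodeUuid)) {
        live.add(datanodeUuid);
      }
    }
    return live;
  }

  /**
   * Called when a block report arrives from 'datanodeUuid'. The pending entry
   * is dropped either way: if the replica is gone, nothing more to do; if the
   * DataNode still reports it, the normal excess-replica logic will choose it
   * for pruning again.
   */
  public synchronized void onBlockReport(long blockId, String datanodeUuid) {
    Set<String> nodes = pendingInvalidations.get(blockId);
    if (nodes != null) {
      nodes.remove(datanodeUuid);
      if (nodes.isEmpty()) {
        pendingInvalidations.remove(blockId);
      }
    }
  }
}
{code}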