[ https://issues.apache.org/jira/browse/HDFS-5380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Sirianni updated HDFS-5380:
--------------------------------

    Attachment: ExcessReplicaPruningTest.java

JUnit test that demonstrates this issue using {{MiniDFSCluster}}

> NameNode returns stale block locations to clients during excess replica 
> pruning
> -------------------------------------------------------------------------------
>
>                 Key: HDFS-5380
>                 URL: https://issues.apache.org/jira/browse/HDFS-5380
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.0.0-alpha, 1.2.1
>            Reporter: Eric Sirianni
>            Priority: Minor
>         Attachments: ExcessReplicaPruningTest.java
>
>
> Consider the following contrived example:
> {code}
> // Step 1: Create file with replication factor = 2
> Path path = ...;
> short replication = 2;
> OutputStream os = fs.create(path, ..., replication, ...);
> // Step 2: Write to file
> os.write(...);
> // Step 3: Reduce replication factor to 1
> fs.setReplication(path, (short) 1);
> // Wait for the namenode to prune the excess replicas
> // Step 4: Read from file
> InputStream is = fs.open(path);
> is.read(...);
> {code}
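> For reference, a rough self-contained sketch of the same sequence against a 
> {{MiniDFSCluster}} (the attached {{ExcessReplicaPruningTest.java}} is the 
> authoritative reproduction; the path, sizes, and crude sleep below are 
> illustrative assumptions, and the 2.x {{MiniDFSCluster.Builder}} API is 
> assumed):
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hdfs.MiniDFSCluster;
> 
> public class StaleLocationRepro {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     // Two DataNodes so that a replication factor of 2 can be satisfied.
>     MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
>         .numDataNodes(2).build();
>     try {
>       cluster.waitActive();
>       FileSystem fs = cluster.getFileSystem();
>       Path path = new Path("/repro/foo");   // hypothetical test path
> 
>       // Steps 1 & 2: create the file with replication factor 2 and write.
>       FSDataOutputStream os = fs.create(path, (short) 2);
>       os.write(new byte[64 * 1024]);
>       os.close();
> 
>       // Step 3: reduce the replication factor to 1, then give the NameNode
>       // time to prune (and the DataNode time to delete) the excess replica.
>       fs.setReplication(path, (short) 1);
>       Thread.sleep(30 * 1000);   // crude wait; the attached test is more careful
> 
>       // Step 4: read the file back.  With the current behavior the client may
>       // be handed the location of the replica that was just invalidated.
>       FSDataInputStream is = fs.open(path);
>       is.readFully(new byte[64 * 1024]);
>       is.close();
>     } finally {
>       cluster.shutdown();
>     }
>   }
> }
> {code}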
> During the read in _Step 4_, the {{DFSInputStream}} client receives "stale" 
> block locations from the NameNode.  Specifically, it receives block locations 
> that the NameNode has already pruned/invalidated (and the DataNodes have 
> already deleted).
> The net effect of this is unnecessary churn in the {{DFSClient}} (timeouts, 
> retries, extra RPCs, etc.).  In particular:
> {noformat}
> WARN  hdfs.DFSClient - Failed to connect to datanode-1 for block, add to deadNodes and continue.
> {noformat}
> The blacklisting of DataNodes that are, in fact, functioning properly can 
> lead to inefficient locality of reads.  Since the blacklist is _cumulative_ 
> across all blocks in the file, this can have a noticeable impact for large 
> files.
> A pathological case can occur when *all* block locations are in the 
> blacklist.  In this case, the {{DFSInputStream}} clears the blacklist, sleeps, 
> and refetches locations from the NameNode, costing an unnecessary RPC and a 
> client-side delay:  
> {noformat}
> INFO  hdfs.DFSClient - Could not obtain blk_1073741826_1002 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
> {noformat}
> This pathological case can occur in the following example (for a read of file 
> {{foo}}):
> # {{DFSInputStream}} attempts to read block 1 of {{foo}}.
> # Gets locations: {{( dn1(stale), dn2 )}}
> # Attempts read from {{dn1}}.  Fails.  Adds {{dn1}} to blacklist; block 1 is 
> then read from {{dn2}}.
> # {{DFSInputStream}} attempts to read block 2 of {{foo}}.
> # Gets locations: {{( dn1, dn2(stale) )}}
> # Attempts read from {{dn2}} ({{dn1}} already blacklisted).  Fails.  Adds 
> {{dn2}} to blacklist.
> # All locations for block 2 are now in the blacklist.
> # Clears the blacklist
> # Sleeps up to 3 seconds
> # Refetches locations from the NameNode
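> To make the role of the cumulative blacklist concrete, the sequence above can 
> be sketched roughly as follows.  This is *not* the actual {{DFSInputStream}} 
> code; {{readBlock()}} and {{refetchLocations()}} are hypothetical stand-ins 
> for the real internals:
> {code}
> // Locations for all blocks of the file, as returned by the NameNode.
> List<LocatedBlock> blocks = refetchLocations();            // hypothetical NameNode RPC
> Set<DatanodeInfo> deadNodes = new HashSet<DatanodeInfo>(); // cumulative blacklist
> int i = 0;
> while (i < blocks.size()) {
>   LocatedBlock blk = blocks.get(i);
>   DatanodeInfo chosen = null;
>   for (DatanodeInfo dn : blk.getLocations()) {
>     if (!deadNodes.contains(dn)) {
>       chosen = dn;
>       break;
>     }
>   }
>   if (chosen == null) {
>     // Every location for this block is blacklisted: clear the blacklist,
>     // sleep, refetch locations from the NameNode (the extra RPC and
>     // client-side delay described above), and retry the same block.
>     deadNodes.clear();
>     Thread.sleep(3000);
>     blocks = refetchLocations();
>     continue;
>   }
>   try {
>     readBlock(blk, chosen);       // hypothetical data transfer
>     i++;                          // success: move on to the next block
>   } catch (IOException e) {
>     deadNodes.add(chosen);        // "add to deadNodes and continue"
>   }
> }
> {code}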
> A solution would be to change the NameNode to not return stale block 
> locations to clients for replicas that it knows it has asked DataNodes to 
> invalidate.
> A quick look at the {{BlockManager.chooseExcessReplicates()}} code path seems 
> to indicate that the NameNode does not actually remove the pruned replica 
> from the {{BlocksMap}} until the subsequent blockReport is received.  This can 
> leave a substantial window where the NameNode can return stale replica 
> locations to clients.  
> If the NameNode were to proactively update the {{BlocksMap}} upon excess 
> replica pruning, this situation could be avoided.  If the DataNode did not in 
> fact invalidate the replica as asked, the NameNode would simply re-add the 
> replica to the {{BlocksMap}} upon the next blockReport and go through the pruning 
> exercise again.
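> A rough sketch of where such a change might live follows.  This is not a 
> patch; the helper names are modeled on the existing {{BlockManager}} code and 
> the exact methods/signatures may differ:
> {code}
> // Called for each replica that chooseExcessReplicates() decides to prune.
> private void pruneExcessReplica(Block block, DatanodeDescriptor dn) {
>   // Existing behavior: record the excess replica and schedule its deletion
>   // on the DataNode.
>   addToExcessReplicate(dn, block);
>   addToInvalidates(block, dn);
> 
>   // Proposed addition: drop the location immediately so getBlockLocations()
>   // stops returning it to clients.  If the DataNode never actually deletes
>   // the replica, its next blockReport re-adds the location and the NameNode
>   // simply goes through the pruning exercise again.
>   blocksMap.removeNode(block, dn);
> }
> {code}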



--
This message was sent by Atlassian JIRA
(v6.1#6144)
