[ https://issues.apache.org/jira/browse/HDFS-5299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787413#comment-13787413 ]
Brandon Li commented on HDFS-5299:
----------------------------------

+1. Patch looks good.

> DFS client hangs in updatePipeline RPC when failover happened
> -------------------------------------------------------------
>
>                 Key: HDFS-5299
>                 URL: https://issues.apache.org/jira/browse/HDFS-5299
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.0.0, 2.1.0-beta
>            Reporter: Vinay
>            Assignee: Vinay
>            Priority: Blocker
>         Attachments: HDFS-5299.000.patch, HDFS-5299.patch
>
>
> The DFSClient hung in an updatePipeline call to the NameNode when a failover happened at exactly the same time.
> When we dug in, the issue turned out to be in the handling of the RetryCache in updatePipeline.
> Here are the steps:
> 1. The client was writing slowly.
> 2. One of the datanodes went down, and updatePipeline was called on the active NameNode (ANN).
> 3. The call reached the ANN, but the ANN was shut down while processing the updatePipeline call.
> 4. The client retried (since the API is marked AtMostOnce) against the other NameNode; at that time the NN was still in STANDBY, so the call got a StandbyException.
> 5. One more client failover happened.
> 6. The standby NN became active.
> 7. The client called the now-active NN again for updatePipeline.
> At this point the client call hung in the NN, waiting for the cached call with the same call id to finish. But that cached call had already finished the previous time, with a StandbyException.
> Conclusion:
> Whenever a new entry is added to the cache, we must record the result of the call before returning from the call or throwing an exception.
> I can see a similar issue in multiple RPCs in FSNamesystem.

-- This message was sent by Atlassian JIRA (v6.1#6144)
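The hang described above follows from a retry cache that registers an in-progress entry but never records its outcome on the failure path, so a retried call with the same call id waits forever. Below is a minimal, self-contained sketch of that pattern and the fix; `ToyRetryCache` and its method names are illustrative only, not Hadoop's actual RetryCache API.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative retry cache keyed by RPC call id. A duplicate (retried) call
// waits for the first attempt's result instead of re-executing the operation.
// The fix from the conclusion: on failure, record the outcome (and drop the
// entry so a later retry can re-execute) instead of leaving waiters blocked.
class ToyRetryCache {
    private final Map<Long, CompletableFuture<String>> cache = new ConcurrentHashMap<>();

    String invoke(long callId, Callable<String> rpc) throws Exception {
        CompletableFuture<String> mine = new CompletableFuture<>();
        CompletableFuture<String> prior = cache.putIfAbsent(callId, mine);
        if (prior != null) {
            // Retried call: return the first attempt's cached result.
            return prior.get();
        }
        try {
            String result = rpc.call();
            mine.complete(result);           // record success before returning
            return result;
        } catch (Exception e) {
            cache.remove(callId);            // let a later retry re-execute
            mine.completeExceptionally(e);   // wake any waiter already blocked
            throw e;                         // record failure before throwing
        }
    }
}
```

Without the `catch` block, the first attempt's StandbyException would leave an incomplete entry in the cache, and the retry against the newly active NN would block on `prior.get()` indefinitely, which is the hang reported here.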