[ https://issues.apache.org/jira/browse/HDFS-5299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788174#comment-13788174 ]
Hudson commented on HDFS-5299:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1571 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1571/])
HDFS-5299. DFS client hangs in updatePipeline RPC when failover happened. Contributed by Vinay. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1529660)
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNamenodeRetryCache.java

> DFS client hangs in updatePipeline RPC when failover happened
> -------------------------------------------------------------
>
>                 Key: HDFS-5299
>                 URL: https://issues.apache.org/jira/browse/HDFS-5299
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.0.0, 2.1.0-beta
>            Reporter: Vinay
>            Assignee: Vinay
>            Priority: Blocker
>             Fix For: 2.2.0
>
>         Attachments: HDFS-5299.000.patch, HDFS-5299.patch
>
>
> The DFSClient hung in an updatePipeline call to the namenode when a failover
> happened at exactly the same time.
> When we dug in, the issue turned out to be in how updatePipeline handles the
> RetryCache.
> Here are the steps:
> 1. The client was writing slowly.
> 2. One of the datanodes went down and updatePipeline was called on the ANN.
> 3. The call reached the ANN, which shut down while processing it.
> 4. The client retried (since the API is marked AtMostOnce) against the other
> NameNode. At that time that NN was still in STANDBY, so the call got a
> StandbyException.
> 5. The client failed over once more.
> 6. The SNN became Active.
> 7. The client called the current ANN again for updatePipeline.
> The client call now hung in the NN, waiting for the cached call with the same
> call ID to finish. But that cached call had already ended earlier with a
> StandbyException.
> Conclusion:
> Whenever a new entry is added to the cache, we need to update the result
> of the call before returning from the call or throwing an exception.
> I can see a similar issue in multiple RPCs in FSNamesystem.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
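
The failure mode and the conclusion in the quoted description can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `RetryCacheSketch`, `Entry`, `handle`, `becomeActive` and the `recordFailure` flag are hypothetical names invented for illustration and are not the Hadoop RetryCache API or the code in FSNamesystem. A bounded wait stands in for the indefinite hang so that the example terminates.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical, simplified model of a NameNode-side retry cache (not Hadoop code). */
final class RetryCacheSketch {

    /** One cached call. A retry waits on `done` until the first attempt records a result. */
    static final class Entry {
        final CountDownLatch done = new CountDownLatch(1);
        volatile boolean success;

        void complete(boolean ok) {
            success = ok;
            done.countDown();
        }
    }

    private final ConcurrentMap<Long, Entry> cache = new ConcurrentHashMap<>();
    private volatile boolean active = false; // the node starts in standby

    void becomeActive() {
        active = true;
    }

    /**
     * Handles one RPC attempt identified by callId.
     * recordFailure == false models the bug: the standby check fails after the
     * cache entry was added, but the entry is never marked as completed.
     * recordFailure == true models the fix: every exit path records a result.
     */
    String handle(long callId, boolean recordFailure, long waitMillis) {
        Entry mine = new Entry();
        Entry prior = cache.putIfAbsent(callId, mine);
        if (prior != null) {
            // A retry: wait for the earlier attempt with the same call ID.
            try {
                if (!prior.done.await(waitMillis, TimeUnit.MILLISECONDS)) {
                    return "HUNG"; // the earlier attempt never recorded a result
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return "INTERRUPTED";
            }
            if (prior.success) {
                return "OK (cached)";
            }
            // The earlier attempt failed, so this attempt re-executes the call.
            // (A real cache would need to handle concurrent retries here.)
            cache.put(callId, mine);
        }
        if (!recordFailure && !active) {
            return "STANDBY"; // bug: the entry stays incomplete forever
        }
        boolean ok = false;
        try {
            if (!active) {
                return "STANDBY"; // the finally block still records the failure
            }
            ok = true; // the actual operation would run here
            return "OK";
        } finally {
            mine.complete(ok);
        }
    }
}
```

In this model, calling `handle(7L, false, 50)` while in standby and then again after `becomeActive()` returns `"STANDBY"` followed by `"HUNG"`, which mirrors the reported hang. With `recordFailure` set to `true` the same sequence returns `"STANDBY"` followed by `"OK"`, because the failed attempt recorded its result before returning.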