[ https://issues.apache.org/jira/browse/HDFS-5299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786385#comment-13786385 ]
Uma Maheswara Rao G commented on HDFS-5299:
-------------------------------------------

Nit:
{code}
cluster = new MiniDFSCluster.Builder(new Configuration())
+        .nnTopology(MiniDFSNNTopology.simpleHATopology()).numDataNodes(1)
+        .build();
{code}
Please reuse the existing conf object; you need not create a new one for it.
Also, please add a short javadoc for the test.

{code}
CacheEntryWithPayload cacheEntry = RetryCache.waitForCompletion(retryCache, null);
if (cacheEntry != null && cacheEntry.isSuccess()) {
  return (String) cacheEntry.getPayload();
}
final FSPermissionChecker pc = getPermissionChecker();
{code}
Here, if getPermissionChecker throws an exception, can a similar situation occur for that call?
I think we will not retry on this exception, but waiting on the retry cache and setting its state should happen in the proper order to avoid situations like this.

> DFS client hangs in updatePipeline RPC when failover happened
> -------------------------------------------------------------
>
>                 Key: HDFS-5299
>                 URL: https://issues.apache.org/jira/browse/HDFS-5299
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.0.0, 2.1.0-beta
>            Reporter: Vinay
>            Assignee: Vinay
>            Priority: Blocker
>         Attachments: HDFS-5299.patch
>
>
> The DFSClient hung in the updatePipeline call to the namenode when a failover
> happened at exactly the same time.
> When we dug into it, the issue turned out to be in the handling of the RetryCache in
> updatePipeline.
> Here are the steps:
> 1. The client was writing slowly.
> 2. One of the datanodes went down and updatePipeline was called on the ANN.
> 3. The call reached the ANN, but the ANN was shut down while processing the updatePipeline call.
> 4. The client retried (since the API is marked AtMostOnce) against the other
> NameNode. At that time that NN was still in STANDBY, so the call got a StandbyException.
> 5. The client failed over one more time.
> 6. The SNN then became Active.
> 7. The client called the current ANN again for updatePipeline.
> Now the client call hung in the NN, waiting for the cached call with the same
> callid to finish. But that cached call had already finished earlier with a
> StandbyException.
> Conclusion:
> Whenever a new entry is added to the cache, we need to update the result
> of the call before returning from the call or throwing an exception.
> I can see a similar issue in multiple RPCs in FSNamesystem.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
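
The conclusion of the issue description (record the outcome of a call on every exit path once its cache entry exists) can be illustrated with a small, self-contained sketch. This is not Hadoop's RetryCache and not the HDFS-5299 patch: SketchRetryCache, SketchNameNode, the 200 ms timeout, and the use of IllegalStateException as a stand-in for StandbyException are all hypothetical names and choices made for the example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical, simplified stand-in for a retry cache; not Hadoop's RetryCache. */
class SketchRetryCache {
    static final class Entry {
        private final CountDownLatch done = new CountDownLatch(1);
        private volatile boolean success;

        /** Record the outcome and wake up any retry waiting on this entry. */
        void complete(boolean ok) {
            success = ok;
            done.countDown();
        }

        boolean isSuccess() {
            return success;
        }

        /** Returns false if no outcome was recorded within the timeout. */
        boolean awaitCompletion(long millis) throws InterruptedException {
            return done.await(millis, TimeUnit.MILLISECONDS);
        }
    }

    private final ConcurrentHashMap<Long, Entry> entries = new ConcurrentHashMap<>();

    /** Returns the entry of an earlier attempt, or null if this attempt is the first. */
    Entry registerOrGetExisting(long callId) {
        return entries.putIfAbsent(callId, new Entry());
    }

    Entry get(long callId) {
        return entries.get(callId);
    }
}

/** Hypothetical namenode stand-in with a single at-most-once operation. */
class SketchNameNode {
    final SketchRetryCache cache = new SketchRetryCache();
    volatile boolean active = false;

    String updatePipeline(long callId) {
        SketchRetryCache.Entry entry = cache.registerOrGetExisting(callId);
        if (entry != null) {
            // A retry: wait for the earlier attempt to record its outcome.
            try {
                if (!entry.awaitCompletion(200)) {
                    throw new IllegalStateException("earlier attempt never recorded its outcome");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
            if (entry.isSuccess()) {
                return "cached";
            }
            // The earlier attempt failed, so run the operation again on the same entry.
        } else {
            entry = cache.get(callId);
        }

        boolean ok = false;
        try {
            // Every check that can throw sits inside the try block.
            if (!active) {
                throw new IllegalStateException("standby"); // stand-in for StandbyException
            }
            ok = true;
            return "updated";
        } finally {
            // Runs on return and on exception, so a waiting retry is always released.
            entry.complete(ok);
        }
    }
}
```

In this sketch the standby rejection is thrown inside the try block, so the entry is marked as failed before the exception leaves the method, and a later retry with the same call id re-executes the operation instead of waiting for an outcome that never arrives. The same ordering applies to the getPermissionChecker concern raised in the comment: anything that can throw after the entry is registered has to be covered by the block that records the state.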