[ https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907694#comment-13907694 ]
Bikas Saha commented on YARN-1410:
----------------------------------

I am repeatedly asking for this because it's a problem that we will continue to face in other non-idempotent operations on different RM and NM protocols. We need to establish a consistent behavior that can be reused for all operations, instead of operation-specific workarounds that are brittle.

I spoke to [~sureshms] offline and he showed me the RetryCache helper class that is present in hadoop common, and also the usage of that class in a non-idempotent FSNamesystem.delete() RPC. Can you please take a look at that code? We do not have to bother about client-id/call-id. RetryCache is taking care of all that for us. The main thing we have to do is use the RetryCache methods properly and save the right information in the store, such that the RetryCache can be re-populated after restart if needed.

I am adding Suresh as a watcher to this jira. He has volunteered to help review and help understand the code. Suresh also mentioned that the AtMostOnce etc. annotations are supposed to be made on the RPC methods. The RetryCache kicks in only based on annotations on the protocol methods.

It would be good if we take some time and do this cleanly in a reusable manner once, so that work on the remaining APIs can be made easier. If we use specific workarounds then I am concerned that these may come back to bite us later on.

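To make the retry-cache idea above concrete, here is a minimal, self-contained Java sketch. It is an illustration only and does not use Hadoop's RetryCache: the class RetryCacheSketch, the CallKey type, and the methods submitApplication() and restore() are all hypothetical names invented for this example. The assumptions are that a retried call carries the same client-id/call-id pair as the original, that the recorded outcome of a completed call is replayed instead of running the operation again, and that the recorded outcomes are saved alongside the application state so the cache can be re-populated after a restart. Replaying failed outcomes as well as successful ones is a simplification made here.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Illustrative stand-in only: this is NOT org.apache.hadoop.ipc.RetryCache.
final class RetryCacheSketch {

  /** Hypothetical key: the same clientId + callId pair marks a retried call. */
  static final class CallKey {
    final String clientId;
    final int callId;

    CallKey(String clientId, int callId) {
      this.clientId = clientId;
      this.callId = callId;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof CallKey)) {
        return false;
      }
      CallKey other = (CallKey) o;
      return callId == other.callId && clientId.equals(other.clientId);
    }

    @Override
    public int hashCode() {
      return Objects.hash(clientId, callId);
    }
  }

  // Outcome of every call that has already completed.
  private final Map<CallKey, Boolean> completedCalls = new HashMap<>();
  // Stand-in for the state that the non-idempotent operation mutates.
  private final Set<String> submittedApps = new HashSet<>();
  // Counts how often the real operation ran (for demonstration only).
  private int executions = 0;

  /** Rebuilds an instance from saved state, as after a restart or failover. */
  static RetryCacheSketch restore(Map<CallKey, Boolean> savedCalls,
                                  Set<String> savedApps) {
    RetryCacheSketch rm = new RetryCacheSketch();
    rm.completedCalls.putAll(savedCalls);
    rm.submittedApps.addAll(savedApps);
    return rm;
  }

  /** Non-idempotent operation: submitting the same app twice fails. */
  synchronized boolean submitApplication(CallKey key, String appId) {
    Boolean previous = completedCalls.get(key);
    if (previous != null) {
      return previous; // retry of a completed call: replay the old outcome
    }
    boolean success = false;
    try {
      executions++;
      success = submittedApps.add(appId); // false if appId already exists
    } finally {
      completedCalls.put(key, success); // record the outcome for retries
    }
    return success;
  }

  synchronized Map<CallKey, Boolean> completedCalls() {
    return new HashMap<>(completedCalls);
  }

  synchronized Set<String> submittedApps() {
    return new HashSet<>(submittedApps);
  }

  synchronized int executions() {
    return executions;
  }
}
```

In this sketch, calling submitApplication() twice with the same CallKey runs the operation once and returns the same result both times, while a call with a new CallKey for an already-submitted app is rejected as a duplicate. An instance rebuilt through restore() from the saved calls and apps still answers the original retry without executing anything.
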
> Handle client failover during 2 step client API's like app submission
> ---------------------------------------------------------------------
>
>                 Key: YARN-1410
>                 URL: https://issues.apache.org/jira/browse/YARN-1410
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Xuan Gong
>         Attachments: YARN-1410-outline.patch, YARN-1410.1.patch, YARN-1410.2.patch, YARN-1410.2.patch, YARN-1410.3.patch, YARN-1410.4.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> App submission involves
> 1) creating appId
> 2) using that appId to submit an ApplicationSubmissionContext to the user.
> The client may have obtained an appId from an RM, the RM may have failed over, and the client may submit the app to the new RM.
> Since the new RM has a different notion of cluster timestamp (used to create app id) the new RM may reject the app submission resulting in unexpected failure on the client side.
> The same may happen for other 2 step client API operations.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)