[ 
https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907694#comment-13907694
 ] 

Bikas Saha commented on YARN-1410:
----------------------------------

I am repeatedly asking for this because it's a problem that we will continue to 
face in other non-idempotent operations on different RM and NM protocols. We 
need to establish a consistent behavior that can be reused for all operations 
instead of operation-specific workarounds that are brittle.

I spoke to [~sureshms] offline and he showed me the RetryCache helper class that 
is present in Hadoop Common, and also the usage of that class in a 
non-idempotent RPC, FSNamesystem.delete(). Can you please take a look at that 
code? We do not have to bother about client-id/call-id; RetryCache takes care 
of all that for us. The main thing we have to do is use the RetryCache methods 
properly and save the right information in the store so that the RetryCache 
can be re-populated after restart if needed. I am adding Suresh as a watcher 
to this jira. He has volunteered to help review and help understand the code.
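
To make the idea concrete, here is a rough, self-contained sketch of the at-most-once pattern described above. It is not the Hadoop RetryCache API: RetryCacheSketch, CallKey, executeOnce, snapshot and restore are hypothetical names made up for illustration, and the real class derives the client-id/call-id from the RPC layer rather than taking them as arguments.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical stand-in for a retry cache; NOT the Hadoop RetryCache API.
final class RetryCacheSketch {

    // Identifies one logical RPC. A retried call carries the same key.
    static final class CallKey {
        final String clientId;
        final int callId;

        CallKey(String clientId, int callId) {
            this.clientId = clientId;
            this.callId = callId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof CallKey)) {
                return false;
            }
            CallKey other = (CallKey) o;
            return callId == other.callId && clientId.equals(other.clientId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(clientId, callId);
        }
    }

    // Keys of calls that have already completed successfully.
    private final Set<CallKey> succeeded = new HashSet<>();

    // Runs op unless this key already succeeded; a retry then just reports success.
    synchronized boolean executeOnce(CallKey key, Supplier<Boolean> op) {
        if (succeeded.contains(key)) {
            return true; // previous response is replayed, op is not run again
        }
        boolean ok = op.get();
        if (ok) {
            succeeded.add(key);
        }
        return ok;
    }

    // Copy of the state that would be written to the store.
    synchronized Set<CallKey> snapshot() {
        return new HashSet<>(succeeded);
    }

    // Re-populates the cache after a restart from previously saved state.
    synchronized void restore(Set<CallKey> saved) {
        succeeded.addAll(saved);
    }
}
```

The snapshot/restore pair is the part that matters for failover: whatever is saved in the store has to be enough for the new RM to recognize a retried call as already done.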

Suresh also mentioned that the AtMostOnce etc. annotations are supposed to be 
placed on the RPC methods. The RetryCache kicks in only based on annotations 
on the protocol methods.
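
As a rough illustration of what an annotation on a protocol method looks like and how it can be read back, here is a self-contained sketch. Everything in it is a stand-in: AtMostOnceMarker, SubmissionProtocolSketch and needsRetryCache are hypothetical names defined locally so the example compiles on its own, and it makes no claim about how Hadoop itself consumes its annotations.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Local stand-in marker (hypothetical), retained at runtime so reflection can see it.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface AtMostOnceMarker {}

// Hypothetical protocol interface: the annotation sits on the RPC method itself.
interface SubmissionProtocolSketch {
    @AtMostOnceMarker
    boolean submitApplication(String appId);

    // Read-only call, left unannotated in this sketch.
    String getApplicationReport(String appId);
}

final class AnnotationCheckSketch {
    // Returns true if the first protocol method with this name carries the marker.
    static boolean needsRetryCache(Class<?> protocol, String methodName) {
        for (Method m : protocol.getMethods()) {
            if (m.getName().equals(methodName)) {
                return m.isAnnotationPresent(AtMostOnceMarker.class);
            }
        }
        return false;
    }
}
```

The point of the sketch is only that the marker lives on the protocol interface, not on the server-side implementation, so that is where it has to be declared.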

It would be good if we take some time and do this cleanly in a reusable 
manner once so that work on the remaining APIs can be made easier. If we use 
specific workarounds then I am concerned that these may come back to bite us 
later on.

> Handle client failover during 2 step client API's like app submission
> ---------------------------------------------------------------------
>
>                 Key: YARN-1410
>                 URL: https://issues.apache.org/jira/browse/YARN-1410
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Xuan Gong
>         Attachments: YARN-1410-outline.patch, YARN-1410.1.patch, 
> YARN-1410.2.patch, YARN-1410.2.patch, YARN-1410.3.patch, YARN-1410.4.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> App submission involves
> 1) creating appId
> 2) using that appId to submit an ApplicationSubmissionContext to the RM.
> The client may have obtained an appId from an RM, the RM may have failed 
> over, and the client may submit the app to the new RM.
> Since the new RM has a different notion of cluster timestamp (used to create 
> app id) the new RM may reject the app submission resulting in unexpected 
> failure on the client side.
> The same may happen for other 2 step client API operations.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
