[ 
https://issues.apache.org/jira/browse/RATIS-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16285515#comment-16285515
 ] 

Jing Zhao commented on RATIS-160:
---------------------------------

bq. Actually, the bug occurs when the new leader creates a new log entry (say e2) 
for an entry that is already in the log (e1) and then tries to commit it. It is 
able to create the new log entry because the retry cache entry for e1 is marked 
as failed. 

Yes, I understand how we hit the exception. But the real cause of the bug is 
that when s1 becomes the leader again (step 5 of the scenario you described), 
it should correctly replace a failed retry cache entry when applying log 
entries to its state machine. If we fix that part, the retry cache works as 
expected when the client's retry request comes in.
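
As a rough illustration of that idea, here is a minimal sketch of an apply path that replaces a failed entry instead of asserting that it is pending. All names here (ApplyPathSketch, Entry, onLogEntryApplied, the String key) are hypothetical and much simpler than the real RetryCache code. It only shows the intended behavior under that assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model; names do not match the real Ratis classes.
class ApplyPathSketch {
  enum State { PENDING, SUCCEEDED, FAILED }

  static final class Entry {
    State state = State.PENDING;
  }

  // Keyed by an assumed "clientId:callId" string.
  private final Map<String, Entry> cache = new HashMap<>();

  // Called when a client request is first received.
  synchronized Entry createPending(String key) {
    final Entry e = new Entry();
    cache.put(key, e);
    return e;
  }

  // Called when a committed log entry is applied to the state machine.
  // A missing or failed entry is replaced by a fresh one, which is then
  // marked as succeeded; a pending entry is completed in place.
  synchronized Entry onLogEntryApplied(String key) {
    Entry e = cache.get(key);
    if (e == null || e.state == State.FAILED) {
      e = new Entry();
      cache.put(key, e);
    }
    e.state = State.SUCCEEDED;
    return e;
  }

  synchronized Entry get(String key) {
    return cache.get(key);
  }
}
```

With this shape, a later client retry for the same key finds a successful entry rather than a stale failed one.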

Let me try to explain why your current fix is not good from another 
perspective. In your patch, the main usage of the new "entryInLog" is:
{code}
        } else if (!cacheEntry.isDone() || !cacheEntry.isFailed() || cacheEntry.entryInLog) {
          // the previous attempt is either pending or successful
          return new CacheQueryResult(cacheEntry, true);
        }
{code}
This means the code also returns a cache query result when the entry is 
1) done, 2) failed, and 3) has entryInLog set to true. This is wrong. In this 
scenario, we should NOT return the query result. Instead, we should replace 
the original retry cache entry with a new pending one.
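
To make the expected query behavior concrete, here is a minimal sketch in which a done-and-failed entry is never returned as a retry hit and is instead replaced by a new pending entry. The class and method names (RetryCacheQuerySketch, Entry, QueryResult, query) are hypothetical stand-ins, not the actual Ratis API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a retry cache; not the Ratis classes.
class RetryCacheQuerySketch {
  static final class Entry {
    private boolean done;
    private boolean failed;

    boolean isDone() { return done; }
    boolean isFailed() { return failed; }
    void succeed() { done = true; }
    void fail() { failed = true; done = true; }
  }

  static final class QueryResult {
    final Entry entry;
    final boolean isRetry;

    QueryResult(Entry entry, boolean isRetry) {
      this.entry = entry;
      this.isRetry = isRetry;
    }
  }

  // Keyed by an assumed "clientId:callId" string.
  private final Map<String, Entry> cache = new HashMap<>();

  synchronized QueryResult query(String key) {
    final Entry existing = cache.get(key);
    if (existing != null && (!existing.isDone() || !existing.isFailed())) {
      // The previous attempt is either pending or successful.
      return new QueryResult(existing, true);
    }
    // No entry yet, or the previous attempt failed:
    // install a new pending entry and treat this as a fresh attempt.
    final Entry fresh = new Entry();
    cache.put(key, fresh);
    return new QueryResult(fresh, false);
  }
}
```

Note that the condition has no extra flag such as entryInLog: in this sketch a failed entry is always replaced.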

This is a tricky issue. I will be happy to discuss it offline through Google 
Hangouts.

> Retry cache should handle leader change after log commit
> --------------------------------------------------------
>
>                 Key: RATIS-160
>                 URL: https://issues.apache.org/jira/browse/RATIS-160
>             Project: Ratis
>          Issue Type: Bug
>            Reporter: Lokesh Jain
>            Assignee: Lokesh Jain
>         Attachments: RATIS-160.001.patch
>
>
> This jira is in relation to the below exception seen in the logs. 
> {code:java}
> java.lang.IllegalStateException: retry cache entry should be pending: client-89341C13-2136-4EF3-BD8A-73C2526B7703:1777:done
>         at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:60)
>         at org.apache.ratis.server.impl.RetryCache.getOrCreateEntry(RetryCache.java:169)
>         at org.apache.ratis.server.impl.RaftServerImpl.replyPendingRequest(RaftServerImpl.java:915)
>         at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:974)
>         at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151)
>         at java.lang.Thread.run(Thread.java:748)
> Exception in thread "StateMachineUpdater-s3" org.apache.ratis.util.ExitUtils$ExitException: StateMachineUpdater-s3: the StateMachineUpdater hits Throwable
>         at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:94)
>         at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:175)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalStateException: retry cache entry should be pending: client-89341C13-2136-4EF3-BD8A-73C2526B7703:1777:done
>         at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:60)
>         at org.apache.ratis.server.impl.RetryCache.getOrCreateEntry(RetryCache.java:169)
>         at org.apache.ratis.server.impl.RaftServerImpl.replyPendingRequest(RaftServerImpl.java:915)
>         at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:974)
>         at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151)
>         ... 1 more
> {code}
> This occurs when the leader commits a log entry but is not able to send a 
> reply to the client before the leader changes. When the new leader gets the 
> request, it sends the append entry request to the followers, whose caches 
> already have the corresponding entry, leading to the above exception.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
