[jira] [Commented] (GEODE-697) A client thread timing out an operation and performing further operations can result in cache inconsistency

2017-02-13 Thread Hitesh Khamesra (JIRA)

[ 
https://issues.apache.org/jira/browse/GEODE-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864856#comment-15864856
 ] 

Hitesh Khamesra commented on GEODE-697:
---

I think this looks like a reasonable solution, [~dschneider].

> A client thread timing out an operation and performing further operations can 
> result in cache inconsistency
> ---
>
> Key: GEODE-697
> URL: https://issues.apache.org/jira/browse/GEODE-697
> Project: Geode
> Issue Type: Bug
> Components: client/server
> Reporter: Dan Smith
>
> There is a case where the primary and secondary buckets of a partitioned 
> region can become out of sync if a client times out while waiting for a slow 
> operation to finish. Here's the scenario:
> 1. An operation is started by the client and gets stuck on the server, for 
> example by a slow cache writer. That operation is assigned an EventID with a 
> sequence number of 1.
> 2. The client times out.
> 3. The client performs a second operation. That operation gets assigned an 
> EventID with a sequence number of 2.
> 4. The second operation is applied on all members. The EventTracker records 
> the sequence number 2.
> 5. The original operation continues. It is applied to the primary (because it 
> has passed the EventTracker test).
> 6. The original operation is rejected by the EventTracker on the secondary. 
> The two copies of the bucket are now inconsistent.
> One possible fix is to change the thread id of the thread on the client when 
> the client operation times out. That would ensure that the EventTracker will 
> not reject the original operation when it finally goes through, because it 
> has a different thread id.
> If an operation is delayed on the server, for example by a very slow cache 
> writer, the operation can time out on the client.
> The client can then go on and perform a second operation.
> The problem is that each operation is assigned an event id which is a 
> combination of the client's thread id and a sequence number. That second 
> operation has a higher sequence number.
> Once the second operation is applied to a region on a given member, the event 
> is stored in the EventTracker and that member will reject any lower sequence 
> numbers.
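
The scenario in the description can be replayed with a small model. This is a minimal sketch, not Geode's actual code: `SimpleEventTracker`, `Bucket`, and their methods are hypothetical names, and the tracker is reduced to "remember the highest sequence number per thread id and reject anything that is not higher".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the server-side EventTracker:
// it remembers the highest sequence number seen for each client thread id.
class SimpleEventTracker {
    private final Map<Long, Long> highestSeen = new HashMap<>();

    // True if the event is newer than anything recorded for this thread id.
    boolean isNew(long threadId, long sequence) {
        Long last = highestSeen.get(threadId);
        return last == null || sequence > last;
    }

    void record(long threadId, long sequence) {
        highestSeen.merge(threadId, sequence, Math::max);
    }
}

// Hypothetical model of one copy (primary or secondary) of a bucket.
class Bucket {
    final SimpleEventTracker tracker = new SimpleEventTracker();
    final Map<String, String> data = new HashMap<>();

    // Applies the write only if the tracker accepts it; returns whether it was applied.
    boolean apply(long threadId, long sequence, String key, String value) {
        if (!tracker.isNew(threadId, sequence)) {
            return false;
        }
        applyUnchecked(threadId, sequence, key, value);
        return true;
    }

    // Applies a write that already passed the tracker check earlier.
    void applyUnchecked(long threadId, long sequence, String key, String value) {
        data.put(key, value);
        tracker.record(threadId, sequence);
    }
}

class InconsistencyDemo {
    // Replays steps 1-6 and returns {primary value, secondary value}.
    static String[] replay() {
        Bucket primary = new Bucket();
        Bucket secondary = new Bucket();
        long threadId = 7L;

        // Step 1: operation 1 passes the tracker check on the primary, then stalls.
        boolean op1Passed = primary.tracker.isNew(threadId, 1);
        // Steps 2-4: the client times out; operation 2 is applied on all members.
        primary.apply(threadId, 2, "k", "second");
        secondary.apply(threadId, 2, "k", "second");
        // Step 5: operation 1 resumes on the primary, already past the check.
        if (op1Passed) {
            primary.applyUnchecked(threadId, 1, "k", "first");
        }
        // Step 6: the secondary's tracker rejects sequence number 1.
        secondary.apply(threadId, 1, "k", "first");
        return new String[] { primary.data.get("k"), secondary.data.get("k") };
    }

    public static void main(String[] args) {
        String[] result = replay();
        System.out.println("primary=" + result[0] + " secondary=" + result[1]);
    }
}
```

In this model the primary ends up holding `first` while the secondary still holds `second`, which is the divergence described in step 6.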



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (GEODE-697) A client thread timing out an operation and performing further operations can result in cache inconsistency

2017-02-13 Thread Darrel Schneider (JIRA)

[ 
https://issues.apache.org/jira/browse/GEODE-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864748#comment-15864748
 ] 

Darrel Schneider commented on GEODE-697:


Anil and I both think the solution is this: when a client gives up on an 
operation that has timed out and is not going to retry it, the client must 
change the thread id that the server EventTracker uses. I think the client 
adds this to every message it sends by getting it from a thread id, so it 
should be pretty easy to modify it in the exceptional case of giving up on an 
in-progress operation. This was the solution that Dan suggested in the 
original comment.

I think we only need the EventTracker because clients may retry the same 
operation, and we do not want to apply duplicates on the server side when a 
client has retried. So the client only needs to keep the same EventTracker id 
if it may still retry an operation that may still be in progress.
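
A rough sketch of that idea, under stated assumptions: `ClientEventSource`, `abandonInFlightOperation`, and `PerThreadTracker` are hypothetical names invented for illustration, and the real client and EventTracker are more involved. The point is only that once the client swaps its thread id, the late operation and the newer one no longer share a sequence space.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical client-side identity attached to every message sent by one thread.
class ClientEventSource {
    private static final AtomicLong NEXT_THREAD_ID = new AtomicLong(1);
    private long threadId = NEXT_THREAD_ID.getAndIncrement();
    private long sequence = 0;

    // Returns {threadId, sequenceNumber} for the next operation.
    long[] nextEventId() {
        return new long[] { threadId, ++sequence };
    }

    // Called only when the client gives up on a timed-out operation
    // and will not retry it.
    void abandonInFlightOperation() {
        threadId = NEXT_THREAD_ID.getAndIncrement();
        sequence = 0;
    }
}

// Hypothetical tracker: highest sequence number accepted per thread id.
class PerThreadTracker {
    private final Map<Long, Long> highest = new HashMap<>();

    boolean accept(long[] eventId) {
        Long last = highest.get(eventId[0]);
        if (last != null && eventId[1] <= last) {
            return false;
        }
        highest.put(eventId[0], eventId[1]);
        return true;
    }
}

class ThreadIdSwapDemo {
    // Returns true if the secondary accepts the late first operation.
    static boolean lateOperationAccepted() {
        ClientEventSource client = new ClientEventSource();
        PerThreadTracker secondary = new PerThreadTracker();

        long[] first = client.nextEventId();   // stalls on the server
        client.abandonInFlightOperation();     // client times out, will not retry
        long[] second = client.nextEventId();  // new thread id, sequence restarts

        secondary.accept(second);              // second operation arrives first
        return secondary.accept(first);        // late operation has a different thread id
    }

    public static void main(String[] args) {
        System.out.println(lateOperationAccepted());
    }
}
```

The sketch only shows that the tracker stops rejecting the late operation on the secondary. It does not address which of the two writes should win.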

> A client thread timing out an operation and performing further operations can 
> result in cache inconsistency
> ---
>
> Key: GEODE-697
> URL: https://issues.apache.org/jira/browse/GEODE-697
> Project: Geode
> Issue Type: Bug
> Components: client/server
> Reporter: Dan Smith
> Assignee: Darrel Schneider





[jira] [Commented] (GEODE-697) A client thread timing out an operation and performing further operations can result in cache inconsistency

2017-01-16 Thread Udo Kohlmeyer (JIRA)

[ 
https://issues.apache.org/jira/browse/GEODE-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15824466#comment-15824466
 ] 

Udo Kohlmeyer commented on GEODE-697:
-

I agree with [~hitesh.khamesra] that the most important piece is to keep the 
cache in a consistent state.

[~bschuchardt] I think that [~hitesh.khamesra] has a point in that we should 
accept any event from the primary bucket. In a partitioned region all CUD 
operations are handled through the primary, which means that if multiple 
clients were to make changes to the same key, they would be ordered (and 
blocked) by the primary node. This functionality should also be checked 
against replicate regions to see how they would be affected.

I agree with [~bschuchardt] regarding "clients should not timeout". I 
believe that the timeout should be honored by the server. That way, if an 
operation has not completed within the timeout period, the server can, if 
possible, cancel the operation and return an "OperationTimeoutException" to 
the client. I've created GEODE-2304 to track this.

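A server-honored timeout could look roughly like the sketch below. Everything here is an assumption for illustration: `OperationTimeoutException` is only the name proposed above and is declared locally, `ServerSideTimeout` and `runWithTimeout` are hypothetical, and the sketch uses plain `java.util.concurrent` rather than anything in Geode.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical exception; the name comes from the comment above, not from an existing API.
class OperationTimeoutException extends Exception {
    OperationTimeoutException(String message) {
        super(message);
    }
}

class ServerSideTimeout {
    // Runs the operation on the executor and cancels it if it exceeds the timeout.
    static <T> T runWithTimeout(ExecutorService executor, Callable<T> operation, long timeoutMillis)
            throws OperationTimeoutException, ExecutionException, InterruptedException {
        Future<T> future = executor.submit(operation);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; this only helps if the operation honors interruption.
            future.cancel(true);
            throw new OperationTimeoutException("operation exceeded " + timeoutMillis + " ms");
        }
    }

    // Returns true if a slow operation (standing in for a slow cache writer) is reported as timed out.
    static boolean slowOperationTimesOut() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            runWithTimeout(executor, () -> { Thread.sleep(5_000); return "done"; }, 100);
            return false;
        } catch (OperationTimeoutException e) {
            return true;
        } catch (ExecutionException | InterruptedException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(slowOperationTimesOut());
    }
}
```

The "if possible" caveat matters: cancellation by interrupt cannot undo work that has already been applied, so a real implementation would need to decide at which points an operation may still be abandoned safely.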


> A client thread timing out an operation and performing further operations can 
> result in cache inconsistency
> ---
>
> Key: GEODE-697
> URL: https://issues.apache.org/jira/browse/GEODE-697
> Project: Geode
> Issue Type: Bug
> Components: regions
> Reporter: Dan Smith
> Assignee: Bruce Schuchardt


