Hi,

We have isolated what we think is a synchronization issue during data 
gravitation across multiple nodes using buddy replication. We have a unit test 
demonstrating the issue, which I can send to anyone interested.

What appears to happen is this: when two nodes are involved in a data 
gravitation, multiple data gravitation cleanup commands are sometimes issued, 
and one of them blocks the other. The calling node then times out after 
"buddyCommunicationTimeout" milliseconds (a timeout which is only logged at 
debug level?) and returns null, making it look like the requested data does not 
exist in the cache. Further investigation reveals two global transactions on 
the cache holding the data: one holds an identity (write) lock on a backup data 
node and the other is waiting to lock the same node.
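
To make the symptom concrete, here is a minimal sketch in plain 
java.util.concurrent. It is not JBoss Cache code: NODE_LOCK, readWithTimeout and 
contendedRead are hypothetical names I made up to stand in for the backup node 
lock and the gravitation request. It only shows how a lock wait that times out 
and returns null is indistinguishable, for the caller, from missing data.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockTimeoutSketch {

    // Hypothetical stand-in for the write lock on the backup data node.
    static final ReentrantReadWriteLock NODE_LOCK = new ReentrantReadWriteLock();

    /** Returns the data if the lock is acquired in time, otherwise null. */
    static String readWithTimeout(long timeoutMillis) {
        boolean locked;
        try {
            locked = NODE_LOCK.writeLock().tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
        if (!locked) {
            return null; // timed out: looks exactly like "no such data"
        }
        try {
            return "value";
        } finally {
            NODE_LOCK.writeLock().unlock();
        }
    }

    /** Issues one request while a second thread holds the write lock. */
    static String contendedRead(long timeoutMillis) {
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            NODE_LOCK.writeLock().lock();
            try {
                held.countDown();   // signal that the lock is now taken
                release.await();    // keep it until the request has finished
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                NODE_LOCK.writeLock().unlock();
            }
        });
        holder.start();
        try {
            held.await();
            return readWithTimeout(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } finally {
            release.countDown();
            try {
                holder.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("contended:   " + contendedRead(100));
        System.out.println("uncontended: " + readWithTimeout(100));
    }
}
```

In the sketch the caller cannot tell a timeout from a miss, which is why I 
would expect the real timeout to be logged at something louder than debug.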

Depending on "LockAcquisitionTimeout" the blocked request may continue, but we 
have seen several different outcomes depending on whether a user transaction 
is involved or not. Sometimes the lock seems to disappear; sometimes it does 
not, and the application is in effect completely blocked (as the JGroups 
thread will be holding a lock on a NakReceiverWindow).

This behavior only occurs (as far as we've seen) when there's quite a bit of 
concurrent access, in particular when data is added to one node but accessed 
immediately on another, i.e. when addition and gravitation occur back to back. 
We have tried disabling auto gravitation and generally playing around with the 
configuration, but with no effect.

This seems like a synchronization issue, and the unit test I can send along 
also shows that it is intermittent: sometimes the test will pass only to fail 
the next time, and sometimes a particular test fails when run standalone only 
to succeed if another test was run immediately before, making it look like 
burn-in affects the result (which of course it may).

Our unit test tries to model high concurrency in three variations: one with a 
simple data gravitation, one with gravitation followed by a modification (a 
subsequent put on the cache), and one with gravitation and modification within 
a user transaction. The last of these scenarios is basically what our real 
application is doing.
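
For anyone who wants the shape of the test without the attachment, the 
following is a rough sketch of the access pattern only. FakeNode is a 
hypothetical in-memory stand-in, not the JBoss Cache API, and it will not 
reproduce the bug; it just shows "add on one node, read immediately on the 
other" for the first two variations. The user transaction variation is left 
out here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class GravitationLoadSketch {

    /** Hypothetical stand-in for one cache node; not the real cache API. */
    static final class FakeNode {
        final Map<String, String> data = new ConcurrentHashMap<>();
        FakeNode peer;

        void put(String key, String value) {
            data.put(key, value);
        }

        /** Local lookup first; on a miss, move ("gravitate") the entry from the peer. */
        String get(String key) {
            String local = data.get(key);
            if (local != null) {
                return local;
            }
            String moved = peer.data.remove(key); // entry leaves its old owner
            if (moved != null) {
                data.put(key, moved);
            }
            return moved;
        }
    }

    /** Adds each key on node A, reads it at once on node B, and counts null reads. */
    static int run(int threads, int keysPerThread, boolean modifyAfterGet) {
        FakeNode a = new FakeNode();
        FakeNode b = new FakeNode();
        a.peer = b;
        b.peer = a;
        AtomicInteger nullReads = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.execute(() -> {
                for (int i = 0; i < keysPerThread; i++) {
                    String key = "/test/" + id + "/" + i; // distinct key per worker
                    a.put(key, "v");
                    String seen = b.get(key);
                    if (seen == null) {
                        nullReads.incrementAndGet();
                    } else if (modifyAfterGet) {
                        b.put(key, seen + "-modified"); // second variation
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return nullReads.get();
    }

    public static void main(String[] args) {
        System.out.println("null reads, get only:     " + run(8, 100, false));
        System.out.println("null reads, get then put: " + run(8, 100, true));
    }
}
```

A null read is the failure condition: every key was added before it was 
requested, so in the real test a null means the gravitation was lost or timed 
out.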

I've spent considerable time looking at this, so feel free to ask questions. 
Also, tell me where to send the unit test if you want to have a look at it. The 
unit test fails repeatedly on a Java 5 / Linux / dual-core / cache 2.5.0.GA 
setup.

Regards
/Lars J. Nilsson
www.cubeia.com

View the original post : 
http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4089200#4089200
