[ https://issues.apache.org/jira/browse/GEODE-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312081#comment-16312081 ]
ASF subversion and git services commented on GEODE-4096:
--------------------------------------------------------

Commit 6c37ff4a2a4fc6d609626b172793e3a123f9be82 in geode's branch refs/heads/develop from [~nnag]
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=6c37ff4 ]

GEODE-4096: Fixed race condition for connection global variable

	* Information on how the race condition occurs is provided in the GEODE-4096 ticket.
	* Before returning null and clearing the global variable connection, getConnection now calls stop on the dispatcher.
	* This makes sure that the AckReaderThreads for the dispatcher are shut down and prevents lingering threads from holding the connection life cycle lock.

> Race Condition between ConcurrentSerialGatewaySenderEventProcessor stopper
> thread and the _dispatchBatch method for the connection global variable.
> ---------------------------------------------------------------------------
>
>                 Key: GEODE-4096
>                 URL: https://issues.apache.org/jira/browse/GEODE-4096
>             Project: Geode
>          Issue Type: Bug
>          Components: wan
>            Reporter: nabarun
>            Assignee: nabarun
>
> *+Order of execution for this race condition to occur+*
> # _dispatchBatch tries to dispatch a batch of events but, for some reason,
> fails.
> # It silently assumes that the remote server may not be ready, so it wants
> to retry.
> # At the same time, we decide to stop the
> SerialGatewaySenderEventProcessor, so we call the stopper thread.
> # Before the threads are started on all the senders / dispatchers, it sets
> the isStopped flag for the SerialGatewaySenderEventProcessor to true.
> # The _dispatchBatch method, still in retry mode, then calls getConnection
> to get the connection. This method checks the
> SerialGatewaySenderEventProcessor's isStopped flag, sees that the flag is
> set, and returns null.
> # This null is stored in the global variable connection for the dispatcher.
> # Now that _dispatchBatch sees that the connection is null, it should raise
> an exception and call destroyConnection.
> # Meanwhile, an AckReaderThread is running and the stopper thread for the
> event processor wants to stop it, but the connection global variable has
> already been set to null by the getConnection call from _dispatchBatch.
> # Hence shutDownAckReaderThreadConnection is executed on null, and the
> AckReaderThread keeps running, stuck on socketRead0.
> # The problem is that the AckReaderThread acquires the
> connectionLifeCycleLock.readLock to readAcknowledgement, while the
> destroyConnection calls from the stopper thread and from _dispatchBatch's
> exception handling code need the connectionLifeCycleLock.writeLock. They
> cannot acquire it because the readLock is held by the AckReaderThread,
> causing a deadlock.
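
The last two steps, and the reason the fix stops the dispatcher first, can be illustrated with a minimal standalone sketch. This is not the Geode code: the class LifeCycleLockSketch and its methods are hypothetical names, Thread.sleep stands in for the blocking socketRead0 call, and a plain ReentrantReadWriteLock stands in for the connection life cycle lock. It only shows that a writer cannot get the write lock while a blocked reader still holds the read lock, and that it can once the reader has been stopped.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch; none of these names come from the Geode code base.
class LifeCycleLockSketch {
  // Stand-in for the dispatcher's connection life cycle lock.
  private final ReentrantReadWriteLock connectionLifeCycleLock = new ReentrantReadWriteLock();
  private final CountDownLatch readerHoldsLock = new CountDownLatch(1);
  private Thread ackReader;

  // Starts a reader that takes the read lock and then blocks, like an
  // ack reader waiting on the socket for an acknowledgement.
  void startAckReader() throws InterruptedException {
    ackReader = new Thread(() -> {
      connectionLifeCycleLock.readLock().lock();
      try {
        readerHoldsLock.countDown();
        // Stand-in for the blocking socket read. A real blocking socket
        // read typically needs the socket closed rather than an interrupt.
        Thread.sleep(Long.MAX_VALUE);
      } catch (InterruptedException e) {
        // Stop was requested; fall through and release the read lock.
      } finally {
        connectionLifeCycleLock.readLock().unlock();
      }
    });
    ackReader.setDaemon(true);
    ackReader.start();
    readerHoldsLock.await(); // wait until the read lock is actually held
  }

  // Stops the reader and waits for it to release the read lock.
  void stopAckReader() throws InterruptedException {
    ackReader.interrupt();
    ackReader.join();
  }

  // Stand-in for destroyConnection: needs the write lock. A timed tryLock
  // is used so the sketch reports failure instead of hanging forever.
  boolean tryDestroyConnection() throws InterruptedException {
    if (connectionLifeCycleLock.writeLock().tryLock(200, TimeUnit.MILLISECONDS)) {
      try {
        return true; // the connection would be destroyed here
      } finally {
        connectionLifeCycleLock.writeLock().unlock();
      }
    }
    return false;
  }

  public static void main(String[] args) throws Exception {
    LifeCycleLockSketch sketch = new LifeCycleLockSketch();
    sketch.startAckReader();
    // Reader was never stopped: the write lock is unavailable.
    System.out.println("destroy while reader runs: " + sketch.tryDestroyConnection());
    // Reader stopped first, as the fix intends: the write lock is available.
    sketch.stopAckReader();
    System.out.println("destroy after reader stopped: " + sketch.tryDestroyConnection());
  }
}
```

In the sketch the first attempt fails and the second succeeds, which mirrors why the commit makes getConnection call stop on the dispatcher before the connection reference is cleared.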