Guus der Kinderen created DIRMINA-1107:
------------------------------------------

             Summary: SslHandler flushScheduledEvents race condition, redux
                 Key: DIRMINA-1107
                 URL: https://issues.apache.org/jira/browse/DIRMINA-1107
             Project: MINA
          Issue Type: Bug
            Reporter: Guus der Kinderen


DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally 
replaces it with another multithreading issue.

The fix for DIRMINA-1019 introduces a counter that contains the number of 
events to be processed. A simplified version of the code is included below.
{code:java}
private final AtomicInteger scheduledEvents = new AtomicInteger(0);

void flushScheduledEvents() {
    scheduledEvents.incrementAndGet();

    if (sslLock.tryLock()) {            
        try {
            do {
                while ((event = filterWriteEventQueue.poll()) != null) {
                    // ...
                }
            
                while ((event = messageReceivedEventQueue.poll()) != null){
                    // ...
                }
            } while (scheduledEvents.decrementAndGet() > 0);
        } finally {
            sslLock.unlock();
        }
    }
}{code}
We have observed occasions where the value of {{scheduledEvents}} becomes a 
negative value, while at the same time {{filterWriteEventQueue}} go unprocessed.

We suspect that this issue is triggered by a concurrency issue caused by the 
first thread decrementing the counter after a second thread incremented it, but 
before it attempted to acquire the lock.

This allows the the first thread to empty the queues, decrementing the counter 
to zero and release the lock, after which the second thread acquires the lock 
successfully. Now, the second thread processes any elements in 
{{filterWriteEventQueue}}, and then processes any elements in 
{{messageReceivedEventQueue}}. If in between these two checks yet another 
thread adds a new element to {{filterWriteEventQueue}}, this element can go 
unprocessed (as the second thread does not loop, since the counter is zero or 
negative, and the third thread can fail to acquire the lock).

It's a seemingly unlikely scenario, but we are observing the behavior when our 
systems are under high load.

We've applied a code change after which this problem is no longer observed. 
We've removed the counter, and check on the size of the queues instead:

{code:java}
void flushScheduledEvents() {
    if (sslLock.tryLock()) {            
        try {
            do {
                while ((event = filterWriteEventQueue.poll()) != null) {
                    // ...
                }
            
                while ((event = messageReceivedEventQueue.poll()) != null){
                    // ...
                }
            } while (!filterWriteEventQueue.isEmpty() || 
!messageReceivedEventQueue.isEmpty());
        } finally {
            sslLock.unlock();
        }
    }
}{code}

This code change, as illustrated above, does introduce a new potential problem. 
Theoretically, an event could be added to the queues and 
{{flushScheduledEvents}} be called returning {{false}} for 
{{sslLock.tryLock()}}, exactly after another thread just finished the {{while}} 
loop, but before releasing the lock. This again would cause events to go 
unprocessed.

We've not observed this problem in the wild yet, but we're uncomfortable 
applying this change as-is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to