[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838709#comment-16838709 ]
Guus der Kinderen commented on DIRMINA-1107: -------------------------------------------- Thanks for all the feedback! For what it's worth: we're currently experimenting with a code change that intends to: * Ensure that queues are emptied at least as often as {{flushScheduledEvents}} is invoked (which should prevent that events remain in the queue indefinitely after the flush has been requested - we can't be sure that another flush is requested in a timely manner). * Not block any threads (continue to allow for the 'tryLock' if statement, to prevent worker threads from being blocked). We are attempting to do this by introducing a lock that is released, atomically, only after it has been "used" as often as the lock has attempted to be acquired. Our work-in-progress code: {code:java}private class CountReentrantLock { private int i = 0; private final ReentrantLock lock = new ReentrantLock(); public synchronized boolean tryAcquireLock() { i++; return lock.tryLock(); } public synchronized boolean tryUnlock() { i--; if(i<=0) { i=0; lock.unlock(); return true; } return false; } public synchronized void forceUnlock() { i = 0; lock.unlock(); } } {code} This then could be used like this: {code:java} void flushScheduledEvents() { if (sslLock.tryAcquireLock()) { try { do { while ((event = filterWriteEventQueue.poll()) != null) { // ... } while ((event = messageReceivedEventQueue.poll()) != null){ // ... } } while (!filterWriteEventQueue.isEmpty() || !messageReceivedEventQueue.isEmpty() || !sslLock.tryUnlock()); } catch( Throwable t) { sslLock.forceUnlock(); } } }{code} We've not tested this code yet. A concern that we haven't thought through yet is re-entry of a thread that already owns the lock. > SslHandler flushScheduledEvents race condition, redux > ----------------------------------------------------- > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug > Affects Versions: 2.1.2 > Reporter: Guus der Kinderen > Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)