[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17488507#comment-17488507 ] Emmanuel Lécharny commented on DIRMINA-1107: backported to 2.0 branch > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Assignee: Jonathan Valliere >Priority: Major > Fix For: 2.1.3, 2.0.23 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: dev-unsubscr...@mina.apache.org For additional commands, e-mail: dev-h...@mina.apache.org
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845261#comment-16845261 ] Emmanuel Lecharny commented on DIRMINA-1107: Can we backport the fix to 2.0 ? > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844809#comment-16844809 ] Jonathan Valliere commented on DIRMINA-1107: Merged into 2.1.X [99fbf4fe64d72409b19f8835c0a568022916d840|https://gitbox.apache.org/repos/asf?p=mina.git;a=commit;h=99fbf4fe64d72409b19f8835c0a568022916d840] > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844611#comment-16844611 ] Emmanuel Lecharny commented on DIRMINA-1107: Great ! Jonathan, I let you merge the branch. We probably want to backport the change to the 2.0.X branch too. Let me know and I'll kick a release when done. Thanks a lot ! > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844606#comment-16844606 ] Guus der Kinderen commented on DIRMINA-1107: We've replaced our initial fix with what Jonathan provided. His fix appears to prevent the original issue from occurring. We're happy to close this issue with that fix! Thanks! > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843823#comment-16843823 ] Emmanuel Lecharny commented on DIRMINA-1107: Guus, I have checked Jonathan's patch on my machine (Mac OSX with Java 1.8.0_192, it works like a charm. Would you give it a try ? Thanks ! {noformat} $ git clone --single-branch http://gitbox.apache.org/repos/asf/mina.git -b DIRMINA-1107 mina-DIRMINA-1107 $ cd mina-DIRMINA-1107/ $ git status On branch DIRMINA-1107 Your branch is up to date with 'origin/DIRMINA-1107'. $ mvn clean install -Pserial ... [INFO] [INFO] Reactor Summary for Apache MINA 2.1.3-SNAPSHOT: [INFO] [INFO] Apache MINA SUCCESS [ 1.515 s] [INFO] Apache MINA Legal .. SUCCESS [ 1.327 s] [INFO] Apache MINA Core ... SUCCESS [02:00 min] [INFO] Apache MINA APR Transport .. SUCCESS [ 0.453 s] [INFO] Apache MINA Compression Filter . SUCCESS [ 1.049 s] [INFO] Apache MINA State Machine .. SUCCESS [ 1.336 s] [INFO] Apache MINA JavaBeans Integration .. SUCCESS [ 1.010 s] [INFO] Apache MINA XBean Integration .. SUCCESS [ 1.692 s] [INFO] Apache MINA OGNL Integration ... SUCCESS [ 0.281 s] [INFO] Apache MINA JMX Integration SUCCESS [ 0.290 s] [INFO] Apache MINA Examples ... SUCCESS [ 3.038 s] [INFO] Apache MINA HTTP client and server codec ... SUCCESS [ 0.951 s] [INFO] Apache MINA Serial Communication support ... SUCCESS [ 0.305 s] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 02:14 min [INFO] Finished at: 2019-05-20T11:46:13+02:00 [INFO] {noformat} > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > //
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843262#comment-16843262 ] Jonathan Valliere commented on DIRMINA-1107: The issue with SSL is that the order messages are encoded MUST be the same order they are written to the socket. The {{scheduledEvents}} atomic is essentially useless as implemented because it relies increment/decrements instead of CAS operations leading to inconsistent behaviors. The {{sslLock}} is also ONLY used in {{flushScheduledEvents}} which means it could be removed without negatively impacting other operations. In order to ensure that messages are pushed into the {{IoSession}} in the same order as they were encoded within the mutex on {{sslHandler}}. It is my recommendation to break {{flushScheudledEvents}} into flush operations for receive and write. Then ensure that all {{flushFilterWrite}} occurs within the mutex on {{sslHandler}} to ensure correct write order. This will cause write operations to be within the mutex longer than before but does not require any additional locks since all writing threads must obtain the write mutex anyway. This patch is pushed to [branch/DIRMINA-1107|https://gitbox.apache.org/repos/asf?p=mina.git;a=shortlog;h=refs/heads/DIRMINA-1107] > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is.
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16842227#comment-16842227 ] Emmanuel Lecharny commented on DIRMINA-1107: Continuing my archeocodology, here are the situations where it's possible to write a message in the IoHandler while the handshake has not been completed : o for a server: just when the session is created, and if the {{SSLFilter}} is added to the session's chain, and if the handshake is started immediately, then the server will expect the next received message to be a {{CLIENT_HELLO}}. However, after having injected the filter in the chain, and created the {{SslHandler}} instance, and having started the handshake in server mode, then a {{sessionCreated}} followed by a {{sessionOpened}} events are called, which potentially gives the application to call a {{session.write(message)}}. Obviously, this message should *not* go through the {{SSLFilter}} before the handshake has been completed, it must be enqueued, waiting for the handshake to complete (or fail). o If the handshake takes a while, and if the session idling detection is activated (every second), the it's possible that the {{IoHandler}} gets called through the {{sessionIdling}} event, giving the opportunity for the application to write a message (ok, that is a remote possibility) Just writing that down for the sake of documentation... > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16842072#comment-16842072 ] Emmanuel Lecharny commented on DIRMINA-1107: Forget about leveraging the {{event}} method: as it is initiated in the {{SslFilter}}, it won't be available in this very filter... The big issue I have atm is how to ensure that the messages we have enqueued will be flushed when and only when the handshake have been completed, and how to guarantee that no new message is going to be enqueued while we are flushing them. This can typically happen when we start flushing the existing pending messages, while some thread is writing a new message: we must wait until all the existing messages in the queue have been written. Also we don't want to keep this queue forever: once the handshake has been completed, and the queue flushed, it is not useful anymore. Actually, we need two barriers: * one for messages that are produced *before* the handshake has started but *after* the {{SslFilter}} has been added into the chain (this is only something to deal with when we inject the filter into an active session - like when using startTLS -) * one for messages that are produced during the handshake Between those two barriers, all messages are enqueued, after those two barriers, no message is enqueued. And the tricky part: while the queue is flushed, no message should be added into the queue, but we must not be flushed either (to guarantee the messages' order): {noformat} [entry barrier]---[exit barrier]--[queue flushed] |< messages are enqueued>|< messages are waiting >| {noformat} Now to find a way to code that... > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841505#comment-16841505 ] Emmanuel Lecharny commented on DIRMINA-1107: FTR, I just deleted a comment and an attached file, as I realized 5 mins after having pushed them I was wrong. The *only* reason we have a write queue is for when we have messages to be written while the handshake is not completed: we need to wait and keep those messages. My deleted comment was that we can bypass the {{SslHandler.scheduleFilterWrite()}} and push the writes to the next filter. That works, assuming no messages are written while the handshake is being processed (this is possibly what could happen when using {{startTLS}}). We can leverage the newly added {{event}} in {{2.1}} to start flushing those pending messages when the session has been secured, but in {{2.0}} it's going a bit more complex. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841490#comment-16841490 ] Emmanuel Lecharny commented on DIRMINA-1107: So here is my first proposed change in {{SslFilter}} (see attachement) that directly pushes writes to the next filter (aka {{HeadFilter}}) without having to use the {{SslHander.flushScheduledEvents()}} method (I haven't yet modified the {{SslHandler}} class to get rid of the write queue, but will do soon). Tests are passing green. More to come. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > Attachments: ssl.diff > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841103#comment-16841103 ] Guus der Kinderen commented on DIRMINA-1107: \o/ For what it's worth: since I wrote that comment, we have successfully applied the solution that I've outlined in https://issues.apache.org/jira/browse/DIRMINA-1107?focusedCommentId=16838709=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16838709 > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841096#comment-16841096 ] Emmanuel Lecharny commented on DIRMINA-1107: Don't worry, we hijacked the thread devising about some future changes. The fix will be applied in the existing branches, without breaking the APIs. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841084#comment-16841084 ] Guus der Kinderen commented on DIRMINA-1107: I'd love to see a solution for the 2.1 branch, even if that's a patch of the existing, sub-par, implementation. We're currently suffering from issues (which prompted me to create this issue), which we'd like to fix before/without upgrading to a new major release of MINA. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841071#comment-16841071 ] Emmanuel Lecharny commented on DIRMINA-1107: Agreed. In any case, that would be quite a disruptive change, and can't be injected in 2.0 or 2.1. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840626#comment-16840626 ] Jonathan Valliere commented on DIRMINA-1107: Honestly it doesn't really matter, would have to be invoked by the IoProcessor anyway. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840575#comment-16840575 ] Emmanuel Lecharny commented on DIRMINA-1107: It should never be a filter, that's for sure. Rather than integrating it in the session, I would put it in the IoProcessor, as close as possible to the channel. Why would you suggest to add this in the session ? (just curious) > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840551#comment-16840551 ] Jonathan Valliere commented on DIRMINA-1107: Pretty sure 3.0 design was to integrate SSL into the IoSession directly. This is the way that Netty does it. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840518#comment-16840518 ] Emmanuel Lecharny commented on DIRMINA-1107: We could actually write the fixed size buffer into the {{IoProcessor}} queue, and wait for the {{messageSent}} event in the {{SslFilter}} to write the next buffer. that would work, but not when having an executor in the chain (except if we enqueue the messages to be encrypted in a {{SslFilter}} queue. For a 3.0 version, this is clearly an option. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840470#comment-16840470 ] Jonathan Valliere commented on DIRMINA-1107: The only other option than large buffers would be to encrypt at socket write time which would require a lot of other changes and make filters before SSL impossible. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840433#comment-16840433 ] Emmanuel Lecharny commented on DIRMINA-1107: Side note: The way encrypted messages are handled is a bit mad. Consider a BIG message (say, 1Mb). SSL does not allow messages bigger than roughly 16kb (2^14 minus header, MAC and padding. Except for M$, of course, which allows up to 32kb messages...). In any case, we will start with a 16Kb buffer, call encrypt again and again until the full source has been read and encrypted, increasing the buffer as needed. This buffer will contain *many* TLS {{APPLICATION_DATA}} records (starting with {{0x17}}). We could perfectly chose to always use a fixed size buffer, and once full, send it to the remote peer, avoiding the allocation of a crazy big buffer which will be discarded when done. The only problem is that we would need to remember when we have fully written the source, inorder to properly send the {{messageSent}} event. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840425#comment-16840425 ] Jonathan Valliere commented on DIRMINA-1107: IMHO only the write mutex is needed to guarantee that writes are enqueued on the IoSession in the correct order. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840413#comment-16840413 ] Emmanuel Lecharny commented on DIRMINA-1107: Ok, after having analysed the {{filterWrite}} code a bit more, I don't see any reason to differ the writing of messages either. Actually, we should never use a queue to stack messages. All of this madness comes from a 'fix' applied 12 years ago (see [DIRMINA-241|https://issues.apache.org/jira/browse/DIRMINA-241]). So, yes, we may have multiple writes done on a session by different threads when using an executor in the chain, but that should be handled properly with the {{synchronized(sslHandler}}} in the {{SslFilter}}. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839462#comment-16839462 ] Emmanuel Lecharny commented on DIRMINA-1107: My point, exactly. We should have the {{SSLEngine}} process incoming messages immediately, and once decrypted, if they are application messages, being pushed into the filter chain immediately too (the special case is for handshake messages, because we don't push them anywhere, they are consumed by the SslHandler). At this point, we are doing archeocodology :-) > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839454#comment-16839454 ] Jonathan Valliere commented on DIRMINA-1107: [~elecharny] Why is there a {{messageReceivedEventQueue}} anyway? Shouldn't those messages go directly into the SSL Engine without delay? > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839167#comment-16839167 ] Emmanuel Lecharny commented on DIRMINA-1107: What I mean is that the {{messageReceivedEventQueue}} is only fed by the thread that read data from the channel (ie the {{IoProcessor}} thread). There is no other thread that interact with it, and there is no reason to postpone the propagation of the message to the chain. In other words, there is no reason to process those messages in the {{flushScheduledEvents}} method. The tricky part is to handle properly the handshake messages which should *not* be propagated to the chain. Anyway, that leave the other queue alone, and we can now safely process it in the {{flushScheduledEvents}} method without any interaction with the read messages. I have to do some experiment around this idea. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839151#comment-16839151 ] Guus der Kinderen commented on DIRMINA-1107: [~elecharny] wrote {quote} The flushScheduleEvents() method, which is called by the startTls, stopTls}}, messageReceived, filterWrite, filterClose and initiateHandshake methods, should first propagate the received messages - because it's synchronous - and then process all the messages to be written, in a protected section.{quote} I do not understand what you're trying to say. I don't think we can predict if the thread calling {{flushScheduledEvents}} is the single thread that processes incoming messages, or if it's a thread (potentially from a thread pool associated to an Executor) that is processing outgoing messages. Even if the outgoing messages would be guaranteed to be processed by a single thread, we'd still have to guard against the inbound and outbound thread calling {{flushScheduledEvents}} at the same time, right? > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838939#comment-16838939 ] Emmanuel Lecharny commented on DIRMINA-1107: https://tools.ietf.org/html/rfcThe big problem with 8446, 4.1.2, "Because TLS 1.3 forbids renegotiation, if a server has negotiated TLS 1.3 and receives a ClientHello at any other time, it MUST terminate the connection with an "unexpected_message" alert." The big issue with renegotiation is that you can DOS a server with it quite easily. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838924#comment-16838924 ] Jonathan Valliere commented on DIRMINA-1107: I thought there were other commands in TLS that control the encryption layer. Oh well, they removed Key Reneg in TLS 1.3? > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838921#comment-16838921 ] Emmanuel Lecharny commented on DIRMINA-1107: Key renegotiation (which has been removed in 1.3) starts like a handshake, with a {{ClientHello}} message, over the encrypted layer. This is a very specific use case, and should be handled as is. ATM, we don't have implemented it in MINA, AFAICT. What are the control functions you are mentioning ? > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838919#comment-16838919 ] Jonathan Valliere commented on DIRMINA-1107: [~elecharny] TLS key renegotiation and other control functions. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838918#comment-16838918 ] Emmanuel Lecharny commented on DIRMINA-1107: @Jonathan what for ? Once the HS is completed, if you receive something from a remote peer, you won't have to write anything back. The only exception is for the Handshake protocol. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838914#comment-16838914 ] Emmanuel Lecharny commented on DIRMINA-1107: Regarding the multi-threading aspect: * Incoming messages are handled by one single thread. A session is always associated with this thread when created, so we are always safe. * outgoing messages are a bit different, because we may have an executor in the chain. In this case, a session might call a write in many different threads, and that must be handled properly. In any case, once the data have been encrypted, and pushed into the {{IoProcessor}} queue (always the same), then it will be written by a single thread, the {{IoProcessor}} thread. So we have to consider protecting the {{SSL}} filter against code that has an executor. The {{flushScheduleEvents()}} method, which is called by the {{startTls}}, stopTls}}, {{messageReceived}}, {{filterWrite}}, {{filterClose}} and {{initiateHandshake}} methods, should first propagate the received messages - because it's synchronous - and then process all the messages to be written, in a protected section. Anyway, it's late, I need to rest and review the code again tomorrow or later on, to see if I'm right :-) > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838907#comment-16838907 ] Jonathan Valliere commented on DIRMINA-1107: Every time you receive a message you have to check to see if SSL needs to write a reply. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838898#comment-16838898 ] Emmanuel Lecharny commented on DIRMINA-1107: There are a few bad things done. In the {{SslFilter.messageReceived()}} method, we push any incoming message if the {{Ssl}} layer has not yet been installed fully : {code:java} synchronized (sslHandler) { if (!isSslStarted(session) && sslHandler.isInboundDone()) { // The SSL session must be established first before we // can push data to the application. Store the incoming // data into a queue for a later processing sslHandler.scheduleMessageReceived(nextFilter, message); } else { {code} It does not make a lot of sense. There is absolutely no way the {{SslFilter}} {{messageReceived}} method could be called if the filter has not been set fully. The check done is for the case where the {{SSL}} layer has been closed, but in this very case, what's the point in passing the received message to the {{IoHandler}} ? Otherwise, incoming messages are processed in the {{SslHandler.messageReceived}} method, and the message will be uncrypted, then pushed into the queue. The bizarre thing is that we do write encrypted data during this call : {code:java} private void handleSslData(NextFilter nextFilter, SslHandler sslHandler) throws SSLException { ... // Write encrypted data to be written (if any) sslHandler.writeNetBuffer(nextFilter); // handle app. data read (if any) handleAppDataRead(nextFilter, sslHandler); } {code} I don't see any reason to do so at this point. Why should we write *anything* back to the remote peer when we are processing an incoming message ? This part of the code needs a serious review, IMHO. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || >
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838709#comment-16838709 ] Guus der Kinderen commented on DIRMINA-1107: Thanks for all the feedback! For what it's worth: we're currently experimenting with a code change that intends to: * Ensure that queues are emptied at least as often as {{flushScheduledEvents}} is invoked (which should prevent that events remain in the queue indefinitely after the flush has been requested - we can't be sure that another flush is requested in a timely manner). * Not block any threads (continue to allow for the 'tryLock' if statement, to prevent worker threads from being blocked). We are attempting to do this by introducing a lock that is released, atomically, only after it has been "used" as often as the lock has attempted to be acquired. Our work-in-progress code: {code:java}private class CountReentrantLock { private int i = 0; private final ReentrantLock lock = new ReentrantLock(); public synchronized boolean tryAcquireLock() { i++; return lock.tryLock(); } public synchronized boolean tryUnlock() { i--; if(i<=0) { i=0; lock.unlock(); return true; } return false; } public synchronized void forceUnlock() { i = 0; lock.unlock(); } } {code} This then could be used like this: {code:java} void flushScheduledEvents() { if (sslLock.tryAcquireLock()) { try { do { while ((event = filterWriteEventQueue.poll()) != null) { // ... } while ((event = messageReceivedEventQueue.poll()) != null){ // ... } } while (!filterWriteEventQueue.isEmpty() || !messageReceivedEventQueue.isEmpty() || !sslLock.tryUnlock()); } catch( Throwable t) { sslLock.forceUnlock(); } } }{code} We've not tested this code yet. A concern that we haven't thought through yet is re-entry of a thread that already owns the lock. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > //
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838707#comment-16838707 ] Jonathan Valliere commented on DIRMINA-1107: If the same fix can be applied to both branches then we can update this ticket to include both. At this point in time we don't know that yet. I suggest focusing on fixing as-is then figure out what is an appropriate action for back porting after. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838682#comment-16838682 ] Emmanuel Lecharny commented on DIRMINA-1107: I'd rather have this issue cover both versions. i'll have a look at this code tonite, but it's a quite convoluted piece of code, so I may wait for Jonathan to be able to review it too so that we find a proper fix once and for all. Thanks for the report, Guus ! > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838675#comment-16838675 ] Jonathan Valliere commented on DIRMINA-1107: I updated the AFFECTS / FIX VERSION, once fixed we'll have to create another issue to back-port this fix to 2.0.X > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Affects Versions: 2.1.2 >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838673#comment-16838673 ] Jonathan Valliere commented on DIRMINA-1107: The counter doesn't make any sense since each queue flushes until empty and why would you increment the counter before flushing? Not much of this makes sense. Like I said, can't look at this until the weekend. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3, 2.0.23 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838669#comment-16838669 ] Emmanuel Lecharny commented on DIRMINA-1107: The pb is that we have 2 queues, and we have to check both of them at the same time. In any case, I also need to review this part of the code. AFAICT, the idea is to be sure we have pushed all the messages, in the proper order (we can't afford 2 threads to poll the queues and flush the messages at the same time). The thing is that I wonder why we need a counter at all. And BTW disregard the note on the counter that needs to be included in the lock section: this is nonsense. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3, 2.0.23 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838662#comment-16838662 ] Guus der Kinderen commented on DIRMINA-1107: We've found the issue in version 2.1.2. Placing the counter inside the protected code block would open the possibility for an invocation of {{flushScheudledEvents}} that does nothing (fails to acquire the lock), while there _are_ events in the queue. This would occur if the first thread just exited the outer 'do/while' loop, but before it released the lock in the 'finally' block, when the second thread tries to acquire the lock. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3, 2.0.23 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838654#comment-16838654 ] Jonathan Valliere commented on DIRMINA-1107: I'm not exactly sure the real purpose of that counter other than to create a concurrency problem. Aren't the EventQueues concurrent? I'm not the domain owner for the SSL implementation so I'd have to look at it next weekend when I have time. We keep saying that SSL needs to be completely redesigned, might be time to look at what we have and make recommendations. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3, 2.0.23 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838647#comment-16838647 ] Emmanuel Lecharny commented on DIRMINA-1107: Seems like we should include the counter into the locked part. Actually, I don't understand why it's not in this section... Something like : {code:java} void flushScheduledEvents() { if (sslLock.tryLock()) { scheduledEvents.incrementAndGet(); try { do { while ((event = filterWriteEventQueue.poll()) != null) { // ... } while ((event = messageReceivedEventQueue.poll()) != null){ // ... } } while (scheduledEvents.decrementAndGet() > 0); } finally { sslLock.unlock(); } } } {code} At this point, we don't even have to use an AtomicInteger for the {{scheduledEvents}} value. > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3, 2.0.23 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
[ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838625#comment-16838625 ] Jonathan Valliere commented on DIRMINA-1107: What version does this apply to? > SslHandler flushScheduledEvents race condition, redux > - > > Key: DIRMINA-1107 > URL: https://issues.apache.org/jira/browse/DIRMINA-1107 > Project: MINA > Issue Type: Bug >Reporter: Guus der Kinderen >Priority: Major > Fix For: 2.1.3, 2.0.23 > > > DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally > replaces it with another multithreading issue. > The fix for DIRMINA-1019 introduces a counter that contains the number of > events to be processed. A simplified version of the code is included below. > {code:java} > private final AtomicInteger scheduledEvents = new AtomicInteger(0); > void flushScheduledEvents() { > scheduledEvents.incrementAndGet(); > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (scheduledEvents.decrementAndGet() > 0); > } finally { > sslLock.unlock(); > } > } > }{code} > We have observed occasions where the value of {{scheduledEvents}} becomes a > negative value, while at the same time {{filterWriteEventQueue}} go > unprocessed. > We suspect that this issue is triggered by a concurrency issue caused by the > first thread decrementing the counter after a second thread incremented it, > but before it attempted to acquire the lock. > This allows the the first thread to empty the queues, decrementing the > counter to zero and release the lock, after which the second thread acquires > the lock successfully. Now, the second thread processes any elements in > {{filterWriteEventQueue}}, and then processes any elements in > {{messageReceivedEventQueue}}. If in between these two checks yet another > thread adds a new element to {{filterWriteEventQueue}}, this element can go > unprocessed (as the second thread does not loop, since the counter is zero or > negative, and the third thread can fail to acquire the lock). > It's a seemingly unlikely scenario, but we are observing the behavior when > our systems are under high load. > We've applied a code change after which this problem is no longer observed. > We've removed the counter, and check on the size of the queues instead: > {code:java} > void flushScheduledEvents() { > if (sslLock.tryLock()) { > try { > do { > while ((event = filterWriteEventQueue.poll()) != null) { > // ... > } > > while ((event = messageReceivedEventQueue.poll()) != null){ > // ... > } > } while (!filterWriteEventQueue.isEmpty() || > !messageReceivedEventQueue.isEmpty()); > } finally { > sslLock.unlock(); > } > } > }{code} > This code change, as illustrated above, does introduce a new potential > problem. Theoretically, an event could be added to the queues and > {{flushScheduledEvents}} be called returning {{false}} for > {{sslLock.tryLock()}}, exactly after another thread just finished the > {{while}} loop, but before releasing the lock. This again would cause events > to go unprocessed. > We've not observed this problem in the wild yet, but we're uncomfortable > applying this change as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)