Jason,

Thanks for your contribution!
Would you mind opening a JIRA bug report and attaching the diff file
to it, checking the ASF license check box?

Thanks a lot,

  Bernd

On Sat, Apr 3, 2010 at 17:10, Jason Resch <[email protected]> wrote:
> All,
>
> I think I've discovered a bug in Mina's TLS code.  While I encountered
> this bug in Mina 1.1.7, the same problematic code appears to exist in
> Mina 2.0.  The code in question is the flushScheduledEvents() method in
> SSLHandler.java:
>
> public void flushScheduledEvents() {
>     // Fire events only when no lock is held for this handler.
>     if (Thread.holdsLock(this)) {
>         return;
>     }
>
>     Event e;
>     // We need synchronization here inevitably because filterWrite can
>     // be called simultaneously and cause 'bad record MAC' integrity
>     // error.
>     synchronized (this) {
>         while ((e = filterWriteEventQueue.poll()) != null) {
>             e.nextFilter.filterWrite(session, (WriteRequest) e.data);
>         }
>     }
>
>     while ((e = messageReceivedEventQueue.poll()) != null) {
>         e.nextFilter.messageReceived(session, e.data);
>     }
> }
>
> This method is called both by threads that handle writes and by threads
> that handle reads.  Therefore, as the comments suggest, multiple threads
> may go through this code simultaneously.  However, since there is no
> synchronization around the processing of messageReceivedEventQueue,
> received messages can be sent to the next filter out of order if there
> is more than one message in the queue and a context switch happens at
> the wrong time.
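
[As a standalone illustration of the interleaving described above --
this is not MINA code, and the class and names here are hypothetical --
one thread can poll an event, lose the CPU before dispatching it, and a
second thread can then poll and dispatch the next event first:

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;

public class ReorderDemo {
    public static void main(String[] args) throws InterruptedException {
        Queue<Integer> events = new ConcurrentLinkedQueue<Integer>();
        events.add(1);
        events.add(2);

        // Latches force the problematic schedule deterministically.
        final CountDownLatch aPolled = new CountDownLatch(1);
        final CountDownLatch bDelivered = new CountDownLatch(1);

        Thread a = new Thread(() -> {
            Integer e = events.poll();            // A takes message 1...
            aPolled.countDown();
            try {
                bDelivered.await();               // ...but is pre-empted
            } catch (InterruptedException ignored) {
            }
            System.out.println("delivered " + e); // and delivers it last
        });

        Thread b = new Thread(() -> {
            try {
                aPolled.await();
            } catch (InterruptedException ignored) {
            }
            Integer e = events.poll();            // B takes message 2
            System.out.println("delivered " + e); // and delivers it first
            bDelivered.countDown();
        });

        a.start();
        b.start();
        a.join();
        b.join();
        // Prints "delivered 2" then "delivered 1": out of order.
    }
}
-- ed.]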
>
> The bug manifested in our application as a failure of our protocol
> layer to decode a message, which we believe was due to such a
> re-ordering.  It occurred only in environments with a large amount of
> contention and network traffic, and only when using TLS.  The fix I
> have tested was to move the closing brace of the synchronized block so
> that it covers both while loops.  I've attached a patch representing
> that change.  Since making that change we have not encountered the bug
> again after about 30 hours of testing and 1.5 TB of traffic, whereas
> before the change we could reproduce it within a few minutes.
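
[For concreteness, this is the patched method as described above; the
only change from the original snippet is that the synchronized block
now extends over both while loops:

public void flushScheduledEvents() {
    // Fire events only when no lock is held for this handler.
    if (Thread.holdsLock(this)) {
        return;
    }

    Event e;
    // Synchronizing over both queues prevents concurrent filterWrite
    // calls from corrupting the record stream and keeps received
    // messages in order.
    synchronized (this) {
        while ((e = filterWriteEventQueue.poll()) != null) {
            e.nextFilter.filterWrite(session, (WriteRequest) e.data);
        }

        while ((e = messageReceivedEventQueue.poll()) != null) {
            e.nextFilter.messageReceived(session, e.data);
        }
    }
}
-- ed.]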
>
>
> Jason
>
>
