[ 
https://issues.apache.org/jira/browse/DIRMINA-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emmanuel Lecharny resolved DIRMINA-753.
---------------------------------------

    Resolution: Fixed

Applied the proposed fix; it seems to be solid. It may not be perfect, but in any 
case it's better than what we currently have!

Many thanks, John, brilliant analysis!

http://svn.apache.org/viewvc?rev=898583&view=rev

> Some writes observed to stall until select times out
> ----------------------------------------------------
>
>                 Key: DIRMINA-753
>                 URL: https://issues.apache.org/jira/browse/DIRMINA-753
>             Project: MINA
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 2.0.0-RC1
>         Environment: java version "1.6.0_16"
> Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode)
>            Reporter: John R. Fallows
>            Assignee: Emmanuel Lecharny
>             Fix For: 2.0.0-RC2
>
>
> We have observed an issue in Mina 2.0.0-RC1 where TCP writes are not flushed 
> to the network even though the receiver is keeping up and CPU has low 
> utilization.
> In the test scenario, all of the clients initially connect to the server, 
> then the server starts writing messages to each of the clients.  The clients 
> connect via localhost to eliminate clock drift and physical network impact, 
> and we measure the observable latency of the message between sender and 
> receiver.
> Note that the clients do not send any data to the server after messages start 
> flowing from the server, just ack traffic.
> When the problem occurs, some messages are unexpectedly delayed by 
> approximately 1000ms.
> By observing a tcpdump of the wire traffic during the test, we verified that 
> the receiver was keeping up with message delivery, but that message delivery 
> would sometimes stall for approximately 1000ms after which the delayed 
> messages would be written to the wire.
> By re-running the same test with a locally modified version of 
> AbstractPollingIoProcessor, such that the select() timeout was 750ms
> instead of 1000ms, we found that the messages were delayed by approximately 
> 750ms instead.
> Further inspection of AbstractPollingIoProcessor.flush(T session) suggests a 
> potential race condition that would explain this behavior:
>    public final void flush(T session) {
>        boolean needsWakeup = flushingSessions.isEmpty();
>        if (scheduleFlush(session) && needsWakeup) {
>            wakeup();
>        }
>    }
> For the purposes of this description, let's say 
> AbstractPollingIoProcessor.flush(T session) is called by the [writer]
> thread, while Selector.select(long timeout) is called by the [processor] 
> thread.
> [writer]       flushingSessions.isEmpty() -> false (last flushing session 
> being processed)
> [processor] flush(currentTime), flushingSessions.poll() empties 
> flushingSessions, completely write last flushing session, write request queue 
> empty, no subsequent flush scheduled
> [writer]       scheduleFlush(session) returns true, flushingSessions now 
> non-empty, requires wakeup(), but not called (!)
> [processor] stalls until select timeout expires, then flushes
> It seems that the intent of needsWakeup is to avoid calling wakeup() 
> unnecessarily by tracking when flushingSessions becomes non-empty, but it 
> does not capture the edge case where flushingSessions starts out non-empty, 
> then becomes empty, then becomes non-empty again, triggering the stall.
> We found that the following modification to 
> AbstractPollingIoProcessor.flush(T session) addressed the issue without 
> adverse effects.
>    public final void flush(T session) {
>        if (scheduleFlush(session)) {
>            wakeup();
>        }
>    }
> The difference in behavior is as follows:
>  1. wakeup() could be called many times between calls to select()
>  2. select() might unblock one extra time when the selector is otherwise idle
> Case (1) is documented by Selector.wakeup() to be equivalent to a single call 
> to wakeup() between calls to select().  The assumption is that repeated 
> calls to wakeup() are cheap because the wakeup is latched internally.
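> For reference, a small standalone snippet (not part of the report's test 
> code) demonstrating that wakeup() is latched: several calls before the next 
> select() cause at most one immediate return.
>    import java.nio.channels.Selector;
>
>    public class WakeupLatchDemo {
>        public static void main(String[] args) throws Exception {
>            Selector selector = Selector.open();
>            try {
>                // Several wakeup() calls before the next select() are latched into one.
>                selector.wakeup();
>                selector.wakeup();
>                selector.wakeup();
>
>                long t0 = System.currentTimeMillis();
>                selector.select(1000); // returns immediately, consuming the latched wakeup
>                long t1 = System.currentTimeMillis();
>                selector.select(1000); // no pending wakeup, blocks for ~1000ms
>                long t2 = System.currentTimeMillis();
>
>                System.out.println("first select:  " + (t1 - t0) + " ms");
>                System.out.println("second select: " + (t2 - t1) + " ms");
>            } finally {
>                selector.close();
>            }
>        }
>    }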
> Case (2) can occur as follows:
> [writer]       scheduleFlush(session) returns true
> [processor] processes all flushingSessions, including most recent addition by 
> [writer]
> [writer]       calls wakeup()
> [processor] immediately unblocks next call to select() due to wakeup() call 
> by [writer]
> [processor] blocks again on next call to select()
> When the [processor] unblocks from select() because of a wakeup() call by 
> [writer] while there is other work to do in the [processor] loop, that is not 
> an _extra_ unblock; so any additional overhead introduced by eliminating the 
> needsWakeup check is only observed when the selector is otherwise idle, and 
> is therefore considered harmless.
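> To make the "harmless when idle" point concrete, a simplified sketch of the 
> processor loop (hypothetical, not the actual AbstractPollingIoProcessor 
> code): a spurious wakeup only costs one pass that finds nothing to do before 
> blocking again.
>    import java.io.IOException;
>    import java.nio.channels.Selector;
>    import java.util.Queue;
>    import java.util.concurrent.ConcurrentLinkedQueue;
>
>    class ProcessorLoopSketch implements Runnable {
>        private final Selector selector;
>        private final Queue<Object> flushingSessions =
>            new ConcurrentLinkedQueue<Object>();
>        private volatile boolean disposing;
>
>        ProcessorLoopSketch(Selector selector) {
>            this.selector = selector;
>        }
>
>        void dispose() {
>            disposing = true;
>            selector.wakeup(); // unblock the loop so it can exit
>        }
>
>        public void run() {
>            while (!disposing) {
>                try {
>                    int selected = selector.select(1000); // unblocks early on wakeup()
>                    if (selected > 0) {
>                        // process ready keys (reads/writes) ...
>                    }
>                    // drain scheduled flushes; on a spurious wakeup this finds nothing
>                    Object session;
>                    while ((session = flushingSessions.poll()) != null) {
>                        // flush the session's write request queue ...
>                    }
>                } catch (IOException e) {
>                    // logged and tolerated in the real code
>                }
>            }
>        }
>    }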

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
