[ https://issues.apache.org/jira/browse/DIRMINA-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Emmanuel Lecharny resolved DIRMINA-753. --------------------------------------- Resolution: Fixed Applied the proposed fix, seems to be solid. May be not perfect, but in any case, better than what we currently have ! Many thanks John, brillant analysis ! http://svn.apache.org/viewvc?rev=898583&view=rev > Some writes observed to stall until select times out > ---------------------------------------------------- > > Key: DIRMINA-753 > URL: https://issues.apache.org/jira/browse/DIRMINA-753 > Project: MINA > Issue Type: Bug > Components: Core > Affects Versions: 2.0.0-RC1 > Environment: java version "1.6.0_16" > Java(TM) SE Runtime Environment (build 1.6.0_16-b01) > Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode) > Reporter: John R. Fallows > Assignee: Emmanuel Lecharny > Fix For: 2.0.0-RC2 > > > We have observed an issue in Mina 2.0.0-RC1 where TCP writes are not flushed > to the network even though the receiver is keeping up and CPU has low > utilization. > In the test scenario, all of the clients initially connect to the server, > then the server starts writing messages to each of the clients. The clients > connect via localhost to eliminate clock drift and physical network impact, > and we measure the observable latency of the message between sender and > receiver. > Note that the clients do not send any data to the server after messages start > flowing from the server, just ack traffic. > When the problem occurs, some messages are unexpectedly delayed by > approximately 1000ms. > By observing a tcpdump of the wire traffic during the test, we verified that > the receiver was keeping up with message delivery, but that message delivery > would sometimes stall for approximately 1000ms after which the delayed > messages would be written to the wire. > By re-running the same test with a locally modified version of > AbstractPollingIoProcessor, such that the select() timeout was 750ms > instead of 1000ms, we found that the messages were delayed by approximately > 750ms instead. > Further inspection of AbstractPollingIoProcessor.flush(T session) suggests a > potential race condition that would explain this behavior: > public final void flush(T session) { > boolean needsWakeup = flushingSessions.isEmpty(); > if (scheduleFlush(session) && needsWakeup) { > wakeup(); > } > } > For the purposes of this description, let's say > AbstractPollingIoProcessor.flush(T session) is called by the [writer] > thread, while Selector.select(long timeout) is called by the [processor] > thread. > [writer] flushingSessions.isEmpty() -> false (last flushing session > being processed) > [processor] flush(currentTime), flushingSessions.poll() empties > flushingSessions, completely write last flushing session, write request queue > empty, no subsequent flush scheduled > [writer] scheduleFlush(session) returns true, flushingSessions now > non-empty, requires wakeup(), but not called (!) > [processor] stalls until select timeout expires, then flushes > It seems that the intent of needsWakeup is to avoid calling wakeup() > unnecessarily by tracking when flushingSessions becomes non-empty, but it > does not seem to capture the edge case where flushSessions starts out > non-empty, then becomes empty, then becomes non-empty again, triggering the > stall. > We found that the following modification to > AbstractPollingIoProcessor.flush(T session) addressed the issue without > adverse effects. > public final void flush(T session) { > if (scheduleFlush(session)) { > wakeup(); > } > } > The difference in behavior is as follows: > 1. wakeup() could be called many times between calls to select() > 2. select() might unblock one extra time when the selector is otherwise idle > Case (1) is documented by Selector.select to be equivalent to a single call > to wakeup() between calls to select(). The assumption is that subsequent > calls to wakeup() are cheap because it is latched internally. > Case (2) can occur as follows: > [writer] scheduleFlush(session) returns true > [processor] processes all flushingSessions, including most recent addition by > [writer] > [writer] calls wakeup() > [processor] immediately unblocks next call to select() due to wakeup() call > by [writer] > [processor] blocks again on next call to select() > When the [processor] unblocks select() due to wakeup() call by [writer], this > is not an _extra_ unblock when there is other work to > do in the [processor] loop, so any additional overhead introduced by > eliminating the needsWakeup check is only observed when the selector > is otherwise idle, and is therefore considered harmless. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.