On 02/09/11 04:55, Alex Rousskov wrote:
On 09/01/2011 05:47 AM, Amos Jeffries wrote:
This patch seeks to avoid race conditions between the AsyncCalls for
the pconn monitoring handlers and lookups pulling a pconn up for use.

...

It adds checks for the timeout handler, as discussed in bug 3281,
preventing the related type of race from the timeout handler.

+    // our connection timeout handler is scheduled to run already. unsafe
+    if (fd_table[conn->fd].timeoutHandler == NULL)
+        return false;

I am not strongly against this, but it would be more elegant, and may be
more efficient, to let the about-to-timeout connections be reused
instead. This can be done safely, for example, by cancelling the pending
timeout notification callback in the code that hands out the connection.

Cancelling calls here would mean re-writing IdleConnList to store the calls so they can be cancelled. As it stands right now, once a call is scheduled there is no code outside the AsyncEngine with a pointer to it. This patch would still be needed after all that change, but it would cancel the call and return true instead of just returning false.

I'll post a new patch shortly. With isAvailable() protecting against unsafe re-use cases, the handlers for both read and timeout can now unconditionally clean up and close the idle conn. The read callback may not even need to bother with the CLOSING flag, but I think the forced fd_table[] close cycle on shutdown (bug 2110) might still produce that case.
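To illustrate the guard being discussed, here is a minimal self-contained sketch (hypothetical names: FdEntry, fd_table_sketch, popIdle; not Squid's real types) of the isAvailable() idea: an idle connection is only handed out while its timeout handler is still registered, because a cleared handler slot means the timeout AsyncCall is already in flight and reuse would race with it.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <vector>

// Hypothetical stand-in for one fd_table[] slot.
struct FdEntry {
    std::function<void(int)> timeoutHandler; // empty => timeout call already scheduled
};

std::map<int, FdEntry> fd_table_sketch;

bool isAvailable(int fd) {
    // Mirrors the patch hunk: a missing timeout handler means the timeout
    // notification is already queued, so reusing this conn is unsafe.
    auto it = fd_table_sketch.find(fd);
    return it != fd_table_sketch.end() && static_cast<bool>(it->second.timeoutHandler);
}

int popIdle(const std::vector<int> &idle) {
    for (int fd : idle)
        if (isAvailable(fd))
            return fd; // safe to reuse
    return -1; // caller opens a fresh connection; handlers will close the rest
}
```

With this check in place, the entries that fail it are simply left for their pending read/timeout handlers to close, which is what lets those handlers clean up unconditionally.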


Thank you,

Alex.
P.S. Higher level rant: We have an increasing amount of code devoted to
checking whether some Comm information is already "in flight" and trying
to predict what it will be or trying to react to it before it comes to
us. Most of that code accesses internal Comm structures. I understand
that this code is being created to address specific problems, but the
direction of this worries me.

We need to think of a better (uniform and encapsulated) way to handle
the asynchronous nature of Comm-Job relationship or we will end up with
too many similar-but-different and half-working checks in all Comm-using
Jobs. We have been there before with the old spaghetti code (for
slightly different reasons); let's try not to fall into the same trap.

I do not have a well-formulated suggestion for the above high-level
problem, but I may be able to spend some time on it after Rock Store is
in. Meanwhile, every time you write fd_table[] outside of Comm, please
pause and ask yourself whether there is a better way of doing it :-).

Agreed. I'm pretty sure that accessing fd_table all over the place is one part of the CPU cache-busting performance problem we have.


I was experimenting with this earlier. The best solution does seem to be pro-active and rigorous cancel() everywhere relevant, which means speeding up the SourceLayout conversion to use polished Async calls everywhere. Doxygen can help us there by telling us where each member of fd_table is used (provided we have set it to auto-define ALL the component macros).


*** No more transitional wrapper functions. Code which requires them has neither access to the Call in order to cancel() it nor enough Async support to be optimized well. If you look closely, that is the code leading to these races more often than not, followed by upgraded code like IdleConnList which forgets the calls it has registered.


I suspect, but am not sure, that if the cancel() code for AsyncCall were also to make the Params drop any RefCount pointers etc. they hold, we could reduce the in-flight RAM usage a bit essentially for free. That would also shift the CPU time of the relevant destructors out of the AsyncEngine, where it drops the cancelled calls, to the code doing the cancelling, possibly giving us slightly better performance measurements.
 It looks like a major, deep change though.
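As a rough sketch of that idea (hypothetical names: AsyncCallSketch, Params; not Squid's actual AsyncCall implementation), cancel() could release the Params immediately, so any ref-counted state they hold is destroyed in the canceller's context rather than later, when the engine discards the dead call:

```cpp
#include <cassert>
#include <memory>

// Stand-in for call parameters holding ref-counted state.
struct Params {
    std::shared_ptr<int> conn; // represents a RefCount pointer kept in-flight
};

class AsyncCallSketch {
    std::unique_ptr<Params> params_;
    bool cancelled_ = false;
public:
    explicit AsyncCallSketch(std::unique_ptr<Params> p) : params_(std::move(p)) {}

    void cancel() {
        cancelled_ = true;
        params_.reset(); // drop held references now, in the canceller's context
    }

    // What the engine would do when it reaches this call in its queue.
    bool fire() {
        if (cancelled_)
            return false; // dead call: nothing left to destroy here
        // ... dial the handler with *params_ ...
        return true;
    }
};
```

The destructor cost moves to cancel() and the cancelled call sits in the engine's queue holding no references, which is where the "in-flight RAM" saving would come from.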

Amos
--
Please be using
  Current Stable Squid 2.7.STABLE9 or 3.1.15
  Beta testers wanted for 3.2.0.10
