On 02/09/11 04:55, Alex Rousskov wrote:
On 09/01/2011 05:47 AM, Amos Jeffries wrote:
This patch seeks to avoid race conditions between the AsyncCalls for
the pconn monitoring handlers and lookups pulling a pconn up for use.

...

It adds checks for the timeout handler, as discussed in bug 3281,
preventing the related type of race from the timeout handler.

+    // our connection timeout handler is scheduled to run already. unsafe
+    if (fd_table[conn->fd].timeoutHandler == NULL)
+        return false;

I am not strongly against this, but it would be more elegant, and may be
more efficient, to let the about-to-timeout connections be reused
instead. This can be done safely, for example, by cancelling the pending
timeout notification callback in the code that hands out the connection.

Cancelling calls here would mean re-writing IdleConnList to store the calls so they can be cancelled. As it stands right now, once a call is scheduled there is no code outside the AsyncEngine with a pointer to it. This patch would still be needed after all that change, but it would cancel the call and return true instead of just returning false.

I'll post a new patch shortly. With isAvailable() protecting against unsafe re-use cases, the handlers for both read and timeout can now unconditionally clean up and close the idle conn. The read callback may not even need to bother with the CLOSING flag, but I think the forced fd_table[] close cycle on shutdown (bug 2110) might still produce that case.
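To illustrate the guard being discussed, here is a minimal self-contained sketch (hypothetical names: FdEntry, fd_table_sketch, popIdle; not Squid's real types) of the isAvailable() idea: an idle connection is only handed out while its timeout handler is still registered, because a cleared handler slot means the timeout AsyncCall is already in flight and reuse would race with it.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <vector>

// Hypothetical stand-in for one fd_table[] slot.
struct FdEntry {
    std::function<void(int)> timeoutHandler; // empty => timeout call already scheduled
};

std::map<int, FdEntry> fd_table_sketch;

bool isAvailable(int fd) {
    // Mirrors the patch hunk: a missing timeout handler means the timeout
    // notification is already queued, so reusing this conn is unsafe.
    auto it = fd_table_sketch.find(fd);
    return it != fd_table_sketch.end() && static_cast<bool>(it->second.timeoutHandler);
}

int popIdle(const std::vector<int> &idle) {
    for (int fd : idle)
        if (isAvailable(fd))
            return fd; // safe to reuse
    return -1; // caller opens a fresh connection; handlers will close the rest
}
```

With this check in place, the entries that fail it are simply left for their pending read/timeout handlers to close, which is what lets those handlers clean up unconditionally.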


Thank you,

Alex.
P.S. Higher level rant: We have an increasing amount of code devoted to
checking whether some Comm information is already "in flight" and trying
to predict what it will be or trying to react to it before it comes to
us. Most of that code accesses internal Comm structures. I understand
that this code is being created to address specific problems, but the
direction of this worries me.

We need to think of a better (uniform and encapsulated) way to handle
the asynchronous nature of Comm-Job relationship or we will end up with
too many similar-but-different and half-working checks in all Comm-using
Jobs. We have been there before with the old spaghetti code (for
slightly different reasons); let's try not to fall into the same trap.

I do not have a well-formulated suggestion for the above high-level
problem, but I may be able to spend some time on it after Rock Store is
in. Meanwhile, every time you write fd_table[] outside of Comm, please
pause and ask yourself whether there is a better way of doing it :-).

Agreed. I'm pretty sure that accessing fd_table all over the place is one part of the CPU cache-busting performance problem we have.


I was experimenting with this earlier. The best solution does seem to be pro-active and rigorous cancel() everywhere relevant, which means speeding up the SourceLayout conversion to use polished Async calls everywhere. Doxygen can help us there by telling us where each member of fd_table is used (provided we have set it to auto-define ALL the component macros).


*** No more transitional wrapper functions. Code which requires them has neither access to the Call in order to cancel() it nor enough Async support to be optimized well. If you look closely, that is the code leading to these races more often than not, followed by upgraded code like IdleConnList which forgets the calls it has registered.


I suspect, but am not sure, that if the cancel() code for AsyncCall were also to make the Params drop any RefCount pointers etc. they hold, we could reduce the in-flight RAM usage a bit essentially for free. That would also shift the CPU time of the relevant destructors out of the AsyncEngine, where it drops the cancelled calls, to the code doing the cancelling, possibly giving us slightly better performance measurements.
 It looks like a major, deep change though.
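As a rough sketch of that idea (hypothetical names: AsyncCallSketch, Params; not Squid's actual AsyncCall implementation), cancel() could release the Params immediately, so any ref-counted state they hold is destroyed in the canceller's context rather than later, when the engine discards the dead call:

```cpp
#include <cassert>
#include <memory>

// Stand-in for call parameters holding ref-counted state.
struct Params {
    std::shared_ptr<int> conn; // represents a RefCount pointer kept in-flight
};

class AsyncCallSketch {
    std::unique_ptr<Params> params_;
    bool cancelled_ = false;
public:
    explicit AsyncCallSketch(std::unique_ptr<Params> p) : params_(std::move(p)) {}

    void cancel() {
        cancelled_ = true;
        params_.reset(); // drop held references now, in the canceller's context
    }

    // What the engine would do when it reaches this call in its queue.
    bool fire() {
        if (cancelled_)
            return false; // dead call: nothing left to destroy here
        // ... dial the handler with *params_ ...
        return true;
    }
};
```

The destructor cost moves to cancel() and the cancelled call sits in the engine's queue holding no references, which is where the "in-flight RAM" saving would come from.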

Amos
--
Please be using
  Current Stable Squid 2.7.STABLE9 or 3.1.15
  Beta testers wanted for 3.2.0.10
