On 11/18/2014 01:27 AM, Amos Jeffries wrote:

On 18/11/2014 6:10 a.m., Tsantilas Christos wrote:
On 11/16/2014 01:05 PM, Amos Jeffries wrote:

On 16/11/2014 7:38 a.m., Tsantilas Christos wrote:

> ....


For the record I am still extremely skeptical about the use-case
behind this feature. A timeout is a clear sign that the helper or the
system behind it is broken (capacity overload / network
congestion). Continuing to use such systems in a way which
further increases network load is very often a bad choice. Hiding
the issue is an even worse choice.


Maybe the administrator does not care about unanswered queries.

I posit that admins who truly do not care about unanswered questions
will not care enough to bother configuring this feature.

So he will not configure it.
The default behaviour is to not use a timeout.
The default behaviour is not changed.


The administrator should always care about these unanswered questions.
Each one means worse proxy performance. Offering to hide them silently
and automatically, without anybody having to do any work on fixing
the underlying cause, implies they are not actually problems. That is
a false implication.


This option can be used to increase proxy performance; that is its purpose.



Imagine a huge Squid proxy which may have to process thousands of
URLs per minute, where the administrator has decided that 10% of helper
requests can fail. In this case we are giving him a way to configure
the Squid behaviour: ignore the problem, use a
predefined answer, etc.

Now that same Squid proxy suddenly starts randomly and wrongly rejecting
just one request per minute out of all those thousands.
  Which one, why, and WTF can anybody do about it?

Only requests which timed out, and only if configured to do so...


Previous Squid would log a client request with _TIMEOUT, leave a
helper in Busy or Pending state with a full trace of the pending lookup
in mgr reports, and possibly even emit cache.log warnings about helper
queue length.

The patch does not change Squid behaviour if the timeout is not configured (the default).


Now all that is reduced to an overly simple aggregated "Requests timed
out:" hiding in a rarely viewed manager report and a hidden level-3
debug message that lists an anonymous requestId from N identical
requestIds spread over N helpers.


If configured, you will see in the mgr report a "Requests timed out:" counter for each running server. If you see that a server has many more timed-out requests than the other servers, you can kill it if you consider it a problem.



The reason the helper does not answer fast enough may be a
database or an external lookup failure (for example, a categorized
URLs list served as a DNS service). In these cases the system admin or
service provider may prefer no answer from the helper, or a
preconfigured answer, instead of waiting too long for the answer.

What to do when the internal dependencies are going badly is something
that should be handled *inside the helper*.
  After all, Squid has absolutely no way to know whether its generic lookup
helper has a broken DNS lookup for the DB server name or a broken DB query
itself. Each of those might have a different "best" way to respond.

The intention of the BrokenHelper code was to have the helpers
explicitly inform Squid that they were in trouble and needed help
shifting the load elsewhere.

Squid silently "forgetting" that requests have been sent and then
sending *more* is a truly terrible way to fix all the cases of helper
overload.

Again, the timeout is optional. It is an extra option.
If the customer/Squid user wants to use the BH code, they still can.



The helper can be implemented to not answer at all after a timeout
period. A policy of "if you do not answer in a day, please do not
answer at all, I am not interested any more" is common in the human
world, and in business.


Yes, it is also being fingered as a major reason for businesses dying
off during the recent recession....

:-)
Well, we could have a huge discussion about the recent recession, but I am sure I have more examples than you of failing businesses in a recessionary environment!

..... Late respondents lost out on work and
thus cashflow.

Yes, but you cannot avoid such cases. A stop-loss after a reasonably configured timeout is not a bad tactic in such cases.



in src/cache_cf.cc

* please move the *_url_rewrite_timeout() functions to
redirect.cc - that will help reduce dependency issues when we get
around to making the redirector config a standalone FooConfig
object.

Not so easy, because of dependencies on time-related parsing
functions... I left it for now. If required we must move the time-parsing
functions to a library file, for example Config.cc.


Pity. Oh well.


* it would be simpler and easier on users to omit the
"on_timeout=use_configured_response response=" and just have the
existence of a response= kv-pair imply the on_timeout action.
- that also allows the above suggested removal of the bypass and retry
special actions without losing any functionality.
- that removes the FATAL errors if the admin orders the options
differently from what you expect.
- and it allows for forward-compatibility with unknown options. A while
loop fetching kv-pairs and processing them as found, regardless of order,
seems best for a field documented as "[options]".
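
A minimal sketch of such a loop (illustrative only; setTimeoutResponse()
is a hypothetical stand-in for the patch's response handling):

    // Accept kv-pairs in any order; warn on unknown keys instead of
    // raising FATAL, keeping the "[options]" field forward-compatible.
    while (char *token = ConfigParser::NextToken()) {
        if (strncmp(token, "response=", 9) == 0)
            setTimeoutResponse(token + 9); // response= implies the on_timeout action
        else
            debugs(3, DBG_IMPORTANT, "WARNING: unknown url_rewrite_timeout option '" << token << "'");
    }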


True. This is fixed in this patch.

The patch adds a memory leak when setting Config.onUrlRewriteTimeout.response
multiple times.

Also, in free_url_rewrite_timeout() use safe_free() to release and NULL
the pointer in one call.

Also, in dump_url_rewrite_timeout() please do not assert(). That will do
nasty things in SMP-mode manager reports. Handle the error instead.
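
Concretely, something along these lines (a sketch; "value" stands for
the parsed token):

    // in parse_url_rewrite_timeout(): free any earlier value before
    // storing a new one, so repeated response= settings do not leak
    safe_free(Config.onUrlRewriteTimeout.response);
    Config.onUrlRewriteTimeout.response = xstrdup(value);

    // in free_url_rewrite_timeout(): release and NULL in one call
    safe_free(Config.onUrlRewriteTimeout.response);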

ok for these.




* please define the new handleAdaptationFailure() *above* the
new calloutsError(). - that will reduce the diff and make it
clearer what actually is changing and what not.



Still not doing this. It adds 40 lines to your diff when there are only
5-6 lines of change.

Sorry. It looks like I forgot to save before making the diff...




* it's now possible for an adaptation error to silently continue
processing. Before, it would hang. I am not sure whether that is a good or
a bad change. But it may be worth noting the difference in
handleAdaptationFailure(), where it now unconditionally calls
doCallouts() at the end.

There is no change in the handling of adaptation errors, or there should
not be. Am I missing something?

The new handleAdaptationFailure() runs doCallouts() unconditionally:
   ...
   calloutsError(ERR_ICAP_FAILURE, errDetail);
   doCallouts();

The old code was:
  ...
  if (calloutContext) {
    ... <body of new calloutsError> ...
    doCallouts();
  }


The new code for calloutsError() uses the if-condition internally now.
But this leaves the terminal doCallouts() from
handleAdaptationFailure() unprotected.


I think I had checked and found that it is not possible to have calloutContext==NULL. I will put an "if (calloutContext)" check before calling doCallouts() to be safe.
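
That is, the tail of handleAdaptationFailure() would become (sketch):

    calloutsError(ERR_ICAP_FAILURE, errDetail);
    if (calloutContext) // guard the terminal call, matching the old code path
        doCallouts();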


in src/helper.cc:

* adding multiple double-empty lines in the chunk @195

* helper_server::checkForTimedOutRequests() is deducting from
stats.pending. This is incorrect; the helper request is
still pending.

Amos, I think it is better as is... Maybe from one point of view the
request is still pending, but we are not waiting for an answer to this
request any more.

Any helper which is obeying the helper protocol *will* still be
consuming resources trying to satisfy it.
This is what I mean by forgetting and sending more requests to the
helper "making the overload problem worse".

Maybe we should document this behaviour in the option documentation. Administrators who want to use the timeout have to take care of it.
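
For context, the disputed logic amounts to roughly this (an
illustrative sketch, not the patch verbatim; accessor names are
hypothetical):

    // helper_server::checkForTimedOutRequests() sketch: fail every
    // queued lookup older than the configured timeout.
    while (Request *r = oldestPendingRequest()) { // hypothetical accessor
        if (!timedOut(r))
            break;
        callTimedOutCallback(r); // hand a timeout answer to the caller
        removeFromQueue(r);      // contested: entry vanishes from mgr reports
        --stats.pending;         // contested: the helper is still working on it
    }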


- your act of removing the requests from the queue invalidates
the logic in helperReturnBuffer() recording and managing late
helper responses.

why?

Non-concurrent helpers:
  * #1 query times out and gets discarded.
  * Lookup logic finding a helper to use sees a helper without pending
requests and delivers a second lookup (#2).
  * #1 response arrives late and is used as response to the current
pending (#2) lookup.

This will be discarded, because the request will get a new requestId
before being retried. The old requestId is not valid any more.

  * #2 response arrives and is discarded.

This will be accepted.

--> helper responses are hereafter off-by-1 until another timeout
makes it off-by-2 or lucky random timing makes it off-by-0 again.

Please note that the helpers with the most common long-duration lookups
are NTLM and Kerberos crypto validation, which do not support
concurrency channels.

Yes. They cannot use the timeout feature.
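
For concurrent helpers, the requestId mechanism described above amounts
to this (an illustrative sketch; findPendingByRequestId() is a
hypothetical stand-in for the matching done in helperReturnBuffer()):

    // A late answer carries the requestId of the original, timed-out
    // query. The retry was queued under a fresh requestId, so the stale
    // answer matches nothing and is dropped rather than mis-delivered.
    Request *r = findPendingByRequestId(answer.requestId); // hypothetical
    if (!r) {
        debugs(84, 3, "ignoring late answer with stale requestId " << answer.requestId);
        return;
    }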


- both of the above also corrupt the mgr report content for
expected pending responses. By all means call (and clear) the
callback on a timeout, but please do not remove the queue entries
for them until the helper actually responds to that query.

Some of the queries will never be answered. This may result in
memory leaks.

see the point I made next:

- if a helper gets too many un-responded queries, the traffic
being served by it is f'ckd and we should shut down the helper. If
an auto-restart does not work, manual intervention is probably
required from the admin to resolve why it is timing out *at
all*.

Yes, this is true. But the admin can use mgr to find out if many requests have timed out for a server, and then he can kill the problematic server.






NP: when doing a re-try we may still get a response to this
lookup from *either* of the helpers which have been asked for it.
The first helper is in fact more likely to respond first. IMO we
should be taking advantage of that and leaving both helpers with a
shared pointer which either can call+unset the callback for.

Maybe this is a good optimization. However, it requires some work... We
can add it later if required, as an optimization...


Ok. Please add a TODO about it at least.
ok
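
A possible shape for that TODO (sketch only; SharedLookup and the
surrounding wiring are hypothetical):

    // Both helpers hold the same lookup state; whichever answers first
    // fires the callback and unsets it, so the later answer is a no-op.
    struct SharedLookup {
        HLPCB *callback; // Squid's helper callback type
        void *data;
    };
    // on an answer from either helper:
    if (lookup->callback) {
        HLPCB *cb = lookup->callback;
        lookup->callback = NULL; // unset first: the other helper's answer is ignored
        cb(lookup->data, answer);
    }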



Please also document clearly in the release notes that:

A) The helper-mux.pl we have been distributing for the past few years
to encourage use of concurrency is no longer compatible with
Squid. If used, it will spawn up to 2^64 helpers and DoS the Squid
server.

What is this helper? I cannot find it inside the Squid sources...

Under tools/
http://wiki.squid-cache.org/Features/HelperMultiplexer

We need to fix this helper to work with the new changes too...
OK.


B) helpers utilizing arrays to handle fixed amounts of
concurrency channels MUST be re-written to use queues and to be capable
of handling a 64-bit int as an index, or they will be vulnerable to
buffer overrun and arbitrary memory accesses.


OK for this


C) 32-bit helpers need re-writing to handle the concurrency
channel ID as a 64-bit integer value. If not updated they will
cause proxies to return unexpected results or time out once
crossing the 32-bit wrap boundary, leading to undefined behaviour
in the client HTTP traffic.


OK.
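
To make B) and C) concrete for helper authors, a minimal helper shaped
like this (illustrative only, not a distributed Squid helper) keys
per-channel state by a 64-bit ID instead of indexing a fixed array:

    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>

    int main() {
        // channel-ID -> in-flight request; a map cannot be overrun the
        // way a fixed-size array indexed by a narrower int can.
        std::map<uint64_t, std::string> pending;
        std::string line;
        while (std::getline(std::cin, line)) {
            std::istringstream in(line);
            uint64_t channel = 0; // the concurrency channel ID is 64-bit
            in >> channel;
            std::string rest;
            std::getline(in, rest);
            pending[channel] = rest;
            // ... perform the (possibly slow) lookup here ...
            std::cout << channel << " OK\n" << std::flush; // answer on the same channel
            pending.erase(channel);
        }
        return 0;
    }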



My vote on this is -0. Too many problems, and still no clear
use-case has been described[1] for this to be a reasonable
addition.


OK. I tried to do most of the fixes you suggested. I hope it
is OK.


Thank you for your comments.
Regards,
   Christos



Amos

