Re: Bad interaction between max_stale and negative caching (2.HEAD)

2008-09-18 Thread Mark Nottingham

Filed as bug 2468, patch attached:
  http://www.squid-cache.org/bugs/show_bug.cgi?id=2468

I marked it as blocking 2255, as that's the bug for porting stale-if-error to 3.x.




On 18/09/2008, at 10:39 PM, Amos Jeffries wrote:


Mark Nottingham wrote:
[...]


I'd agree with you based on that analysis. Can you add a bugzilla  
entry with a patch that does it?


Amos
--
Please use Squid 2.7.STABLE4 or 3.0.STABLE9


--
Mark Nottingham   [EMAIL PROTECTED]




Re: Bad interaction between max_stale and negative caching (2.HEAD)

2008-09-18 Thread Mark Nottingham

Will do tomorrow.

On 18/09/2008, at 10:39 PM, Amos Jeffries wrote:


Mark Nottingham wrote:
[...]


I'd agree with you based on that analysis. Can you add a bugzilla  
entry with a patch that does it?


Amos
--
Please use Squid 2.7.STABLE4 or 3.0.STABLE9


--
Mark Nottingham   [EMAIL PROTECTED]




Re: Bad interaction between max_stale and negative caching (2.HEAD)

2008-09-18 Thread Amos Jeffries

Mark Nottingham wrote:
I've got a user who's running a pair of peered accelerators, using both 
stale-while-revalidate and max_stale.


Occasionally, they see extremely old content being served; e.g., if CC: 
max-age is 60s, they might see something go by which is 1000-3000 
seconds old (but still within the max_stale window).


The pattern that appears to trigger this is when a resource with an 
in-cache 200 response starts returning 404s; when this happens, Squid 
will start returning TCP_NEGATIVE_HIT/200's. E.g. (traffic driven by 
squidclient),


1221713703.815    0 127.0.0.1 TCP_STALE_HIT/200 5234 GET http://server1//5012904 - NONE/- application/json
1221713703.979  164 0.0.0.0 TCP_ASYNC_MISS/404 193 GET http://server1/5012904 - FIRST_UP_PARENT/back-end-server1 text/plain
1221713711.431    0 127.0.0.1 TCP_NEGATIVE_HIT/200 5234 GET http://server1/5012904 - NONE/- application/json
1221713720.978    0 127.0.0.1 TCP_NEGATIVE_HIT/200 5234 GET http://server1/5012904 - NONE/- application/json
1221713723.483    0 127.0.0.1 TCP_NEGATIVE_HIT/200 5234 GET http://server1/5012904 - NONE/- application/json


As you can see, stale-while-revalidate kicks in, and the async refresh 
brings back a 404, but that doesn't get stored properly.


Looking at the code, I *think* the culprit is storeNegativeCache(), 
which will, assuming that max_stale is set (either in squid.conf or 
response headers), block the new response from updating the cache -- no 
matter what its status code is.


It makes sense to do this for 5xx status codes, because they're often 
transient, and reflect server-side problems. It doesn't make as much 
sense to do this for 4xx status codes, which reflect client-side issues. 
In those cases, you always want to update the cache with the most recent 
response (and potentially negative cache it, if the server is silly 
enough to not put a freshness lifetime on it).


The interesting thing, BTW, is that this only happens when collapsed 
forwarding is on, because this in httpReplyProcessHeader:


  if (neighbors_do_private_keys && !Config.onoff.collapsed_forwarding)
      httpMaybeRemovePublic(entry, reply);

masks this behaviour.

Thoughts? I'm not 100% on this diagnosis, as the use of peering and 
stale-while-revalidate make things considerably more complex, but I've 
had pretty good luck reproducing it... I'm happy to attempt a fix, but 
wanted input on what approach people preferred. Left to my own devices, 
I'd add another condition to this in storeNegativeCache():


if (oe && !EBIT_TEST(oe->flags, KEY_PRIVATE) && !EBIT_TEST(oe->flags, ENTRY_REVALIDATE))


to limit it to 5xx responses.



I'd agree with you based on that analysis. Can you add a bugzilla entry 
with a patch that does it?


Amos
--
Please use Squid 2.7.STABLE4 or 3.0.STABLE9


Bad interaction between max_stale and negative caching (2.HEAD)

2008-09-18 Thread Mark Nottingham

I've got a user who's running a pair of peered accelerators, using  
both stale-while-revalidate and max_stale.


Occasionally, they see extremely old content being served; e.g., if  
CC: max-age is 60s, they might see something go by which is 1000-3000  
seconds old (but still within the max_stale window).
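
(For reference, a minimal accelerator setup of the sort described might look
like the sketch below; every hostname, port and time value is a made-up
placeholder for illustration, not the user's actual configuration:

  # squid.conf sketch (hypothetical values; Squid 2.7/2.HEAD directives)
  http_port 80 accel defaultsite=server1
  cache_peer back-end-server1 parent 80 0 no-query originserver
  collapsed_forwarding on    # needed to expose the problem, see below
  max_stale 1 hour           # serve stale objects up to an hour past expiry

with the origin sending something like
Cache-Control: max-age=60, stale-while-revalidate=600 on its responses.)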


The pattern that appears to trigger this is when a resource with an
in-cache 200 response starts returning 404s; when this happens, Squid
will start returning TCP_NEGATIVE_HIT/200's. E.g. (traffic driven by
squidclient),


1221713703.815    0 127.0.0.1 TCP_STALE_HIT/200 5234 GET http://server1//5012904 - NONE/- application/json
1221713703.979  164 0.0.0.0 TCP_ASYNC_MISS/404 193 GET http://server1/5012904 - FIRST_UP_PARENT/back-end-server1 text/plain
1221713711.431    0 127.0.0.1 TCP_NEGATIVE_HIT/200 5234 GET http://server1/5012904 - NONE/- application/json
1221713720.978    0 127.0.0.1 TCP_NEGATIVE_HIT/200 5234 GET http://server1/5012904 - NONE/- application/json
1221713723.483    0 127.0.0.1 TCP_NEGATIVE_HIT/200 5234 GET http://server1/5012904 - NONE/- application/json


As you can see, stale-while-revalidate kicks in, and the async refresh  
brings back a 404, but that doesn't get stored properly.


Looking at the code, I *think* the culprit is storeNegativeCache(),  
which will, assuming that max_stale is set (either in squid.conf or  
response headers), block the new response from updating the cache --  
no matter what its status code is.


It makes sense to do this for 5xx status codes, because they're often  
transient, and reflect server-side problems. It doesn't make as much  
sense to do this for 4xx status codes, which reflect client-side  
issues. In those cases, you always want to update the cache with the  
most recent response (and potentially negative cache it, if the server  
is silly enough to not put a freshness lifetime on it).


The interesting thing, BTW, is that this only happens when collapsed  
forwarding is on, because this in httpReplyProcessHeader:


  if (neighbors_do_private_keys && !Config.onoff.collapsed_forwarding)
      httpMaybeRemovePublic(entry, reply);

masks this behaviour.

Thoughts? I'm not 100% on this diagnosis, as the use of peering and  
stale-while-revalidate make things considerably more complex, but I've  
had pretty good luck reproducing it... I'm happy to attempt a fix, but  
wanted input on what approach people preferred. Left to my own  
devices, I'd add another condition to this in storeNegativeCache():


if (oe && !EBIT_TEST(oe->flags, KEY_PRIVATE) && !EBIT_TEST(oe->flags, ENTRY_REVALIDATE))


to limit it to 5xx responses.
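
For illustration only, the extended test might read something like the
sketch below. This is just a sketch of the idea, not an actual patch, and
`status' is a placeholder for however the new reply's HTTP status ends up
being examined at that point in storeNegativeCache():

  /* sketch: keep the old stale-but-servable entry only for 5xx replies */
  if (oe && !EBIT_TEST(oe->flags, KEY_PRIVATE) &&
      !EBIT_TEST(oe->flags, ENTRY_REVALIDATE) &&
      status >= 500 && status < 600)

A 404 would then fall through and replace the cached object as usual, while
a 503 would still leave the stale entry in place.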



--
Mark Nottingham   [EMAIL PROTECTED]