On 7/02/2012 9:40 p.m., Henrik Nordström wrote:
Tue 2012-02-07 at 14:01 +1300, Amos Jeffries wrote:
We have a long history of questions and bugs mentioning negative
numbers in the byte hit ratio.
I've always thought it was a bug we had not tracked down, but the FAQ
says it is correct.
http://wiki.squid-cache.org/SquidFaq/InnerWorkings#Why_do_I_see_negative_byte_hit_ratio.3F
Yes. It's based on the difference between traffic squid<-servers and
clients<-squid. This can be negative (more traffic squid<-servers than
clients<-squid) in some situations.
- retried requests
- range retrieval being processed by Squid
- continued download after client disconnects (quick_abort_...)
Wiki also mentions cache digests but ...
" /*
* This ugly hack is here to prevent the user from seeing a
* negative byte hit ratio. When we fetch a cache digest from
* a neighbor, it gets treated like a cache miss because the
* object is consumed internally. Thus, we subtract cache
* digest bytes out before calculating the byte hit ratio.
*/
cd = CountHist[0].cd.kbytes_recv.kb -
     CountHist[minutes].cd.kbytes_recv.kb;
"
I've discussed this with a professional statistician I work with and
she agrees the algorithm is not calculating hit ratio as per our
definition of what a HIT is. What it does seem to be calculating is a
net traffic GAIN ratio.
Yes.
What I propose is to make the numbers reported as HIT ratios use the same
algorithm, the current request ratio one, and to add alongside this a
record for Gain/Loss Ratio as output by this byte calculation.
Why is it interesting to calculate a nicer but very inaccurate number?
Which one is inaccurate?
"Hits as % of traffic sent" with calculation of (net traffic / client
bytes)
or
"Net traffic gain/loss" with calculation of (net traffic / client_bytes)
or
"Hits as % of client traffic" with calculation of ( sum_hits /
client_bytes )
One guess which one we have today ...
To hide that the proxy cache may actually cause higher bandwidth usage
than not having the proxy cache?
This is where the mistake rears its head. The excess server-side traffic
is not related to HITs, but to normal proxy behaviour. The HIT % of
client traffic may in fact be reducing that negative from some other
larger negative.
This is why I am more in favour of adding gain ratio alongside the hit
ratios or just changing the descriptive text. The negative is not lost
but explained.
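To illustrate how HITs can be shrinking a negative rather than causing it, here is a minimal C sketch. It assumes a deliberately simplified model in which server-side traffic is the MISS bytes plus some overhead bytes (retries, range handling, downloads continued after a client abort). The function name and the model itself are hypothetical, for illustration only, and are not taken from the Squid source.

```c
/* Simplified model (an assumption for illustration, not Squid code):
 *
 *   server_bytes = (client_bytes - hit_bytes) + overhead_bytes
 *
 * where overhead_bytes covers retried requests, range handling and
 * downloads continued after the client disconnects.
 *
 * Substituting into (client_bytes - server_bytes) / client_bytes gives
 * (hit_bytes - overhead_bytes) / client_bytes, i.e. hit% minus overhead%.
 */
double modelled_gain_pct(double client_bytes, double hit_bytes,
                         double overhead_bytes)
{
    if (client_bytes <= 0.0)
        return 0.0; /* assumption: report 0 when there is no client traffic */
    return 100.0 * (hit_bytes - overhead_bytes) / client_bytes;
}
```

Under this model, 1000 bytes of client traffic with 500 bytes of overhead gives -50% with no HITs at all, and -20% once 300 of those bytes are served as HITs. The number is still negative, but the HITs made it less so.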
Making HIT % use the same calculation as request ratio would mean adding
HIT traffic byte counters which don't exist now.
I would argue that the request hit ratio calculation is the broken one
from a statistical point of view.
The byte ratio calculation is simply that: a byte ratio, with no relevance
to HIT or MISS.
Traffic we classify as MISS is included in the divisor for the existing
byte algorithm.
If it were actually (client_traffic - server_traffic) / hit_bytes or
hit_bytes / (client_traffic - server_traffic) that would be an accurate
HIT bytes algorithm.
Instead we currently have (client_traffic - server_traffic) /
client_traffic which is the gain score for net traffic.
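A small C sketch of the two quantities side by side may make the difference clearer. The function names are hypothetical, and `hit_bytes` stands for a HIT traffic byte counter that, as noted above, does not exist today.

```c
/* "Hits as % of client traffic": sum_hits / client_bytes.
 * Requires a HIT byte counter (hypothetical here). */
double hit_byte_pct(double hit_bytes, double client_bytes)
{
    if (client_bytes <= 0.0)
        return 0.0; /* assumption: report 0 when there is no client traffic */
    return 100.0 * hit_bytes / client_bytes;
}

/* "Net traffic gain/loss":
 * (client_traffic - server_traffic) / client_traffic.
 * Goes negative whenever more was fetched from servers than was
 * delivered to clients. */
double net_gain_pct(double client_bytes, double server_bytes)
{
    if (client_bytes <= 0.0)
        return 0.0; /* same assumption as above */
    return 100.0 * (client_bytes - server_bytes) / client_bytes;
}
```

For example, with 1000 bytes sent to clients, 300 of them from cache, and 1200 bytes fetched from servers, `hit_byte_pct(300, 1000)` is 30 while `net_gain_pct(1000, 1200)` is -20. Both numbers are valid, they just answer different questions.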
We get asked about "bandwidth gain" often, I think it would be useful to
have something in the report using the term "gain".
Amos