Re: [RFC] byte hit ratio

Amos Jeffries Tue, 07 Feb 2012 04:00:44 -0800

On 7/02/2012 9:40 p.m., Henrik Nordström wrote:

On Tue 2012-02-07 at 14:01 +1300, Amos Jeffries wrote:

We have a long history of questions and bugs mentioning negative
numbers in the byte hit ratio.


I've always thought it was a bug we had not tracked down, but the FAQ
says it is correct.
http://wiki.squid-cache.org/SquidFaq/InnerWorkings#Why_do_I_see_negative_byte_hit_ratio.3F

Yes.. it's based on the difference between traffic squid<-servers and
clients<-squid. This can be negative (more traffic squid<-servers than
clients<-squid) in some situations.

   - retried requests
   - range retrieval being processed by Squid
   - continued download after client disconnects (quick_abort_...)
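
To make the quick_abort case concrete, here is a minimal sketch in C of how one aborted download can push the ratio negative. The struct and function names are hypothetical and are not Squid's actual counters; only the (client - server) / client calculation comes from the description above.

```c
/* Hypothetical traffic counters for illustration, not Squid's StatCounters. */
struct traffic {
    double client_out_kb;  /* clients<-squid */
    double server_in_kb;   /* squid<-servers */
};

/* Add one transaction's traffic to the running totals. */
void record(struct traffic *t, double to_client_kb, double from_server_kb)
{
    t->client_out_kb += to_client_kb;
    t->server_in_kb += from_server_kb;
}

/* (client - server) / client; returns 0 when there is no client traffic. */
double byte_ratio(const struct traffic *t)
{
    if (t->client_out_kb <= 0.0)
        return 0.0;
    return (t->client_out_kb - t->server_in_kb) / t->client_out_kb;
}
```

For example, if a client disconnects after receiving 100 KB of a 1000 KB object and Squid continues the fetch, calling record(&t, 100.0, 1000.0) on zeroed counters makes byte_ratio() negative, because the server side saw ten times the client-side traffic.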


Wiki also mentions cache digests but ...
"    /*
     * This ugly hack is here to prevent the user from seeing a
     * negative byte hit ratio.  When we fetch a cache digest from
     * a neighbor, it gets treated like a cache miss because the
     * object is consumed internally.  Thus, we subtract cache
     * digest bytes out before calculating the byte hit ratio.
     */

cd = CountHist[0].cd.kbytes_recv.kb - CountHist[minutes].cd.kbytes_recv.kb;
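
As a rough illustration of what that comment describes, the sketch below removes digest bytes from the server-side total before taking the ratio. It is a simplified stand-in under assumed names (the function names and parameters are hypothetical), not the actual Squid function.

```c
#include <stddef.h>

/* Server-side KB with internally consumed cache digest KB removed.
 * Clamped at zero because size_t is unsigned. */
size_t server_kb_without_digests(size_t server_kb, size_t digest_kb)
{
    return (digest_kb > server_kb) ? 0 : server_kb - digest_kb;
}

/* (client - server) / client with digest bytes excluded from server.
 * The branch avoids unsigned underflow when server exceeds client. */
double byte_ratio_excluding_digests(size_t client_kb, size_t server_kb,
                                    size_t digest_kb)
{
    size_t s = server_kb_without_digests(server_kb, digest_kb);

    if (client_kb == 0)
        return 0.0;
    if (client_kb >= s)
        return (double)(client_kb - s) / (double)client_kb;
    return -((double)(s - client_kb) / (double)client_kb);
}
```

With 1000 KB sent to clients and 1500 KB received from servers, the ratio is negative when no digest traffic is subtracted, and turns positive once 750 KB of that server traffic is identified as cache digests.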

I've discussed this with a professional statistician I work with and
she agrees the algorithm is not calculating hit ratio as per our
definition of what a HIT is. What it does seem to be calculating is a
net traffic GAIN ratio.

Yes.

What I propose is to make the numbers reported as HIT ratios use the
same algorithm, namely the current request ratio one, and to add
alongside this a record for Gain/Loss Ratio as output by this byte
calculation.

Why is it interesting to calculate a nicer but very inaccurate number?


Which one is inaccurate?

"Hits as % of traffic sent" with calculation of (net traffic / client_bytes)

or

"Net traffic gain/loss" with calculation of (net traffic / client_bytes)

or

"Hits as % of client traffic" with calculation of (sum_hits / client_bytes)


One guess which one we have today ...

To hide that the proxy cache may actually cause higher bandwidth usage
than not having the proxy cache?

This is where the mistake rears its head. The excess server-side traffic
is not related to HITs, but to normal proxy behaviour. The HIT % of
client traffic may in fact be reducing that negative from some other
larger negative.

This is why I am more in favour of adding gain ratio alongside the hit
ratios or just changing the descriptive text. The negative is not lost
but explained.

Making HIT % use the same calculation as request ratio would mean adding
HIT traffic byte counters which don't exist now.


I would argue that the request hit ratio calculation is the broken one
from a statistical point of view.

The byte ratio calculation is simply that: a byte ratio, with no
relevance to HIT or MISS.

Traffic we classify as MISS is included in the divisor for the existing
byte algorithm.

If it were actually (client_traffic - server_traffic) / hit_bytes or
hit_bytes / (client_traffic - server_traffic) that would be an accurate
HIT bytes algorithm.

Instead we currently have (client_traffic - server_traffic) /
client_traffic which is the gain score for net traffic.
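
A small sketch in C may help show how the two numbers can diverge. Everything here is hypothetical: the struct is not Squid's, and hit_out in particular stands in for the HIT byte counter that does not exist today.

```c
/* Hypothetical byte counters for illustration only. */
struct byte_counters {
    double client_out;  /* total bytes sent to clients */
    double server_in;   /* total bytes received from servers */
    double hit_out;     /* assumed counter: bytes sent to clients from HITs */
};

/* Net traffic gain: (client_traffic - server_traffic) / client_traffic. */
double gain_ratio(const struct byte_counters *c)
{
    if (c->client_out <= 0.0)
        return 0.0;
    return (c->client_out - c->server_in) / c->client_out;
}

/* HIT bytes as a fraction of client traffic: sum_hits / client_bytes. */
double hit_byte_ratio(const struct byte_counters *c)
{
    if (c->client_out <= 0.0)
        return 0.0;
    return c->hit_out / c->client_out;
}
```

With client_out = 1000, server_in = 1500 and hit_out = 500, half of the client traffic was served from HITs, yet the gain is negative because of server-side traffic unrelated to those HITs. Reporting both figures keeps the negative visible while still describing the HITs accurately.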

We get asked about "bandwidth gain" often, I think it would be useful to
have something in the report using the term "gain".


Amos
