On Wed, Jul 23, 2003 at 03:52:43PM -0700, James B Robinson wrote:
> Yo, Eric
>
> Imagine my surprise while perusing the changelogs for Apache to run across:
>
>   *) mod_disk_cache works much better. This module should still
>      be considered experimental. [Eric Prud'hommeaux]
>
> I've just started trying to get either squid or apache to do mem caching
> on a half dozen 4G linux boxes and I'm being a bit frustrated by Apache's
> lack of reporting what is up in the cache. You know much about mod_cache,
> mod_mem_cache or know of people who do?

I was working on some packages that interact with the Vary header and
found that disk_cache wasn't paying correct attention to it anyway. I
hacked (at) them a bit and left them in a state where they could do
relatively simple caching operations. There should be no false hits,
but there are opportunities for false misses (entities that could have
been served from cache but, due to proxy naivete, weren't). The miss
scenario is as follows:

-CACHE MISS BUG SCENARIO-

C1: GET path1 HTTP/1.1
    Accept: text/html;q=0.5,application/soap+xml;q=1.0
    Accept-Language: fr;q=0.8, en;q=0.7
    Accept-Charset: iso-8859-5, unicode-1-1;q=0.8
    Accept-Encoding: gzip;q=1.0, identity; q=0.5, *;q=0
    Foo: bar

The proxy passes the request to the document server (or upstream proxy)
and gets back

S1: 200 OK
    Vary: Accept,Accept-Language
    Expires: Wed, 31 Dec 2003 16:00:00 GMT
    A: b
    ...data...

and dutifully records the entity along with all of the headers in C1
that were listed in S1.Vary. It stores them in a spot on the disk
computed by the hash of path1.

hash(path1):
    Accept: text/html;q=0.5,application/soap+xml;q=1.0
    Accept-Language: fr;q=0.8, en;q=0.7
    ...S1 data...

A subsequent request comes in for path1. Lucky case: it has the same
headers, and the proxy can match them against what was written at
hash(path1). It may have a different charset and encoding, as they were
not listed in the Vary header.

Another request comes in with a different Accept-Language header:

C2: GET path1 HTTP/1.1
    Accept: text/html;q=0.5,application/soap+xml;q=1.0
    Accept-Language: esparanto,iso-latin-pig

A false hit would be if the proxy said "why, I've got one of those" and
gave back the cached entity. I think I made sure that won't happen.
But I believe the proxy will replace the previously cached entity with
what comes back from S2 (the upstream response to C2).

hash(path1):
    Accept: text/html;q=0.5,application/soap+xml;q=1.0
    Accept-Language: esparanto,iso-latin-pig
    ...S2 data...

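The overwrite described in the scenario above can be modeled in a few lines. This is only a toy sketch in Python, not the mod_disk_cache code: the class name PathKeyedCache, the in-memory dict standing in for the disk, the choice of SHA-256 as the hash, and the exact-string header comparison are all assumptions made for illustration.

```python
import hashlib


class PathKeyedCache:
    """Toy model: one slot per path, keyed by hash(path) only (hypothetical)."""

    def __init__(self):
        self.slots = {}  # stands in for files on disk

    @staticmethod
    def key(path):
        return hashlib.sha256(path.encode()).hexdigest()

    def store(self, path, request_headers, vary, body):
        # Record only the request headers named in the response's Vary list.
        stored = {name: request_headers.get(name) for name in vary}
        # Same path means same slot: any earlier variant is overwritten.
        self.slots[self.key(path)] = (stored, body)

    def lookup(self, path, request_headers):
        slot = self.slots.get(self.key(path))
        if slot is None:
            return None
        stored, body = slot
        # Every varied header must match exactly, otherwise it is a miss.
        for name, value in stored.items():
            if request_headers.get(name) != value:
                return None
        return body


vary = ["Accept", "Accept-Language"]
c1 = {"Accept": "text/html;q=0.5,application/soap+xml;q=1.0",
      "Accept-Language": "fr;q=0.8, en;q=0.7",
      "Foo": "bar"}
c2 = {"Accept": "text/html;q=0.5,application/soap+xml;q=1.0",
      "Accept-Language": "esparanto,iso-latin-pig"}

cache = PathKeyedCache()
cache.store("path1", c1, vary, "S1 data")
hit_c1 = cache.lookup("path1", c1)          # lucky case: headers match
no_false_hit = cache.lookup("path1", c2)    # different language: miss
cache.store("path1", c2, vary, "S2 data")   # S2 replaces S1 in the slot
false_miss = cache.lookup("path1", c1)      # S1 is gone: false miss
```

Headers outside the Vary list (Foo here) play no part in matching, which mirrors the point about charset and encoding above.
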
-APPRAISAL OF CURRENT CODE-

The false miss comes when another request like C1 comes in. The proxy
no longer holds the response S1, so it gets a cache miss and sends a
request upstream. The resulting inefficiency can be estimated by
observing the traffic coming through and seeing whether the clients are
sending requests that vary by something listed in the response's Vary
header, and whether they are doing this more quickly than the entity
would expire of natural causes.

I believe that the current implementation is well worth its weight in
CPU time and maintenance. On the maintenance front, I don't believe
mod_disk_cache cleans up after itself, but a simple find of files with
an access time older than some interval will give you a nice
least-recently-used algorithm. Or you can give it its own filesystem
and let it bump its head and clean up whenever you feel like it.

The CPU involved in a cache miss is pretty minimal: a couple of entries
into a module, computing a hash, and a file open failure. The CPU
involved in a cache hit would have to be cracking large keys or
searching for aliens before it would be comparable with the time to
have sent the request to a distant server.

So, we have a working system, but I think it could be improved easily:

-PROPOSED FIX-

The inefficiency comes from storing the cache entry at a hash
calculated only from the request path. If the varied headers were added
to that hash, we would have a place to store all the variations of the
entity. But we'd have to be clairvoyant to know which of the request
headers that came in would be needed to calculate the hash. To find
this, I believe hash(path) needs to contain a list of the Vary headers
for the server response(s). I.e., after the two requests above, the
proxy would have

-PROPOSAL P1-

hash(path1):
    Accept
    Accept-Language

hash(path1 . Accept: ... . A-L: fr;q=0.8, en;q=0.7):
    Accept: text/html;q=0.5,application/soap+xml;q=1.0
    Accept-Language: fr;q=0.8, en;q=0.7
    ...S1 data...

hash(path1 . Accept: ... . A-L: esparanto,iso-latin-pig):
    Accept: text/html;q=0.5,application/soap+xml;q=1.0
    Accept-Language: esparanto,iso-latin-pig
    ...S2 data...

-PROPOSAL P1a-

Alternatively, hash(path1) could compute a directory name. The Vary
list could be stored in hash(path1)/vary and the other documents could
be stored in entries like
hash(path1)/hash(Accept: ... . A-L: esparanto,iso-latin-pig). It should
be easy to ensure that the name produced by hashing the varied headers
would never collide with the special filename "vary".

I'd like some feedback on the comparative costs of two files vs. a
directory with two files in it, as this is the most common case for a
non-varied request. The cost would appear to be higher as we are adding
a directory in P1a, but that may help break up large directories. But I
don't know filesystems. Who does?

-ROCKS TO BE THROWN-

This solution assumes that the Vary header will be constant for a given
path. HTTP does not make this promise, so the cache module will need to
rewrite the Vary header list if it differs from the Vary header of any
response it receives from upstream. We could solve this problem by...

-PROPOSAL P1a1-

...walking the directory in P1a (above) to look for the first entry
whose headers all match the current cache candidate. This elides the
scenario where Vary headers are seemingly inconsistent:

R3: GET /path1 HTTP/1.1
    Accept: text/html;q=0.5,application/soap+xml;q=1.0
S3: 200 OK
    Vary: Accept

R4: GET /path1 HTTP/1.1
    Accept: text/html;q=0.5,application/soap+xml;q=1.0
    Accept-Language: fr;q=0.8, en;q=0.7
S4: 200 OK
    Vary: Accept,Language

but that's just messed up anyway.

Whose baby is disk cache now, anyway?
`cvs log modules/experimental/mod_disk_cache.c` shows brianp doing a
bit of protocol-level hacking there. My last patch was
[[
date: 2002/08/18 12:33:05;  author: stoddard;  state: Exp;  lines: +40 -2
Get mod_disk_cache working.
Submitted by: Eric Prud'hommeaux
Reviewes by: Paul Reder, Bill Stoddard
]]

I'd like to hear from folks about the proposals above (P1, P1a, P1a1
and son of the return of proposal P1a1a1a strikes back in 3D) and the
filesystem metrics. Also, is anyone here using disk_cache? It would be
cool to make this a nice showpiece for how HTTP caching is supposed to
work.

-- 
-eric

office: +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +1.857.222.5741

([EMAIL PROTECTED])
Feel free to forward this message to any list for any purpose other than
email address distribution.
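
For reference, the two-level lookup of proposal P1 might look roughly like the Python sketch below. It is an illustration under stated assumptions, not a patch for mod_disk_cache: the names VaryAwareCache and digest are hypothetical, a dict stands in for the disk, SHA-256 stands in for whatever hash the module uses, and the "vary"/"entity" key prefixes are an added assumption so that a response with an empty Vary list cannot collide with its own Vary record.

```python
import hashlib


def digest(text):
    # Hypothetical stand-in for the cache's hash function.
    return hashlib.sha256(text.encode()).hexdigest()


class VaryAwareCache:
    """Toy model of P1: hash(path) holds the Vary list; each variant
    lives at a hash of the path plus the varied request headers."""

    def __init__(self):
        self.entries = {}  # stands in for files on disk

    @staticmethod
    def vary_key(path):
        return digest("vary\n" + path)

    @staticmethod
    def variant_key(path, vary, request_headers):
        parts = [path]
        for name in vary:
            parts.append(name + ": " + request_headers.get(name, ""))
        return digest("entity\n" + " . ".join(parts))

    def store(self, path, request_headers, vary, body):
        # Rewrite the Vary list on every store, since HTTP does not
        # promise that it stays constant for a given path.
        self.entries[self.vary_key(path)] = list(vary)
        self.entries[self.variant_key(path, vary, request_headers)] = body

    def lookup(self, path, request_headers):
        vary = self.entries.get(self.vary_key(path))
        if vary is None:
            return None  # nothing cached for this path at all
        return self.entries.get(self.variant_key(path, vary, request_headers))


vary = ["Accept", "Accept-Language"]
c1 = {"Accept": "text/html;q=0.5,application/soap+xml;q=1.0",
      "Accept-Language": "fr;q=0.8, en;q=0.7"}
c2 = {"Accept": "text/html;q=0.5,application/soap+xml;q=1.0",
      "Accept-Language": "esparanto,iso-latin-pig"}

p1 = VaryAwareCache()
p1.store("path1", c1, vary, "S1 data")
p1.store("path1", c2, vary, "S2 data")
after_c1 = p1.lookup("path1", c1)   # S1 survives the store of S2
after_c2 = p1.lookup("path1", c2)
other = p1.lookup("path1", {"Accept": "image/png"})
```

Both variants remain retrievable after the second store, which is the false miss from the scenario going away. The sketch compares header values as exact strings; a real cache would probably want to normalize them first.
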