Re: Possible new cache architecture
Graham Leggett wrote:
> I think in the long run, a dedicated process is the way to go.

I think using a provider architecture would be best and keep complexity out of mod_cache. Some module(s) would implement the necessary cache management functions and mod_cache would push/pull/probe the "manager" using this interface. The manager may or may not be tied to the storage provider. We may have enough "generic interfaces" already to allow completely "stand alone" cache managers.

At least, that's how I would do it...

-- Brian Akins Lead Systems Engineer CNN Internet Technologies
Re: Possible new cache architecture
> -----Original Message-----
> From: Joe Orton
>
> > 1. This is an API change which might be hard to backport.
> > 2. I do not really like the close tie between the storage provider
> >    and the filter chain. It forces the provider to do things it
> >    should not care about from my point of view.
>
> At least this much could be solved I suppose by passing in a callback of
> type apr_brigade_flush which does the pass to f->next; the storage

Sorry, but I guess that I do not understand this completely. So instead of passing f->next to store_body and making it call ap_pass_brigade with the "small" brigade and f->next, you propose to create a callback function of type apr_brigade_flush inside mod_cache and pass the pointer to this function plus f->next to store_body, such that it can call this function with the "small" brigade and f->next as the ctx parameter of apr_brigade_flush? This function then of course calls ap_pass_brigade.

> provider could remain filter-agnostic then. No idea about your other
> issues, sorry.

I will keep on thinking about this. Thanks for your help.

Regards

Rüdiger
Re: Possible new cache architecture
On Wed, May 03, 2006 at 02:07:44PM +0200, Plüm, Rüdiger, VF EITO wrote:
> > -----Original Message-----
> > From: Joe Orton
> >
> > The way I would expect it to work would be by passing f->next in to
> > the store_body callback, it looks doomed to eat RAM as currently
> > designed. mod_disk_cache's store_body implementation can then do:
> >
> > 1. read bucket(s) from brigade, appending to some temp brigade
> > 2. write bucket(s) in temp brigade to cache file
> > 3. pass temp brigade on to f->next
> > 4. clear temp brigade to ensure memory is released
> > 5. goto 1
>
> Yes, this was also my idea, but I would like to avoid this, because:
>
> 1. This is an API change which might be hard to backport.
> 2. I do not really like the close tie between the storage provider
>    and the filter chain. It forces the provider to do things it
>    should not care about from my point of view.

At least this much could be solved I suppose by passing in a callback of type apr_brigade_flush which does the pass to f->next; the storage provider could remain filter-agnostic then. No idea about your other issues, sorry.

joe
Re: Possible new cache architecture
On 5/3/06, Graham Leggett <[EMAIL PROTECTED]> wrote:
> Gonzalo Arana wrote:
> > again, I am in the dark: why would cached request headers need to be
> > replaced or edited in the same entity?
>
> It's a requirement of the HTTP/1.1 spec.

Non-modified response headers to conditional requests need to update cached response headers. We should try to avoid 'dialog' with the cache backend.

> The catch is when the server sent "304 Not Modified" - you need to
> update your cache to say "yep, my cached entry is still fresh", ie
> update the headers, without touching the body, which hasn't changed.

I see the light now :). Having a single cache_admin proc/thread would make this easier, since any operation can be presented as atomic, while it may require more than a single syscall (I know, the goal is to avoid full entity duplication). Anyway, I guess a good policy is to have 'editable' content as binary data (i.e., no variable length). Perhaps this is not possible anyway :(. Of course, to avoid a 'dialog' between the httpd process and cache_admin, both cache_admin and httpd must be smart enough.

> > That's why I suggested a dedicated process/thread for cache
> > administration, which is not a good idea if too many lookups are
> > issued to this process on each request received.
>
> I think in the long run, a dedicated process is the way to go.

+1 :).

Regards, -- Gonzalo A. Arana
Re: Possible new cache architecture
On 05/03/2006 10:46 PM, Graham Leggett wrote:
> mod_cache definitely needs cache admin, currently it's implemented as an
> external program that is called via cron, which doesn't help if you're
> on a box without cron. Cache cleaning can be done either when a

Not completely true. According to the documentation you can start it as a daemon (-d, http://httpd.apache.org/docs/2.2/programs/htcacheclean.html#options) that runs periodically. Of course this daemon has to be started and configured separately from httpd, so it may not be the final solution.

Regards

Rüdiger
Re: Possible new cache architecture
Gonzalo Arana wrote:
> again, I am in the dark: why would cached request headers need to be
> replaced or edited in the same entity?

It's a requirement of the HTTP/1.1 spec.

HTTP requests can be conditional; in other words a browser (or a proxy) can ask a server "give me this URL, but only if it has changed from my cached copy".

If the server thinks that the file has changed (or Cache-Control: no-cache was specified), then the server will send a full response back, headers + body, and the browser/proxy replaces its cached copy with the new headers + body.

If the server thinks that the file is the same, ie it didn't change, the server sends back the magic code "304 Not Modified", and just the headers - without any body. These new headers must replace the existing headers in the browser/proxy's cached entry, making the cached entry "fresh" again. And here lies the problem.

Doing the request this way means you don't have to ask the backend "is my cached copy still fresh?", get an answer back "No", and then send a second request saying "ok then, give me the new data" - you can implement caching in one request. The catch is when the server sent "304 Not Modified" - you need to update your cache to say "yep, my cached entry is still fresh", ie update the headers, without touching the body, which hasn't changed.

> That's why I suggested a dedicated process/thread for cache
> administration, which is not a good idea if too many lookups are
> issued to this process on each request received.

mod_cache definitely needs cache admin; currently it's implemented as an external program that is called via cron, which doesn't help if you're on a box without cron. Cache cleaning can be done either when a connection is complete in the existing process (which may be simpler to implement, but it runs after every connection), or it can be done as you suggest, where a dedicated thread/process handles this independently.

I think in the long run, a dedicated process is the way to go.
Regards, Graham
Re: Possible new cache architecture
Roy T. Fielding wrote:
> That is a heck of a lot easier than convincing everyone to dump the
> current code based on an untested theory.

I think the idea may be a lot more tested than you think. Most things I "suggest" have had an incubation period somewhere... I'm fine with not screwing with current mod_cache. I just think it should be either: renamed or made generic. We may or may not need a generic mod_backend_cache. I have posted a "pseudo-implementation" that got lost in the latest thread bloat. I can repost if anyone is interested.

-- Brian Akins Lead Systems Engineer CNN Internet Technologies
Re: Possible new cache architecture
Thanks for bringing me to the light.

On 5/3/06, Graham Leggett <[EMAIL PROTECTED]> wrote:
> Gonzalo Arana wrote:
> > Excuse my ignorance in this matter, but about the 'cache sub-key'
> > issue, why not just use a generic cache (with some expiration model
> > -LRU, perhaps-) with a 'smart' comparison function?
>
> So far one of the best suggestions was from the patch posted recently,
> where the headers and body were in the same file, but where the headers
> were given "breathing room" before the cache body, so that the headers
> can be replaced (within reasonable limits). What this means is that each
> key/data entry is now a single file again (like in 1.3), which is much
> easier to clean up atomically.
>
> The problem still remains that an existing cache file's headers must be
> editable, without doing expensive operations like copying, and this

again, I am in the dark: why would cached request headers need to be replaced or edited in the same entity?

> editing must be atomic (no use one thread/process trying to serve
> content from the cache and halfway through, another thread tries to
> update the headers). This will require some form of locking, which may
> be too much of a performance drag, thus blowing the back-to-one-file
> idea out of the water.

This makes sense, but I still do not understand the origin of the problem (in-place header replacement).

> Problems with cache expiry though are a real problem that mod_cache
> suffers from now, and need to be fixed.

That's why I suggested a dedicated process/thread for cache administration, which is not a good idea if too many lookups are issued to this process on each request received.

Regards, -- Gonzalo A. Arana
Re: Possible new cache architecture
Roy T. Fielding wrote:
> For the record, Graham's statements were entirely correct, Brian's
> suggested architecture would slow the HTTP cache,

No. It would simplify the existing implementation. The existing implementation, as Graham has noted, is not "fully functional." Graham argues - and I'm still mulling it over - that a generic cache architecture would get in the way of making a fully functional http cache.

-- Brian Akins Lead Systems Engineer CNN Internet Technologies
Re: Possible new cache architecture
Gonzalo Arana wrote:
> Excuse my ignorance in this matter, but about the 'cache sub-key'
> issue, why not just use a generic cache (with some expiration model
> -LRU, perhaps-) with a 'smart' comparison function?

So far one of the best suggestions was from the patch posted recently, where the headers and body were in the same file, but where the headers were given "breathing room" before the cache body, so that the headers can be replaced (within reasonable limits). What this means is that each key/data entry is now a single file again (like in 1.3), which is much easier to clean up atomically.

The problem still remains that an existing cache file's headers must be editable, without doing expensive operations like copying, and this editing must be atomic (no use one thread/process trying to serve content from the cache and halfway through, another thread tries to update the headers). This will require some form of locking, which may be too much of a performance drag, thus blowing the back-to-one-file idea out of the water.

Problems with cache expiry though are a real problem that mod_cache suffers from now, and need to be fixed.

Regards, Graham
Re: Possible new cache architecture
On Wed, 3 May 2006 11:39:02 -0700 "Roy T. Fielding" <[EMAIL PROTECTED]> wrote:
> On May 3, 2006, at 5:56 AM, Davi Arnaut wrote:
> > On Wed, 3 May 2006 14:31:06 +0200 (SAST)
> > "Graham Leggett" <[EMAIL PROTECTED]> wrote:
> >> On Wed, May 3, 2006 1:26 am, Davi Arnaut said:
> >>>> Then you will end up with code that does not meet the requirements
> >>>> of HTTP, and you will have wasted your time.
> >>>
> >>> Yeah, right! How ? Hey, you are using the Monty Python argument
> >>> style. Can you point to even one requirement of HTTP that
> >>> my_cache_provider won't meet ?
> >>
> >> Yes. Atomic insertions and deletions, the ability to update headers
> >> independently of body, etc etc, just go back and read the thread.
> >
> > I can't argue with a zombie, you keep repeating the same
> > misunderstandings.
> >
> >> Seriously, please move this off list to keep the noise out of
> >> people's inboxes.
> >
> > Fine, I give up.
>
> For the record, Graham's statements were entirely correct, Brian's
> suggested architecture would slow the HTTP cache, and your responses
> have been amazingly childish for someone who has earned zero
> credibility on this list.

Fine, I do have zero credibility.

> I suggest you stop defending a half-baked design theory and just go
> ahead and implement something as a patch. If it works, that's great.
> If it slows the HTTP cache, I will veto it myself.

I'm already doing this.

> There is, of course, no reason why the HTTP cache has to use some new
> middle-layer back-end cache, so maybe you could just stop arguing
> about vaporware and simply implement a single mod_backend_cache that
> doesn't try to be all things to all people.
>
> Implement it and then convince people on the basis of measurements.
> That is a heck of a lot easier than convincing everyone to dump the
> current code based on an untested theory.

I just wanted to get comments (the original idea wasn't mine). It wasn't my intention to flame anyone, I'm not mad or anything. I was just stating my opinion. I may be wrong, but I don't give up easy. :)

-- Davi Arnaut
Re: Possible new cache architecture
On May 3, 2006, at 5:56 AM, Davi Arnaut wrote:
> On Wed, 3 May 2006 14:31:06 +0200 (SAST)
> "Graham Leggett" <[EMAIL PROTECTED]> wrote:
>> On Wed, May 3, 2006 1:26 am, Davi Arnaut said:
>>>> Then you will end up with code that does not meet the requirements
>>>> of HTTP, and you will have wasted your time.
>>>
>>> Yeah, right! How ? Hey, you are using the Monty Python argument
>>> style. Can you point to even one requirement of HTTP that
>>> my_cache_provider won't meet ?
>>
>> Yes. Atomic insertions and deletions, the ability to update headers
>> independently of body, etc etc, just go back and read the thread.
>
> I can't argue with a zombie, you keep repeating the same
> misunderstandings.
>
>> Seriously, please move this off list to keep the noise out of
>> people's inboxes.
>
> Fine, I give up.

For the record, Graham's statements were entirely correct, Brian's suggested architecture would slow the HTTP cache, and your responses have been amazingly childish for someone who has earned zero credibility on this list.

I suggest you stop defending a half-baked design theory and just go ahead and implement something as a patch. If it works, that's great. If it slows the HTTP cache, I will veto it myself.

There is, of course, no reason why the HTTP cache has to use some new middle-layer back-end cache, so maybe you could just stop arguing about vaporware and simply implement a single mod_backend_cache that doesn't try to be all things to all people.

Implement it and then convince people on the basis of measurements. That is a heck of a lot easier than convincing everyone to dump the current code based on an untested theory.

Roy
Re: Possible new cache architecture
Excuse my ignorance in this matter, but about the 'cache sub-key' issue, why not just use a generic cache (with some expiration model -LRU, perhaps-) with a 'smart' comparison function? We could use as key full request headers (perhaps somewhat parsed), and as a comparison function a clever enough code to handle Vary, entity aging and so on. Best regards, -- Gonzalo A. Arana
Re: Possible new cache architecture
Brian Akins wrote:
> Does this discussion belong off-list? I would think this is the type
> of thing we need to discuss on this list.

The technical discussion belongs on the list, flames not.

> Is there any consensus as to how to move forward? Do we just leave it
> as it is currently?

There is a patch on the table, let's review it.

Regards, Graham
Re: Possible new cache architecture
William A. Rowe, Jr. wrote:
> 1. This is a development list. If you don't want development
> discussions, don't subscribe.

I was referring to the flamebait; development discussions would obviously remain on the list.

Regards, Graham
Re: Possible new cache architecture
Brian Akins wrote:
> > Moving towards and keeping with the above goals is a far higher
> > priority than simplifying the generic backend cache interface.
>
> This response was a perfect summation of why we do *not* run the stock
> mod_cache here...

Having the source means you can customise and improve the code to better meet your needs, and in your case your modifications work for you, and your organisation has the resources to commission and maintain those modifications. The trouble is, in order to be accepted into httpd, your modifications have to work for everyone else as well.

Apparently, for example, the problem of trying to handle subkeys under a main key "is mod_http_cache's problem". Ok, so mod_http_cache now has to implement locking mechanisms to try and somehow turn the elegant (but overly simplistic) mod_cache into a cache that is practically useful. In the process we slow the cache down. The whole point of the cache is to speed things up. Suddenly, we lose the whole point of the exercise.

Regards, Graham
Re: Possible new cache architecture
Graham Leggett wrote:
> Seriously, please move this off list to keep the noise out of people's
> inboxes.

1. This is a development list. If you don't want development discussions, don't subscribe.

Bill
Re: Possible new cache architecture
Graham Leggett wrote:
> Seriously, please move this off list to keep the noise out of people's
> inboxes.

Does this discussion belong off-list? I would think this is the type of thing we need to discuss on this list. Is there any consensus as to how to move forward? Do we just leave it as it is currently?

-- Brian Akins Lead Systems Engineer CNN Internet Technologies
Re: Possible new cache architecture
Graham Leggett wrote:
> Moving towards and keeping with the above goals is a far higher
> priority than simplifying the generic backend cache interface.

This response was a perfect summation of why we do *not* run the stock mod_cache here...

-- Brian Akins Lead Systems Engineer CNN Internet Technologies
Re: Possible new cache architecture
On Wed, 3 May 2006 14:31:06 +0200 (SAST) "Graham Leggett" <[EMAIL PROTECTED]> wrote:
> On Wed, May 3, 2006 1:26 am, Davi Arnaut said:
>>> Then you will end up with code that does not meet the requirements
>>> of HTTP, and you will have wasted your time.
>>
>> Yeah, right! How ? Hey, you are using the Monty Python argument
>> style. Can you point to even one requirement of HTTP that
>> my_cache_provider won't meet ?
>
> Yes. Atomic insertions and deletions, the ability to update headers
> independently of body, etc etc, just go back and read the thread.

I can't argue with a zombie, you keep repeating the same misunderstandings.

> Seriously, please move this off list to keep the noise out of people's
> inboxes.

Fine, I give up.

-- Davi Arnaut
Re: Possible new cache architecture
On Wed, May 3, 2006 1:26 am, Davi Arnaut said:
>> Then you will end up with code that does not meet the requirements of
>> HTTP, and you will have wasted your time.
>
> Yeah, right! How ? Hey, you are using the Monty Python argument style.
> Can you point to even one requirement of HTTP that my_cache_provider
> won't meet ?

Yes. Atomic insertions and deletions, the ability to update headers independently of body, etc etc, just go back and read the thread.

Seriously, please move this off list to keep the noise out of people's inboxes.

Regards, Graham
Re: Possible new cache architecture
> -----Original Message-----
> From: Joe Orton
>
> The way I would expect it to work would be by passing f->next in to
> the store_body callback, it looks doomed to eat RAM as currently
> designed. mod_disk_cache's store_body implementation can then do:
>
> 1. read bucket(s) from brigade, appending to some temp brigade
> 2. write bucket(s) in temp brigade to cache file
> 3. pass temp brigade on to f->next
> 4. clear temp brigade to ensure memory is released
> 5. goto 1

Yes, this was also my idea, but I would like to avoid this, because:

1. This is an API change which might be hard to backport.
2. I do not really like the close tie between the storage provider and the filter chain. It forces the provider to do things it should not care about from my point of view.

Furthermore: what about mod_cache in this case? Do you want to skip ap_pass_brigade there, or do you want to clean up the original brigade inside store_body of mod_disk_cache and let mod_cache pass an empty brigade up the chain? If we decide to skip ap_pass_brigade inside mod_cache, all storage providers need to ensure that they pass the data up the chain, which seems like duplicated code to me and does not seem to belong to their core tasks. OTOH, doing this in mod_cache and only passing the small brigade to store_body of the provider has the drawback that mod_mem_cache wants to see the original file buckets in order to save the file descriptors of the files.

To be honest, currently I have no solution at hand that I really like, but I agree that this really needs to be changed.

Regards

Rüdiger
Re: Possible new cache architecture
On Tue, May 02, 2006 at 02:21:27PM +0200, Plüm, Rüdiger, VF EITO wrote:
> Another thing: I guess on systems with no mmap support the current
> implementation of mod_disk_cache will eat up a lot of memory if you
> cache a large local file, because it transforms the file bucket(s)
> into heap buckets in this case. Even if mmap is present I think that
> mod_disk_cache causes the file buckets to be transformed into many
> mmap buckets if the file is large. Thus we do not use sendfile in the
> case we cache the file.
> In the case that a brigade only contains file_buckets it might be
> possible to "copy" this brigade, send it up the chain and process the
> copy of the brigade for disk storage afterwards. Of course this opens
> a race if the file gets changed in between these operations.
> This approach does not work with socket or pipe buckets for obvious
> reasons. Even heap buckets seem to be a somewhat critical idea because
> of the added memory usage.

The way I would expect it to work would be by passing f->next in to the store_body callback, it looks doomed to eat RAM as currently designed. mod_disk_cache's store_body implementation can then do:

1. read bucket(s) from brigade, appending to some temp brigade
2. write bucket(s) in temp brigade to cache file
3. pass temp brigade on to f->next
4. clear temp brigade to ensure memory is released
5. goto 1

joe
Re: Possible new cache architecture
On Wed, 03 May 2006 01:09:03 +0200 Graham Leggett <[EMAIL PROTECTED]> wrote:
> Davi Arnaut wrote:
> > Graham, what I want is to be able to write a mod_cache backend
> > _without_ having to worry about HTTP.
>
> Then you will end up with code that does not meet the requirements of
> HTTP, and you will have wasted your time.

Yeah, right! How ? Hey, you are using the Monty Python argument style. Can you point to even one requirement of HTTP that my_cache_provider won't meet ?

> Please go through _all_ of the mod_cache architecture, and not just
> mod_disk_cache. Also read and understand HTTP/1.1 gateways and caches,
> and as you want to create a generic cache, read and understand
> mod_ldap, a module that will probably benefit from the availability of
> a generic cache. Then step back and see that mod_cache is a small part
> of a bigger picture. At this point you'll see that as nice as your
> idea of a simple generic cache interface is, it's not going to be the
> most elegant solution to the problem.

blah, blah.. you essentially said: "I don't want a simpler interface, I think the current mess is more elegant." I have shown you that I can even wrap your messy cache_provider hooks into a much simpler one, how can anything else be more elegant ?

-- Davi Arnaut
Re: Possible new cache architecture
Davi Arnaut wrote:
> Graham, what I want is to be able to write a mod_cache backend
> _without_ having to worry about HTTP.

Then you will end up with code that does not meet the requirements of HTTP, and you will have wasted your time.

Please go through _all_ of the mod_cache architecture, and not just mod_disk_cache. Also read and understand HTTP/1.1 gateways and caches, and as you want to create a generic cache, read and understand mod_ldap, a module that will probably benefit from the availability of a generic cache. Then step back and see that mod_cache is a small part of a bigger picture. At this point you'll see that as nice as your idea of a simple generic cache interface is, it's not going to be the most elegant solution to the problem.

Regards, Graham
Re: Possible new cache architecture
On Tue, 02 May 2006 23:31:13 +0200 Graham Leggett <[EMAIL PROTECTED]> wrote:
> Davi Arnaut wrote:
> >> The way HTTP caching works is a lot more complex than in your
> >> example, you haven't taken into account conditional HTTP requests.
> >
> > I've taken into account the actual mod_disk_cache code!
>
> mod_disk_cache doesn't contain any of the conditional HTTP request
> code, which is why you're not seeing it there.
>
> Please keep in mind that the existing mod_cache framework's goal is to
> be a fully HTTP/1.1 compliant, content generator neutral, efficient,
> error free and high performance cache.
>
> Moving towards and keeping with the above goals is a far higher
> priority than simplifying the generic backend cache interface.
>
> To sum up - the cache backend must fulfill the requirements of the
> cache frontend (generic or not), which in turn must fulfill the
> requirements of the users, who are browsers, web robot code, and
> humans. To try and prioritise this the other way round is putting the
> cart before the horse.

Graham, what I want is to be able to write a mod_cache backend _without_ having to worry about HTTP. _NOT_ to rewrite mod_disk/proxy/cache/whatever! You keep talking about HTTP this, HTTP that, I won't change the way it currently works. I just want to place a glue layer between the storage and the HTTP part. I could even wrap around your code (pseudocode; the fetch/store hooks take a key so the headers and body can live under separate keys):

    typedef struct {
        apr_status_t (*fetch) (cache_handle_t *h, const char *key,
                               apr_bucket_brigade *bb);
        apr_status_t (*store) (cache_handle_t *h, const char *key,
                               apr_bucket_brigade *bb);
        int (*remove) (const char *key);
    } my_cache_provider;

    typedef struct {
        const char *key_headers;
        const char *key_body;
    } my_cache_object;

    create_entity:
        my_cache_object *obj;
        obj->key_headers = hash_headers(request, whatever);
        obj->key_body = hash_body(request, whatever);

    open_entity:
        my_cache_object *obj;
        my_provider->fetch(h, obj->key_headers, header_brigade);
        /* if necessary, update obj->key_headers/body (vary..) */

    remove_url:
        my_provider->remove(obj->key_headers);
        my_provider->remove(obj->key_body);

    remove_entity:
        nop

    store_headers:
        my_cache_object *obj;
        /* if necessary, update obj->key_headers (vary..) */
        my_provider->store(h, obj->key_headers, header_brigade);

    store_body:
        my_cache_object *obj;
        my_provider->store(h, obj->key_body, body_brigade);

    recall_headers:
        my_cache_object *obj;
        my_provider->fetch(h, obj->key_headers, header_brigade);

    recall_body:
        my_cache_object *obj;
        my_provider->fetch(h, obj->key_body, body_brigade);

-- Davi Arnaut
Re: Possible new cache architecture
Davi Arnaut wrote:
> > The way HTTP caching works is a lot more complex than in your
> > example, you haven't taken into account conditional HTTP requests.
>
> I've taken into account the actual mod_disk_cache code!

mod_disk_cache doesn't contain any of the conditional HTTP request code, which is why you're not seeing it there.

Please keep in mind that the existing mod_cache framework's goal is to be a fully HTTP/1.1 compliant, content generator neutral, efficient, error free and high performance cache. Moving towards and keeping with the above goals is a far higher priority than simplifying the generic backend cache interface.

To sum up - the cache backend must fulfill the requirements of the cache frontend (generic or not), which in turn must fulfill the requirements of the users, who are browsers, web robot code, and humans. To try and prioritise this the other way round is putting the cart before the horse.

Regards, Graham
Re: Possible new cache architecture
On 5/2/06, Brian Akins <[EMAIL PROTECTED]> wrote:
> Gonzalo Arana wrote:
> > What problems have you seen with this approach? postfix uses this
> > architecture, for instance.
>
> Postfix implements SMTP, which is an asynchronous protocol.

And which problems might this approach bring?

> > Excuse my ignorance, what does "event mpm ... keep the balance very
> > good" mean?
>
> Not all your threads are tied up doing keepalives, for example.

Ah, I see (I was unfamiliar with the event MPM, sorry).

-- Gonzalo A. Arana
Re: Possible new cache architecture
Gonzalo Arana wrote:
> What problems have you seen with this approach? postfix uses this
> architecture, for instance.

Postfix implements SMTP, which is an asynchronous protocol.

> Excuse my ignorance, what does "event mpm ... keep the balance very
> good" mean?

Not all your threads are tied up doing keepalives, for example.

-- Brian Akins Lead Systems Engineer CNN Internet Technologies
Re: Possible new cache architecture
On 5/2/06, Brian Akins <[EMAIL PROTECTED]> wrote:
> Gonzalo Arana wrote:
> > A more suitable design for this task I think would be to make each
> > process have a special purpose: cache maintenance (purging expired
> > entries, purging entries to make room for new ones, creating new
> > entries, and so on), request processing (network/disk I/O, content
> > filtering, and so on), or whatever.
>
> In my experience, this always sounds good in theory, but just doesn't
> ever work in the real world. The event mpm is "sorta" a step in that
> direction, but seems to keep the balance pretty good.

What problems have you seen with this approach? postfix uses this architecture, for instance.

Excuse my ignorance, what does "event mpm ... keep the balance very good" mean?

-- Gonzalo A. Arana
Re: Possible new cache architecture
Gonzalo Arana wrote:
> A more suitable design for this task I think would be to make each
> process have a special purpose: cache maintenance (purging expired
> entries, purging entries to make room for new ones, creating new
> entries, and so on), request processing (network/disk I/O, content
> filtering, and so on), or whatever.

In my experience, this always sounds good in theory, but just doesn't ever work in the real world. The event mpm is "sorta" a step in that direction, but seems to keep the balance pretty good.

-- Brian Akins Lead Systems Engineer CNN Internet Technologies
Re: Possible new cache architecture
On Tue, 2 May 2006 17:22:00 +0200 (SAST) "Graham Leggett" <[EMAIL PROTECTED]> wrote:
> On Tue, May 2, 2006 7:06 pm, Davi Arnaut said:
> > There is not such scenario. I will simulate a request using the
> > disk_cache format:
>
> The way HTTP caching works is a lot more complex than in your example,
> you haven't taken into account conditional HTTP requests.

I've taken into account the actual mod_disk_cache code! Let me try to translate your typical scenario.

> A typical conditional scenario goes like this:
>
> - Browser asks for URL from httpd.

Same.

> - Mod_cache has a cached copy by looking up the headers BUT - it's
>   stale. mod_cache converts the browser's original request to a
>   conditional request by adding the header If-None-Match.

sed s/mod_cache/mod_http_cache

> - The backend server answers "no worries, what you have is still
>   fresh" by sending a "304 Not Modified".

sed s/mod_cache/mod_http_cache

> - mod_cache takes the headers from the 304, and replaces the headers
>   on the cached entry, in the process making the entry "fresh" again.

sed s/mod_cache/mod_http_cache

> - mod_cache hands the cached data back to the browser.

sed s/mod_cache/mod_http_cache

> Read http://www.ietf.org/rfc/rfc2616.txt section 13 (mainly) to see in
> detail how this works.

Again: we do not want to change the semantics, we only want to separate the HTTP specific part from the storage specific part. The HTTP specific parts of mod_disk_cache, mod_mem_cache and mod_cache are moved to a mod_http_cache, while retaining the storage specific parts. And mod_cache is the one who will combine those two layers. Again: it's the same thing as if we were replacing all mod_disk_cache file operations with hash table operations.

-- Davi Arnaut
Re: Possible new cache architecture
Seems to me that the thundering herd / performance degradation is inherent to the Apache design: all threads/processes are exact clones. A more suitable design for this task I think would be to make each process have a special purpose: cache maintenance (purging expired entries, purging entries to make room for new ones, creating new entries, and so on), request processing (network/disk I/O, content filtering, and so on), or whatever. This way, performance degradation caused by the cache mutex can be minimized. Request processors would only get queued/locked when querying the cache, which can be made a single operation if the cache is smart enough to figure out the right response from the original request, right? Regards, -- Gonzalo A. Arana
Re: Possible new cache architecture
On Tue, May 2, 2006 5:50 pm, Brian Akins said: > This seems more like a wish list. I just want to separate out the cache > and protocol stuff. HTTP compliance isn't a wish, it's a requirement. A patch that breaks compliance will end up being -1'ed. The thundering herd issues are also a requirement, as provision was made for it in the v2.0 design. The cache must deliver what the HTTP cache requires (which in turn delivers what users require), not the other way around. Separating the cache and the protocol has advantages, but it also has the disadvantage that fixing bugs like thundering herd may require interface changes, forcing people to have to wait for major version number changes before they see their problems fixed. In this scenario, the separation of cache and protocol is (very) nice to have, but not so nice that end users are disadvantaged. >> - The ability to amend a subkey (the headers) on an entry that is >> already >> cached. > > mod_http_cache should handle. to new mod_cache, it's just another > key/value. How does mod_http_cache do this without the need for locking (and thus performance degradation)? How does mod_cache guarantee that it won't expire the body without atomically expiring the headers with it? >> - The ability to invalidate a particular cached variant (ie headers + >> data) in one atomic step, without affecting threads that hold that >> cached >> entry open at the time. > > mod_http_cache should handle. Entry invalidation is definitely mod_cache's problem, it falls under cache size maintenance and expiry. Remember that mod_http_cache only runs when requests are present, entry invalidation has to happen whether there are requests present or not, via a separate thread, separate process, cron job, whatever. >> - The ability to read from a cached object that is still being written >> to. > > Nice to have. out of scope for what I am proposing. new mod_cache > should be the place to implement this if underlying provider supports it. 
It's not nice to have, no. It's a real problem that has inspired people to log bugs, and very recently, for one person to submit a patch. Regards, Graham --
Re: Possible new cache architecture
Graham Leggett wrote: To be HTTP compliant, and to solve thundering herd, we need the following from a cache: This seems more like a wish list. I just want to separate out the cache and protocol stuff. - The ability to amend a subkey (the headers) on an entry that is already cached. mod_http_cache should handle. to new mod_cache, it's just another key/value. - The ability to invalidate a particular cached variant (ie headers + data) in one atomic step, without affecting threads that hold that cached entry open at the time. mod_http_cache should handle. Keep a list of variants cached - this should use a provider interface as well. mod_cache would handle whatever locking, ref counting, etc, needs to be done, if any. - The ability to read from a cached object that is still being written to. Nice to have. out of scope for what I am proposing. new mod_cache should be the place to implement this if underlying provider supports it. - A guarantee that the result of a broken write (segfault, timeout, connection reset by peer, whatever) will not result in a broken cached entry (ie that the cached entry will eventually be invalidated, and all threads trying to read from it will eventually get an error). agreed. new mod_cache should handle this. Certainly separate the protocol from the physical cache, just make sure the physical cache delivers the shopping list above :) Most seem like protocol specific stuff. -- Brian Akins Lead Systems Engineer CNN Internet Technologies
Re: Possible new cache architecture
On Tue, May 2, 2006 5:27 pm, Brian Akins said: > Still not sure how this is different from what we are proposing. we > really want to separate protocol from cache stuff. If we have a > "revalidate" for the generic cache it should address all your concerns. > ??? To be HTTP compliant, and to solve thundering herd, we need the following from a cache: - The ability to amend a subkey (the headers) on an entry that is already cached. - The ability to invalidate a particular cached variant (ie headers + data) in one atomic step, without affecting threads that hold that cached entry open at the time. - The ability to read from a cached object that is still being written to. - A guarantee that the result of a broken write (segfault, timeout, connection reset by peer, whatever) will not result in a broken cached entry (ie that the cached entry will eventually be invalidated, and all threads trying to read from it will eventually get an error). Certainly separate the protocol from the physical cache, just make sure the physical cache delivers the shopping list above :) Regards, Graham --
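Graham's four requirements can be expressed as a small provider interface. A sketch of one way to satisfy them in Python (everything here is hypothetical, not mod_cache's actual API): entries are immutable snapshots, so a reader that already holds one is unaffected when the variant's headers are amended or the variant is atomically invalidated.

```python
# Sketch of the "shopping list" as an interface: amend headers on an
# existing entry, invalidate a variant atomically, and let readers that
# hold an open entry keep reading it.  All names are invented.

class Entry:
    """Immutable snapshot of one cached variant (headers + body)."""
    def __init__(self, headers, body):
        self.headers = headers
        self.body = body


class VariantCache:
    def __init__(self):
        self._entries = {}  # variant key -> current Entry snapshot

    def store(self, key, headers, body):
        self._entries[key] = Entry(headers, body)

    def open(self, key):
        # Readers get a reference to the current snapshot.
        return self._entries.get(key)

    def amend_headers(self, key, headers):
        old = self._entries.get(key)
        if old is not None:
            # New snapshot: headers change, body is shared; the swap is
            # a single assignment, so readers never see a half-update.
            self._entries[key] = Entry(headers, old.body)

    def invalidate(self, key):
        # Removes the variant in one step; readers holding the old
        # snapshot keep it until they drop their reference.
        self._entries.pop(key, None)


cache = VariantCache()
cache.store("/foo#gzip", {"Etag": '"v1"'}, b"body-v1")
reader = cache.open("/foo#gzip")          # a request in flight
cache.amend_headers("/foo#gzip", {"Etag": '"v1"', "Age": "5"})
cache.invalidate("/foo#gzip")             # reader is unaffected
```

This relies on in-process reference semantics; an on-disk provider would need something like atomic rename to get the same guarantees, and the broken-write requirement would additionally need the entry to stay invisible until the write completes.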
Re: Possible new cache architecture
Graham Leggett wrote: The way HTTP caching works is a lot more complex than in your example, you haven't taken into account conditional HTTP requests. ... Still not sure how this is different from what we are proposing. we really want to separate protocol from cache stuff. If we have a "revalidate" for the generic cache it should address all your concerns. ??? -- Brian Akins Lead Systems Engineer CNN Internet Technologies
Re: Possible new cache architecture
On Tue, May 2, 2006 7:06 pm, Davi Arnaut said: > There is not such scenario. I will simulate a request using the disk_cache > format: The way HTTP caching works is a lot more complex than in your example, you haven't taken into account conditional HTTP requests. A typical conditional scenario goes like this: - Browser asks for URL from httpd. - Mod_cache has a cached copy by looking up the headers BUT - it's stale. mod_cache converts the browser's original request to a conditional request by adding the header If-None-Match. - The backend server answers "no worries, what you have is still fresh" by sending a "304 Not Modified". - mod_cache takes the headers from the 304, and replaces the headers on the cached entry, in the process making the entry "fresh" again. - mod_cache hands the cached data back to the browser. Read http://www.ietf.org/rfc/rfc2616.txt section 13 (mainly) to see in detail how this works. Regards, Graham --
Re: Possible new cache architecture
On Tue, 2 May 2006 15:40:30 +0200 (SAST) "Graham Leggett" <[EMAIL PROTECTED]> wrote: > On Tue, May 2, 2006 3:24 pm, Brian Akins said: > > >> - the cache says "cool, will send my copy upstream. Oops, where has my > >> data gone?". > > > So, the cache says, okay must get content the old fashioned way (proxy, > > filesystem, magic fairies, etc.). > > > > Where's the issue? > > To rephrase it, a whole lot of extra code, which has to be written and > debugged, has to say "oops, ok sorry backend about the If-None-Match, I > thought I had it cached but I actually didn't, please can I have the full > file?". Then the backend gives you a response with different headers to > those you already delivered to the frontend. Oops. There is no such scenario. I will simulate a request using the disk_cache format: . Incoming client requests URI /foo/bar/baz . Request goes through mod_http_cache, Generate off of URI . mod_http_cache asks mod_cache for the data associated with key: .header . No data: . Fetch from upstream . Data Fetched: . If format #1 (Contains a list of Vary Headers): . Use each header name (from .header) with our request values (headers_in) to regenerate using HeaderName+ HeaderValue+URI . Ask mod_cache for data with key: .header . No data: . Fetch from upstream . Data: . Serve data to client . If format #2 . Serve data to client Where is the difference ? > Keeping the code as simple as possible will keep your code bug free, which > means less time debugging for you, and less time for end users trying to > figure out what the cause is of their weird symptoms. We are trying to keep it as simple as possible by separating the storage layer from the protocol layer. -- Davi Arnaut
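The two-step Vary lookup Davi walks through can be sketched as follows. This is illustrative only; the key-derivation scheme and the format #1 / format #2 entry shapes are invented to mirror the steps above, not mod_disk_cache's real on-disk format:

```python
# Sketch of the two-step Vary lookup: a format #1 entry lists the Vary
# header names; the variant key is regenerated from the request's values
# for those headers, and the format #2 entry holds the actual response.

def vary_key(uri, vary_names, request_headers):
    # Combine the request's values for the varied headers with the URI.
    parts = [request_headers.get(name, "") for name in sorted(vary_names)]
    return uri + "#" + "|".join(parts)

def lookup(store, uri, request_headers):
    entry = store.get(uri + ".header")
    if entry is None:
        return None                      # no data: fetch from upstream
    if "vary" in entry:                  # format #1: list of Vary headers
        key = vary_key(uri, entry["vary"], request_headers)
        entry = store.get(key + ".header")
        if entry is None:
            return None                  # this variant not cached yet
    return entry["response"]             # format #2: serve to client

store = {
    "/foo.header": {"vary": ["Accept-Encoding"]},
    "/foo#gzip.header": {"response": b"compressed body"},
}
```

Either lookup can miss, and the answer is the same in both cases: fetch from upstream. No revalidation race arises at this layer because the storage layer never decides freshness.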
Re: Possible new cache architecture
> -Ursprüngliche Nachricht- > Von: Niklas Edmundsson > > Correct. When caching a 4.3GB file on a 32bit arch it gets so > bad that > mmap eats all your address space and the thing segfaults. I initially > thought it was eating memory, but that's only if you have mmap > disabled. Ahh, good point. So I guess it's necessary to remove the mmap buckets from the brigade in the loop. Regards Rüdiger
Re: Possible new cache architecture
On Tue, May 2, 2006 3:24 pm, Brian Akins said: >> - the cache says "cool, will send my copy upstream. Oops, where has my >> data gone?". > So, the cache says, okay must get content the old fashioned way (proxy, > filesystem, magic fairies, etc.). > > Where's the issue? To rephrase it, a whole lot of extra code, which has to be written and debugged, has to say "oops, ok sorry backend about the If-None-Match, I thought I had it cached but I actually didn't, please can I have the full file?". Then the backend gives you a response with different headers to those you already delivered to the frontend. Oops. Keeping the code as simple as possible will keep your code bug free, which means less time debugging for you, and less time for end users trying to figure out what the cause is of their weird symptoms. Regards, Graham --
Re: Possible new cache architecture
On Tue, 2 May 2006 11:22:31 +0200 (MEST) Niklas Edmundsson <[EMAIL PROTECTED]> wrote: > On Mon, 1 May 2006, Davi Arnaut wrote: > > > More important, if we stick with the key/data concept it's possible to > > implement the header/body relationship under single or multiple keys. > > I've been hacking on mod_disk_cache to make it: > * Only store one set of data when one uncached item is accessed >simultaneously (currently all requests cache the file and the last >finished cache process "wins"). > * Don't wait until the whole item is cached, reply while caching >(currently it stalls). > * Don't block the requesting thread when requesting a large uncached >item, cache in the background and reply while caching (currently it >stalls). > > This is mostly aimed at serving huge static files from a slow disk > backend (typically an NFS export from a server holding all the disk), > such as http://ftp.acc.umu.se/ and http://ftp.heanet.ie/ . > > Doing this with the current mod_disk_cache disk layout was not > possible, doing the above without unnecessary locking means: > > * More or less atomic operations, so caching headers and data in >separate files gets very messy if you want to keep consistency. > * You can't use tempfiles since you want to be able to figure out >where the data is to be able to reply while caching. > * You want to know the size of the data in order to tell when you're >done (ie the current size of a file isn't necessarily the real size >of the body since it might be caching while we're reading it). > > In the light of our experiences, I really think that you want to have > a concept that allows you to keep the bond between header and data. > Yes, you can patch up a missing bond by requiring locking and stuff, but > I really prefer not having to lock cache files when doing read access. > When it comes to "make the common case fast" a lockless design is very > much preferred. 
I will repeat once again: there is no locking involved, unless your format of storing the header/data is really wrong. _The data format is up to the module using it_, while the storage backend is a completely different issue. > However, if all those issues are sorted out in the layer above disk > cache then the above observations becomes more or less moot. Yes, that's the point. > In any case the patch is more or less finished, independent testing > and auditing haven't been done yet but I can submit a preliminary > jumbo-patch if people are interested in having a look at it now. -- Davi Arnaut
Re: Possible new cache architecture
On Tue, 2 May 2006, Graham Leggett wrote: If it's: * Link to latest GNOME Live CD gets published on Slashdot. * A gazillion users click the link to download it. * mod_disk_cache starts a new instance of caching the file for each request, until someone has completed caching the file. Then this is the thundering herd problem :) OK :) Either a site is slashdotted (as in your case), or a cached entry expires, and suddenly the backend gets nailed until at least one request "wins", then we are back to normal serving from the cache. In your case, the "backend" is the disk, while in the bug from 1998, the backend was another webserver. Either way, same problem. OK. Then this patch solves the problem regardless of whether it's a static file or dynamically generated content since it only allows one instance to cache the file (OK, there's a small hole so there can be multiple instances but it's way smaller than now), all other instances deliver data as the caching process is writing it. Additionally, if it's a static file that's allowed to be cached in the background it solves: * Reduce chance of user getting bored since the data is delivered while being cached. * The user got bored and closed the connection so the painfully cached file gets deleted. Hmmm - thinking about this we try to cache the brigade (all X GB of it) first, then we try to write it to the network, thus the delay. Does your patch solve all of these already, or are they planned? It solves everything I've mentioned. The solution is probably not perfect for the not-static-file case since it falls back to the old behaviour of caching the whole file, but it should be a lot better than the current mod_disk_cache since the rest of the threads get reply-while-caching. There are issues here with the fact that the result is discarded if the connection is aborted, but I'm not familiar enough with apache filter internals to state that you can keep the result even though the connection is aborted. 
/Nikke -- -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se | [EMAIL PROTECTED] --- Anything is edible if it's chopped finely enough =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Re: Possible new cache architecture
Graham Leggett wrote: - the cache says "cool, will send my copy upstream. Oops, where has my data gone?". So, the cache says, okay must get content the old fashioned way (proxy, filesystem, magic fairies, etc.). Where's the issue? -- Brian Akins Lead Systems Engineer CNN Internet Technologies
Re: Possible new cache architecture
On Tue, 2 May 2006, Plüm, Rüdiger, VF EITO wrote: Another thing: I guess on systems with no mmap support the current implementation of mod_disk_cache will eat up a lot of memory if you cache a large local file, because it transforms the file bucket(s) into heap buckets in this case. Even if mmap is present I think that mod_disk_cache causes the file buckets to be transformed into many mmap buckets if the file is large. Thus we do not use sendfile in the case we cache the file. Correct. When caching a 4.3GB file on a 32bit arch it gets so bad that mmap eats all your address space and the thing segfaults. I initially thought it was eating memory, but that's only if you have mmap disabled. In the case that a brigade only contains file_buckets it might be possible to "copy" this brigade, send it up the chain and process the copy of the brigade for disk storage afterwards. Of course this opens a race if the file gets changed in between these operations. This approach does not work with socket or pipe buckets for obvious reasons. Even heap buckets seem to be a somewhat critical idea because of the added memory usage. I did the somewhat naive approach of only doing background caching when the buckets refer to a single sequential file. It's not perfect, but it solves the main case where you get a huge amount of data to store ... /Nikke - stumbled upon more than one bug when digging into mod_disk_cache -- -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se | [EMAIL PROTECTED] --- Anything is edible if it's chopped finely enough =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Re: Possible new cache architecture
On Tue, May 2, 2006 2:18 pm, Niklas Edmundsson said: > Exactly what is the thundering herd problem? I can guess the general > problem, but without a more precise definition I can't really say if > my patch fixes it or not. > > If it's: > * Link to latest GNOME Live CD gets published on Slashdot. > * A gazillion users click the link to download it. > * mod_disk_cache starts a new instance of caching the file for each >request, until someone has completed caching the file. Then this is the thundering herd problem :) Either a site is slashdotted (as in your case), or a cached entry expires, and suddenly the backend gets nailed until at least one request "wins", then we are back to normal serving from the cache. In your case, the "backend" is the disk, while in the bug from 1998, the backend was another webserver. Either way, same problem. > Then this patch solves the problem regardless of whether it's a static > file or dynamically generated content since it only allows one > instance to cache the file (OK, there's a small hole so there can be > multiple instances but it's way smaller than now), all other > instances deliver data as the caching process is writing it. > Additionally, if it's a static file that's allowed to be cached in > the background it solves: > * Reduce chance of user getting bored since the data is delivered >while being cached. > * The user got bored and closed the connection so the painfully cached >file gets deleted. Hmmm - thinking about this we try to cache the brigade (all X GB of it) first, then we try to write it to the network, thus the delay. Does your patch solve all of these already, or are they planned? Regards, Graham --
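The "only one instance caches the file" behaviour Niklas describes amounts to electing a single writer per cache entry. One classic lockless way to do that election on disk is an exclusive create: whoever creates the marker file wins and caches; everyone else reads while it writes. A sketch under invented file naming (this is not mod_disk_cache's real layout):

```python
# Sketch of single-writer election to avoid the thundering herd: an
# O_CREAT|O_EXCL open succeeds for exactly one caller per key, so
# concurrent requests for an uncached entry cause only one backend
# fetch.  The ".caching" marker name is hypothetical.

import os
import tempfile

cache_dir = tempfile.mkdtemp()

def try_become_cacher(key):
    path = os.path.join(cache_dir, key + ".caching")
    try:
        # O_EXCL makes create-if-absent atomic: one winner, no lock.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True          # we cache the entry; others read as we write
    except FileExistsError:
        return False         # someone else is already caching it

# Five simultaneous requests for the same uncached entry:
winners = [try_become_cacher("gnome-livecd") for _ in range(5)]
```

The small hole Niklas mentions would correspond here to crash handling: a dead winner leaves a stale marker, so a real implementation needs a staleness check or timeout on the marker file.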
Re: Possible new cache architecture
> -Ursprüngliche Nachricht- > Von: Graham Leggett > > The reason it does not work currently is that a local file > > usually is > > delivered in one brigade with, depending on the size of the > file, one or > > more > > file buckets. > > Hmmm - ok, this makes sense. > > Something I've never checked for, do output filters support > asynchronous > writes? I don't think so. Of course this would be a nice feature. Maybe somehow possible with Colm's ideas. Another thing: I guess on systems with no mmap support the current implementation of mod_disk_cache will eat up a lot of memory if you cache a large local file, because it transforms the file bucket(s) into heap buckets in this case. Even if mmap is present I think that mod_disk_cache causes the file buckets to be transformed into many mmap buckets if the file is large. Thus we do not use sendfile in the case we cache the file. In the case that a brigade only contains file_buckets it might be possible to "copy" this brigade, send it up the chain and process the copy of the brigade for disk storage afterwards. Of course this opens a race if the file gets changed in between these operations. This approach does not work with socket or pipe buckets for obvious reasons. Even heap buckets seem to be a somewhat critical idea because of the added memory usage. Regards Rüdiger
Re: Possible new cache architecture
On Tue, 2 May 2006, Graham Leggett wrote: This is great, in doing this you've been solving a proxy bug that was first reported in 1998 :). This already works in the case you get the data from the proxy backend. It does not work for local files that get cached (the scenario Niklas uses the cache for). Ok then I have misunderstood - I was referring to the thundering herd problem. Exactly what is the thundering herd problem? I can guess the general problem, but without a more precise definition I can't really say if my patch fixes it or not. If it's: * Link to latest GNOME Live CD gets published on Slashdot. * A gazillion users click the link to download it. * mod_disk_cache starts a new instance of caching the file for each request, until someone has completed caching the file. Then this patch solves the problem regardless of whether it's a static file or dynamically generated content since it only allows one instance to cache the file (OK, there's a small hole so there can be multiple instances but it's way smaller than now), all other instances deliver data as the caching process is writing it. Additionally, if it's a static file that's allowed to be cached in the background it solves: * Reduce chance of user getting bored since the data is delivered while being cached. * The user got bored and closed the connection so the painfully cached file gets deleted. /Nikke -- -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se | [EMAIL PROTECTED] --- Illiterate? Write for information! =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Re: Possible new cache architecture
On Tue, May 2, 2006 12:16 pm, Plüm, Rüdiger, VF EITO said: >> This is great, in doing this you've been solving a proxy bug that was >> first reported in 1998 :). > > This already works in the case you get the data from the proxy backend. It > does > not work for local files that get cached (the scenario Niklas uses the > cache > for). Ok then I have misunderstood - I was referring to the thundering herd problem. > The reason it does not work currently is that a local file > usually is > delivered in one brigade with, depending on the size of the file, one or > more > file buckets. Hmmm - ok, this makes sense. Something I've never checked for, do output filters support asynchronous writes? If they did, this might solve this problem - the write request would return immediately, allowing the read from file and write to cached file to continue while the write to network blocked. Regards, Graham --
Re: Possible new cache architecture
> -Ursprüngliche Nachricht- > Von: Graham Leggett > > * Don't block the requesting thread when requesting a large uncached > >item, cache in the background and reply while caching > (currently it > >stalls). > > This is great, in doing this you've been solving a proxy bug that was > first reported in 1998 :). This already works in the case you get the data from the proxy backend. It does not work for local files that get cached (the scenario Niklas uses the cache for). The reason it does not work currently is that a local file usually is delivered in one brigade with, depending on the size of the file, one or more file buckets. For Niklas' purposes Colm's ideas regarding the use of the new Linux system calls tee and splice will come in handy (http://mail-archives.apache.org/mod_mbox/apr-dev/200604.mbox/[EMAIL PROTECTED]) as they should speed up such things. Regards Rüdiger
Re: Possible new cache architecture
On Tue, May 2, 2006 11:22 am, Niklas Edmundsson said: > I've been hacking on mod_disk_cache to make it: > * Only store one set of data when one uncached item is accessed >simultaneously (currently all requests cache the file and the last >finished cache process "wins"). > * Don't wait until the whole item is cached, reply while caching >(currently it stalls). > * Don't block the requesting thread when requesting a large uncached >item, cache in the background and reply while caching (currently it >stalls). This is great, in doing this you've been solving a proxy bug that was first reported in 1998 :). The only things to be careful of is for Cache-Control: no-cache and friends to be handled gracefully (the partially cached file should be marked as "delete-me" so that the current request creates a new cache file / no cache file. Existing running downloads should be unaffected by this.), and for backend failures (either a timeout or a premature socket close) to cause the cache entry to be invalidated and deleted. > * More or less atomic operations, so caching headers and data in >separate files gets very messy if you want to keep consistency. Keep in mind that HTTP/1.1 compliance requires that the headers be updatable without changing the body. > * You can't use tempfiles since you want to be able to figure out >where the data is to be able to reply while caching. > * You want to know the size of the data in order to tell when you're >done (ie the current size of a file isn't necessarily the real size >of the body since it might be caching while we're reading it). The cache already wants to know the size of the data so that it can decide whether it's prepared to try and cache the file in the first place, so in theory this should not be a problem. > In any case the patch is more or less finished, independent testing > and auditing haven't been done yet but I can submit a preliminary > jumbo-patch if people are interested in having a look at it now. 
Post it, people can take a look. Regards, Graham --
Re: Possible new cache architecture
On Mon, 1 May 2006, Davi Arnaut wrote: More important, if we stick with the key/data concept it's possible to implement the header/body relationship under single or multiple keys. I've been hacking on mod_disk_cache to make it: * Only store one set of data when one uncached item is accessed simultaneously (currently all requests cache the file and the last finished cache process "wins"). * Don't wait until the whole item is cached, reply while caching (currently it stalls). * Don't block the requesting thread when requesting a large uncached item, cache in the background and reply while caching (currently it stalls). This is mostly aimed at serving huge static files from a slow disk backend (typically an NFS export from a server holding all the disk), such as http://ftp.acc.umu.se/ and http://ftp.heanet.ie/ . Doing this with the current mod_disk_cache disk layout was not possible, doing the above without unnecessary locking means: * More or less atomic operations, so caching headers and data in separate files gets very messy if you want to keep consistency. * You can't use tempfiles since you want to be able to figure out where the data is to be able to reply while caching. * You want to know the size of the data in order to tell when you're done (ie the current size of a file isn't necessarily the real size of the body since it might be caching while we're reading it). In the light of our experiences, I really think that you want to have a concept that allows you to keep the bond between header and data. Yes, you can patch up a missing bond by requiring locking and stuff, but I really prefer not having to lock cache files when doing read access. When it comes to "make the common case fast" a lockless design is very much preferred. However, if all those issues are sorted out in the layer above disk cache then the above observations becomes more or less moot. 
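Niklas' third point, knowing the size up front, is what makes lockless reply-while-caching possible: if the header records the declared body length, a reader can tell an in-progress entry apart from a complete one just by comparing file sizes, with no lock at all. A sketch under an invented on-disk format:

```python
# Sketch of reply-while-caching completeness detection: the declared
# length (from Content-Length at store time) is recorded separately,
# so the body file's current size, which lags while the caching thread
# is still appending, never has to be trusted as the real size.

import os
import tempfile

d = tempfile.mkdtemp()
body_path = os.path.join(d, "entry.body")

declared_len = 11          # recorded in the header file at store time

def entry_complete():
    # No locking: a reader just compares the growing file against the
    # declared length to know whether caching has finished.
    return os.path.getsize(body_path) >= declared_len

# The caching thread writes the body incrementally...
with open(body_path, "wb") as f:
    f.write(b"hello ")
    f.flush()
    partial = entry_complete()   # a concurrent reader sees "in progress"
    f.write(b"world")
complete = entry_complete()      # after the last write: "complete"
```

A reader that reaches end-of-file before `entry_complete()` is true simply waits for more data rather than concluding the body ended, which is exactly why tempfile-then-rename cannot be used here.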
In any case the patch is more or less finished, independent testing and auditing haven't been done yet but I can submit a preliminary jumbo-patch if people are interested in having a look at it now. /Nikke -- -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se | [EMAIL PROTECTED] --- Want to forget all your troubles? Wear tight shoes. =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Re: Possible new cache architecture
On Mon, 01 May 2006 22:46:44 +0200 Graham Leggett <[EMAIL PROTECTED]> wrote: > Brian Akins wrote: > > >> That's two hits to find whether something is cached. > > > > You must have two hits if you support vary. > > You need only one - bring up the original cached entry with the key, and > then use cheap subkeys over a very limited data set to find both the > variants and the header/data. > > >> How are races prevented? > > > > shouldn't be any. something is in the cache or not. if one "piece" of > > an http "object" is not valid or in cache, the object is invalid. > > Although other variants may be valid/in cache. > > I can think of one race off the top of my head: > > - the browser says "send me this URL". > > - the cache has it cached, but it's stale, so it asks the backend > "If-None-Match". > > - the cache reaper comes along, says "oh, this is stale", and reaps the > cached body (which is independent, remember?). The data is no longer > cached even though the headers still exist. > > - The backend says "304 Not Modified". > > - the cache says "cool, will send my copy upstream. Oops, where has my > data gone?". Sorry, but this only happens in your imagination. It's pretty obvious that mod_http_cache will handle this. > The end user will probably experience this as "oh, the website had a > glitch, let me try again", so it won't be reported as a bug. No. > Ok, so you tried to lock the body before going to the backend, but > searching for and locking the body would have been an additional wasted > cache hit if the backend answered with its own body. Not to mention > having to write and debug code to do this. Locks are not necessary, perhaps you are imagining something very different. If a data body disappears under mod_http_cache it is not a big deal! It will refuse to serve the request from the cache and a new version of the page will be cached. > Races need to be properly handled, and atomic cache operations will go a > long way to prevent them. 
I think we are discussing apples and oranges. First, we only want to *organize* the current cache code into a more layered solution. The current semantics won't change, yet! -- Davi Arnaut
Re: Possible new cache architecture
Graham Leggett wrote: Brian Akins wrote: That's two hits to find whether something is cached. You must have two hits if you support vary. You need only one - bring up the original cached entry with the key, and then use cheap subkeys over a very limited data set to find both the variants and the header/data. How are races prevented? shouldn't be any. something is in the cache or not. if one "piece" of an http "object" is not valid or in cache, the object is invalid. Although other variants may be valid/in cache. I can think of one race off the top of my head: - the browser says "send me this URL". - the cache has it cached, but it's stale, so it asks the backend "If-None-Match". - the cache reaper comes along, says "oh, this is stale", and reaps the cached body (which is independent, remember?). The data is no longer cached even though the headers still exist. - The backend says "304 Not Modified". - the cache says "cool, will send my copy upstream. Oops, where has my data gone?". I think that can be avoided by, instead of reaping the cached body, actually setting aside the cached body (public > private), by changing its key or whatnot. Then - throw it away after the backend says "200 OK", and replace it with something new. Or, rekey it a second time (private > public) when the backend reports "304 NOT MODIFIED". In the race, one will set it aside looking for another, the second will make a fresh request (it doesn't see it in the cache), and either the first or second request will wrap up -last- to place the final copy back into the cache, replacing the document from the winner. No harm no foul. Bill
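Bill's set-aside idea maps naturally onto atomic rename on a disk-backed store: instead of deleting the stale body before revalidating, move it out of the public namespace, then move it back on a 304 or discard it on a 200. A sketch with invented paths (not mod_disk_cache's real layout):

```python
# Sketch of the public -> private set-aside: rename is atomic on POSIX
# filesystems, so the body is never half-deleted during revalidation,
# and a 304 restores it with a second rename.

import os
import tempfile

d = tempfile.mkdtemp()
public = os.path.join(d, "entry.body")
private = os.path.join(d, "entry.body.setaside")

with open(public, "wb") as f:
    f.write(b"stale-but-maybe-fresh")

# Entry went stale: set the body aside before asking the backend.
os.rename(public, private)          # public -> private, atomic

backend_status = 304                # pretend the backend said Not Modified
if backend_status == 304:
    os.rename(private, public)      # rekey back: the entry is fresh again
else:
    os.remove(private)              # 200: discard, new response replaces it

restored = open(public, "rb").read()
```

During the window where the body is set aside, a concurrent request simply misses the cache and fetches fresh, which is Bill's "no harm no foul": whichever request finishes last leaves the final copy in place.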
Re: Possible new cache architecture
Brian Akins wrote:
> > That's two hits to find whether something is cached.
>
> You must have two hits if you support vary.

You need only one - bring up the original cached entry with the key, and then use cheap subkeys over a very limited data set to find both the variants and the header/data.

> > How are races prevented?
>
> shouldn't be any. something is in the cache or not. if one "piece" of
> an http "object" is not valid or in cache, the object is invalid.
> Although other variants may be valid/in cache.

I can think of one race off the top of my head:

- the browser says "send me this URL".

- the cache has it cached, but it's stale, so it asks the backend "If-None-Match".

- the cache reaper comes along, says "oh, this is stale", and reaps the cached body (which is independent, remember?). The data is no longer cached even though the headers still exist.

- The backend says "304 Not Modified".

- the cache says "cool, will send my copy upstream. Oops, where has my data gone?".

The end user will probably experience this as "oh, the website had a glitch, let me try again", so it won't be reported as a bug.

Ok, so you tried to lock the body before going to the backend, but searching for and locking the body would have been an additional wasted cache hit if the backend answered with its own body. Not to mention having to write and debug code to do this.

Races need to be properly handled, and atomic cache operations will go a long way to prevent them.

Regards,
Graham
Re: Possible new cache architecture
On Mon, 01 May 2006 15:46:58 -0400 Brian Akins <[EMAIL PROTECTED]> wrote:

> Graham Leggett wrote:
> > That's two hits to find whether something is cached.
>
> You must have two hits if you support vary.
>
> > How are races prevented?
>
> shouldn't be any. something is in the cache or not. if one "piece" of
> an http "object" is not valid or in cache, the object is invalid.
> Although other variants may be valid/in cache.

More important, if we stick with the key/data concept it's possible to implement the header/body relationship under single or multiple keys.

I think Brian wants mod_cache to be only a layer (glue) between the underlying providers and the cache users. Each set of problems is better dealt with in its own layer. The storage layer (cache providers) will only worry about storing the key/data pairs (and expiring?) while the "protocol" layer will deal with the underlying concepts of each protocol (mod_http_cache).

The current design leads to bloat; just look at mem_cache and disk_cache: both have their own duplicated quirks (serialize/unserialize, et cetera) and need special handling of the headers and file format. Under the new design this duplication will be gone, since we will assemble the HTTP-specific part and generalize the storage part.

--
Davi Arnaut
Re: Possible new cache architecture
William A. Rowe, Jr. wrote:
> And, of course, inserting the hit once it's composed is important, and
> can happen in parallel (3 clients looking for the same, and then
> fetching the same page from the origin). But it's harmless if the
> insertion is mutex protected, and the insertion can only happen once
> the page is fetched complete.

In the case of mod_disk_cache, the way I would do it is to have a deterministic tempfile, rather than use apr_tempfile, and open it EXCL.

--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies
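The deterministic-tempfile idea above can be sketched in a few lines of C. The helper name and the path scheme are hypothetical, not mod_disk_cache's actual layout; the point is that every worker derives the *same* temp name from the cache key, so `O_CREAT|O_EXCL` guarantees only one worker at a time fills the entry:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Sketch: all workers derive the same temp path from the cache key,
 * so O_CREAT|O_EXCL ensures only the first opener fills the entry.
 * Path scheme and function name are hypothetical. */
static int open_insert_slot(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0 && errno == EEXIST) {
        return -1;  /* another worker is already fetching this entry */
    }
    return fd;      /* we won; write the body, then rename() into place */
}
```

On success the winning worker would write the response body and `rename()` the file to its final key-derived name, which is atomic on POSIX filesystems, so readers never see a half-written entry.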
Re: Possible new cache architecture
Brian Akins wrote:
> Graham Leggett wrote:
> > That's two hits to find whether something is cached.
>
> You must have two hits if you support vary.

Well, one to three hits. One, if you use an arbitrary page (MRU or most frequently referenced would be most optimal, but it really doesn't matter) and then determine what varies, and if you are in the right place, or what that right place is (page by language, or whatever fields it varied by.) Three hits or more if your variant also varies ;)

> > How are races prevented?
>
> shouldn't be any. something is in the cache or not. if one "piece" of
> an http "object" is not valid or in cache, the object is invalid.
> Although other variants may be valid/in cache.

And, of course, inserting the hit once it's composed is important, and can happen in parallel (3 clients looking for the same, and then fetching the same page from the origin). But it's harmless if the insertion is mutex protected, and the insertion can only happen once the page is fetched complete.
Re: Possible new cache architecture
Graham Leggett wrote:
> That's two hits to find whether something is cached.

You must have two hits if you support vary.

> How are races prevented?

shouldn't be any. something is in the cache or not. if one "piece" of an http "object" is not valid or in cache, the object is invalid. Although other variants may be valid/in cache.

--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies
Re: Possible new cache architecture
Graham Leggett wrote:
> Or you can avoid this issue entirely by building a generic cache that
> works with key/subkey/data.

and then you have to find a way to bridge the gap between this interface and all the key/value caches that currently exist (memcache being the most popular example).

What if mod_http_cache had a way to "record" its cached objects? It could keep up with the relationships there. Basically, you have a provider that has a few functions that get called whenever mod_http_cache caches or expires an object.

--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies
Re: Possible new cache architecture
Brian Akins wrote:
> Nope. Look at the way the current http cache works. An http "object,"
> headers and data, is only valid if both headers and data are valid.

That's two hits to find whether something is cached.

How are races prevented?

Regards,
Graham
Re: Possible new cache architecture
Graham Leggett wrote:
> the independent caching of variants.

The example I posted should address this issue. I also have some ideas concerning the thundering herd problem; it's just a matter of whether you think it should be handled in cache or http_cache.

--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies
Re: Possible new cache architecture
Davi Arnaut wrote:
> > It's a design flaw to create problems that have to be specially coded
> > around, when you can avoid the problem entirely.
>
> Maybe I'm missing something, what problems do you foresee ?

There are lots of issues that were uncovered when I split the proxy and cache code for httpd v2.0.

A web cache requires two separately alterable cached entities (headers, body) just for caching a single variant. This pair of entities needs to expire and/or be forcibly expired (think Cache-Control no-cache) atomically. Sure, you can code and debug a lot of code to try and create the effect of atomically expiring multiple cache entries at once. Or you can avoid this issue entirely by building a generic cache that works with key/subkey/data.

There are a number of other issues that have been listed as bugs since httpd v1.3 that are still present, most notably the thundering herd problem, and the independent caching of variants.

There is no point in refactoring the cache code if the new code isn't going to be significantly better than the existing code.

Regards,
Graham
Re: Possible new cache architecture
Davi Arnaut wrote:
> This way it would be possible for one cache to act as a cache of
> another cache provider, mod_mem_cache would work as a small/fast MRU
> cache for mod_disk_cache.

Slightly off subject, but in my testing, mod_disk_cache is much faster than mod_mem_cache. Thanks to sendfile!

I was thinking about scenarios where each cache had its local cache (disk, mem, whatever) with memcache behind it. That way each "object" only has to be generated once for the entire "farm." This would be an easy way to have a distributed cache. Also, the squid-type htcp (or icp) could be a fallback for the local cache as well, without mucking up all the proxy and cache code.

--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies
Re: Possible new cache architecture
On Mon, 01 May 2006 09:02:31 -0400 Brian Akins <[EMAIL PROTECTED]> wrote:

> Here is a scenario. We will assume a cache "hit."

I think the usage scenario is clear. Moving on, I would like to be able to stack up the cache providers (like the apache filter chain). Basically, mod_cache will expose the functions:

add(key, value, expiration, flag)
get(key)
remove(key)

mod_cache will then pass the request (add/get or remove) down the chain, similar to the apache filter chain, ie:

apr_status_t mem_cache_get_filter(ap_cache_filter_t *f, apr_bucket_brigade *bb, ...);
apr_status_t disk_cache_get_filter(ap_cache_filter_t *f, apr_bucket_brigade *bb, ...);

This way it would be possible for one cache to act as a cache of another cache provider; mod_mem_cache would work as a small/fast MRU cache for mod_disk_cache.

--
Davi Arnaut
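The stacked-provider idea above can be sketched with a minimal vtable. The struct layout and the single-entry "storage" are purely illustrative, not the proposed httpd API; the point is that `get()` walks the chain from the fastest layer to the slowest, like a filter chain:

```c
#include <string.h>
#include <stddef.h>

/* Sketch of stacked cache providers: each layer answers get() or
 * defers to the next (slower) layer. A real provider would also have
 * add()/remove() and real storage; this toy layer holds at most one
 * key/value pair. All names are hypothetical. */
typedef struct cache_layer {
    const char *(*get)(struct cache_layer *self, const char *key);
    struct cache_layer *next;   /* next (slower) provider, or NULL */
    const char *ent_key;        /* toy single-entry storage */
    const char *ent_val;
} cache_layer;

static const char *toy_get(cache_layer *self, const char *key)
{
    if (self->ent_key && strcmp(self->ent_key, key) == 0)
        return self->ent_val;
    return NULL;
}

/* mod_cache's view: ask each layer in turn until one hits. */
static const char *chain_get(cache_layer *layer, const char *key)
{
    for (; layer; layer = layer->next) {
        const char *val = layer->get(layer, key);
        if (val)
            return val;
    }
    return NULL;   /* miss in every layer */
}
```

With a mem layer chained in front of a disk layer, `chain_get(&mem, key)` returns the disk copy on a memory miss; a fuller sketch would also promote that hit into the faster layer, which is what makes mod_mem_cache act as an MRU cache for mod_disk_cache.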
Re: Possible new cache architecture
On Mon, 01 May 2006 14:51:53 +0200 Graham Leggett <[EMAIL PROTECTED]> wrote:

> Davi Arnaut wrote:
> >> mod_cache need not be HTTP specific, it only needs the ability to cache
> >> multiple entities (data, headers) under the same key, and be able to
> >> replace zero or more entities independently of the other entities (think
> >> updating headers without updating content).
> >
> > mod_cache needs only to cache key/value pairs. The key/value format is up to
> > the mod_cache user.
>
> It's a design flaw to create problems that have to be specially coded
> around, when you can avoid the problem entirely.

Maybe I'm missing something, what problems do you foresee ?

> The cache needs to be generic, yes - but there is no need to stick to
> the "key/value" cliché of cache code, if a variation to this is going to
> make your life significantly easier.

And the variation is..?

--
Davi Arnaut
Re: Possible new cache architecture
Here is a scenario. We will assume a cache "hit."

Client asks for http://domain/uri.html?args

mod_http_cache generates a key: http-domain-uri.html-args-header, and asks mod_cache for the value with this key.

mod_cache fetches the value, looks at the expire time, it's good, and returns the "blob".

mod_http_cache examines the blob; it is vary information on Accept-Encoding.

mod_http_cache generates a new key: http-domain-uri.html-args-header-gzip (value from client), and asks mod_cache for the value with this key.

mod_cache fetches the value, looks at the expire time, it's good, and returns the "blob".

mod_http_cache examines the blob; it's a normal header blob. It does not "meet conditions", so we need to get the data.

mod_http_cache generates a new key: http-domain-uri.html-args-data-gzip (value from client), and asks mod_cache for the value with this key.

mod_cache fetches the value, looks at the expire time, it's good, and returns the "blob".

mod_http_cache returns headers and data to the client.

Notice there is a pattern to this...

--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies
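The pattern in the scenario above is just string concatenation over a fixed field order. A sketch of the key builder (the helper name and exact separator are hypothetical; only the scheme-host-uri-args-kind[-variant] pattern comes from the scenario):

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch of mod_http_cache key generation following the
 * scenario above: scheme-host-uri-args-kind[-variant]. */
static char *http_cache_key(char *buf, size_t len,
                            const char *scheme, const char *host,
                            const char *uri, const char *args,
                            const char *kind, const char *variant)
{
    if (variant)
        snprintf(buf, len, "%s-%s-%s-%s-%s-%s",
                 scheme, host, uri, args, kind, variant);
    else
        snprintf(buf, len, "%s-%s-%s-%s-%s",
                 scheme, host, uri, args, kind);
    return buf;
}
```

Since mod_cache only ever sees the finished string, the vary handling stays entirely inside mod_http_cache: a second lookup with the variant appended is indistinguishable, to the provider, from any other key.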
Re: Possible new cache architecture
Davi Arnaut wrote:
> mod_cache needs only to cache key/value pairs. The key/value format is
> up to the mod_cache user.

correct.

--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies
Re: Possible new cache architecture
Davi Arnaut wrote:
> > mod_cache need not be HTTP specific, it only needs the ability to cache
> > multiple entities (data, headers) under the same key, and be able to
> > replace zero or more entities independently of the other entities (think
> > updating headers without updating content).
>
> mod_cache needs only to cache key/value pairs. The key/value format is
> up to the mod_cache user.

It's a design flaw to create problems that have to be specially coded around, when you can avoid the problem entirely.

The cache needs to be generic, yes - but there is no need to stick to the "key/value" cliché of cache code, if a variation to this is going to make your life significantly easier.

Regards,
Graham
Re: Possible new cache architecture
Graham Leggett wrote:
> The potential danger with this is for race conditions to happen while
> expiring cache entries. If the data entity expired before the header
> entity, it potentially could confuse the cache - is the entry cached or
> not? The headers say yes, data says no.

Nope. Look at the way the current http cache works. An http "object," headers and data, is only valid if both headers and data are valid.

> Each variant should be an independent cached entry, the cache should
> allow different variants to be cached side by side.

Yes. Each is distinguished by its key.

> > As far as mod_cache is concerned these are 3 independent entries, but
> > mod_http_cache knows how to "stitch" them together.
> >
> > mod_cache should *not* be HTTP specific in any way.
>
> mod_cache need not be HTTP specific, it only needs the ability to cache
> multiple entities (data, headers) under the same key,

No.

> In other words, there must be the ability to cache by a key and a subkey.

No. mod_http_cache generates new keys for headers (key.header), data (key.data), and each variant (key1.header, key2.header, key1.data... etc.). As far as the underlying generic cache is concerned, they are all independent entries.

--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies
Re: Possible new cache architecture
On Sun, 30 Apr 2006 22:38:23 +0200 Graham Leggett <[EMAIL PROTECTED]> wrote:

> Brian Akins wrote:
> > mod_http_cache could just cache headers and data as separate cache entries.
>
> The potential danger with this is for race conditions to happen while
> expiring cache entries. If the data entity expired before the header
> entity, it potentially could confuse the cache - is the entry cached or
> not? The headers say yes, data says no.

If both the data and header have the same expiration time they should both be removed atomically, but this would be hard to achieve. The trick is to set the header to expire before the data. Also, this would confuse the cache user, not the cache itself.

> > So a given HTTP "object" may actually have 3 entries in the cache:
> > -first entry says: Vary on x,y,z
> > -second entry is headers for new key (generated with the vary info)
> > -third entry is the actual data
>
> Each variant should be an independent cached entry, the cache should
> allow different variants to be cached side by side.
>
> > As far as mod_cache is concerned these are 3 independent entries, but
> > mod_http_cache knows how to "stitch" them together.
> >
> > mod_cache should *not* be HTTP specific in any way.
>
> mod_cache need not be HTTP specific, it only needs the ability to cache
> multiple entities (data, headers) under the same key, and be able to
> replace zero or more entities independently of the other entities (think
> updating headers without updating content).

mod_cache needs only to cache key/value pairs. The key/value format is up to the mod_cache user.

--
Davi Arnaut
Re: Possible new cache architecture
Brian Akins wrote:
> mod_http_cache could just cache headers and data as separate cache entries.

The potential danger with this is for race conditions to happen while expiring cache entries. If the data entity expired before the header entity, it potentially could confuse the cache - is the entry cached or not? The headers say yes, data says no.

> So a given HTTP "object" may actually have 3 entries in the cache:
> -first entry says: Vary on x,y,z
> -second entry is headers for new key (generated with the vary info)
> -third entry is the actual data

Each variant should be an independent cached entry, the cache should allow different variants to be cached side by side.

> As far as mod_cache is concerned these are 3 independent entries, but
> mod_http_cache knows how to "stitch" them together.
>
> mod_cache should *not* be HTTP specific in any way.

mod_cache need not be HTTP specific, it only needs the ability to cache multiple entities (data, headers) under the same key, and be able to replace zero or more entities independently of the other entities (think updating headers without updating content). In other words, there must be the ability to cache by a key and a subkey.

Regards,
Graham
Re: Possible new cache architecture
Graham Leggett wrote:
> A question to ponder is just how generic should the cache be. An HTTP
> cache requires cache entries containing data and headers, either of
> which can be updated separately.

So any given HTTP "object" would actually be two objects in the cache: headers and data.

> As a result, the typical "cache a blob of data" interface isn't going
> to work, and needs to be kept in mind when looking at the cache
> interfaces.

mod_http_cache could just cache headers and data as separate cache entries.

So a given HTTP "object" may actually have 3 entries in the cache:
-first entry says: Vary on x,y,z
-second entry is headers for new key (generated with the vary info)
-third entry is the actual data

As far as mod_cache is concerned these are 3 independent entries, but mod_http_cache knows how to "stitch" them together.

mod_cache should *not* be HTTP specific in any way.

--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies
Re: Possible new cache architecture
Brian Akins wrote:
> The components:
>
> mod_http_cache: what mod_cache is currently
> mod_cache: a generic caching module - provides glue between providers
> and other modules. Think mod_dbd...
> cache providers: disk, mem, memcache, mysql, etc.

This sounds like a refactoring job, which is a good idea.

I think step one would be to rename mod_cache to be mod_http_cache as you suggest, then create a blank mod_cache, followed by some refactoring of the generalised methods and cache hooks into mod_cache. I think this exercise should uncover what needs to move, and what needs to be changed.

A question to ponder is just how generic should the cache be. An HTTP cache requires cache entries containing data and headers, either of which can be updated separately. As a result, the typical "cache a blob of data" interface isn't going to work, and needs to be kept in mind when looking at the cache interfaces.

Regards,
Graham
Re: Possible new cache architecture
Brian Akins wrote:
> Some functions a provider should provide:
>
> init(args...) - initialize an instance :)
> open(instance, key) - open a cache object
> read_buffer(object, buffer, copy) - read entire object into buffer.
> buffer may be read only (ie, it may be mmapped or part of sql statement)
> or make it a copy.
> read_bb(object, brigade, copy) - read object into a brigade. copy if
> flag is set
> store_bb(object, brigade) - store a bucket brigade
> store_buffer(object, buffer) - store a blob of data
> close(object)
>
> Thoughts? I'm sure we may need more/better cache provider functions.

It would be helpful if the provider can notify mod_cache (using some sort of callback function) when it is removing an object from its cache, so that mod_cache can take a look at the object being removed and decide to push it to the next, less resource-critical provider. So if mem_cache_provider decides to remove the LRU object, mod_cache can push it to disk_cache_provider.
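The notify-on-eviction idea above could look roughly like this: the provider invokes a callback registered by mod_cache before discarding an entry, and mod_cache may then re-store it through a cheaper provider. All names here are hypothetical, and the "provider" holds just one entry for illustration:

```c
#include <stddef.h>

/* Sketch of an eviction callback: the provider tells mod_cache what it
 * is about to drop, so mod_cache can demote the entry to the next,
 * less resource-critical provider. All names are hypothetical. */
typedef void (*cache_evict_cb)(const char *key, const void *data,
                               size_t len, void *ctx);

typedef struct toy_provider {
    const char *key;          /* toy single-entry storage */
    const void *data;
    size_t len;
    cache_evict_cb on_evict;  /* registered by mod_cache */
    void *evict_ctx;
} toy_provider;

/* Drop the current entry, notifying mod_cache first. */
static void toy_evict(toy_provider *p)
{
    if (p->key && p->on_evict)
        p->on_evict(p->key, p->data, p->len, p->evict_ctx);
    p->key = NULL;
    p->data = NULL;
    p->len = 0;
}

static int demoted = 0;

/* Stand-in for mod_cache pushing the entry to disk_cache_provider. */
static void demote_to_disk(const char *key, const void *data,
                           size_t len, void *ctx)
{
    (void)key; (void)data; (void)len; (void)ctx;
    demoted = 1;
}
```

So when mem_cache_provider evicts its LRU object, the callback fires with the key and data still intact, and mod_cache can hand them to the disk provider before the memory is released.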
Re: Possible new cache architecture
On Thursday 27 April 2006 15:04, Brian Akins wrote:

> The components:

How would this fit with the various half-HTTP caching standards floating around, and the SoC projects that have been mooted? It seems to me that cache is ripe for generalisation.

> mod_http_cache: what mod_cache is currently
> mod_cache: a generic caching module - provides glue between providers
> and other modules. Think mod_dbd...

... or mod_proxy ...

--
Nick Kew
Re: Possible new cache architecture
Bart van der Schans wrote:
> One thing about the current implementation. Mod_cache does server side
> caching, but also sets expires headers which trigger client (browser)
> caching. Right now you can't turn off setting the expires header with
> mod_cache. I think it would be nice to have an option to configure
> this. WDYT?

Yes, that should be configurable.

--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies
Re: Possible new cache architecture
Brian Akins wrote:
> The components:
> mod_http_cache: what mod_cache is currently

One thing about the current implementation: mod_cache does server side caching, but also sets expires headers which trigger client (browser) caching. Right now you can't turn off setting the expires header with mod_cache. I think it would be nice to have an option to configure this. WDYT?

Bart

--
Hippo
Oosteinde 11
1017WT Amsterdam
The Netherlands
Tel +31 (0)20 5224466 - [EMAIL PROTECTED] / http://www.hippo.nl
Re: Possible new cache architecture
Brian Akins wrote:
> mod_cache: a generic caching module - provides glue between providers

The more I think about it, this part doesn't even need to be httpd specific. It could be apr_cache. Not sure how that would screw things up. I also noticed that the whole providers thing is httpd and not apr...

--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies
Possible new cache architecture
The components:

mod_http_cache: what mod_cache is currently
mod_cache: a generic caching module - provides glue between providers and other modules. Think mod_dbd...
cache providers: disk, mem, memcache, mysql, etc.

An example mod_http_cache:

-generate cache key
-ask mod_cache for object with key
-mod_cache checks provider(s) and returns object on "hit"
-object may contain vary info, regenerate key and ask mod_cache with new key (this would be equivalent to header)
-ask mod_cache for the body
-serve data to client

This would remove all the HTTP specific stuff from the cache providers, and Vary could be handled in a central location (mod_http_cache). And it *should* be fairly trivial to write and stack cache providers.

Some functions a provider should provide:

init(args...) - initialize an instance :)
open(instance, key) - open a cache object
read_buffer(object, buffer, copy) - read entire object into buffer. buffer may be read only (ie, it may be mmapped or part of sql statement) or make it a copy.
read_bb(object, brigade, copy) - read object into a brigade. copy if flag is set
store_bb(object, brigade) - store a bucket brigade
store_buffer(object, buffer) - store a blob of data
close(object)

Thoughts? I'm sure we may need more/better cache provider functions.

--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies
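One way to picture the provider list above is as a set of functions around an opaque object handle. Here is a toy in-memory version; the signatures are simplified and hypothetical (a real provider would return apr_status_t, take an instance from init(), and have brigade variants), but the open/store/read/close lifecycle matches the list:

```c
#include <stdlib.h>
#include <string.h>

/* Toy sketch of the provider interface: open() yields an object handle
 * for a key, store_buffer()/read_buffer() move blobs in and out,
 * close() releases the handle. Names and signatures are hypothetical. */
typedef struct cache_object {
    char *data;
    size_t len;
} cache_object;

static cache_object *cache_open(const char *key)
{
    (void)key;                 /* toy: the key is not actually indexed */
    return calloc(1, sizeof(cache_object));
}

static int cache_store_buffer(cache_object *obj, const void *buf, size_t len)
{
    obj->data = malloc(len);
    if (!obj->data)
        return -1;
    memcpy(obj->data, buf, len);
    obj->len = len;
    return 0;
}

static int cache_read_buffer(cache_object *obj, const void **buf, size_t *len)
{
    *buf = obj->data;          /* read-only view, no copy */
    *len = obj->len;
    return 0;
}

static void cache_close(cache_object *obj)
{
    free(obj->data);
    free(obj);
}
```

The read-only (no-copy) path is the interesting design point: a disk provider could hand back an mmapped region and a SQL provider a pointer into its result set, with the `copy` flag in the proposed interface deciding whether the caller gets a private duplicate instead.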