Re: [PATCH] mod_disk cached fixed
[EMAIL PROTECTED] wrote:

> I thought you were referring to the scheme you proposed in an earlier
> email where you WERE planning on doing just that...

Once again, I probably was just unclear or, frankly, said what I didn't
mean :)

> Take your logic just a tiny step farther. If you know EVERYTHING about
> the data... then you CERTAINLY can/would.

In the situations in which we are speaking of currently, the "origin" and
the "cache" may indeed be the same box, or one controlled by the same
person. In these situations, you should be able to "influence" the
Vary'ing scheme.

> When does a response REALLY (actually) Vary and why should you have to
> store tons and tons of responses all the time?

Like I said, if you know a lot about the content, then you know when to
REALLY vary and when to "cheat."

> If you MD5 and/or hard-CRC/checksum the actual BODY DATA of a response
> and it does not differ one iota from another (earlier) non-expired
> response to a request for the same URI... then those 2 'responses' DO
> NOT VARY.

True. But you have to make the decision of which version to serve based
purely on the client's request headers.

> Even if you get 25,000 different 'User-Agents' asking for the same
> URI... there will most probably only be a small sub-set of actual
> RESPONSES coming from the COS.

Correct. This is how using SetEnvIf to set an environment variable is a
better solution than varying on the "string value" of the User-Agent
header.

> Topic for discussion? Kindling for a flame war? Not sure... but you
> raise an interesting question.

Yes :)

> Only the most recent versions of SQUID make any attempt to support
> 'Vary:' at all.

We know a few things about squid. And what you say is true. It supports
it "well enough" for what we use it for. We serve 2 billion + hits a day
using Squid. Note: this does not imply that CNN or Time Warner in any way
endorse squid or any mentioned product.

> It should remain THEIR choice, no matter what.

Yes. But in the case of caching your own content, as long as you leave
the Vary headers intact, it shouldn't matter how you actually handle the
Vary.

--
Brian Akins
Senior Systems Engineer
CNN Internet Technologies
Re: [PATCH] mod_disk cached fixed
> Brian Akins wrote...
>
> > [EMAIL PROTECTED] wrote...
> >
> > > Brian Akins wrote...
> > >
> > > Serving cached content:
> > >
> > > - lookup uri in cache (via md5?).
> > > - check varies - a list of headers to vary on
> > > - calculate new key (md5) based on uri and clients value of these headers
> > > - lookup new uri in cache
> > > - continue as normal
> >
> > Don't forget that you can't just 'MD5' a header from one response and
> > compare it to an 'MD5' value for same header field from another
> > response.
>
> This isn't what I meant. I mean get the "first-level" key by the md5 of
> the uri, not the headers.

Ok... fine... but when you wrote this...

"calculate new key (md5) based on uri AND clients value of these headers"

The AND is what got me worried. I thought you were referring to the
scheme you proposed in an earlier email where you WERE planning on doing
just that...

> Brian Akins wrote...
>
> I actually have somewhat of a solution:
>
> URL encode the uri and any vary elements:
>
> www.cnn.com/index.html?this=that
> Accept-Encoding: gzip
> Cookie: Special=SomeValue
>
> may become:
>
> www.cnn.com%2Findex.html%3Fthis%3Dthat+Accept-Encoding%3A+gzip+Cookie%3A+Special%3DSomeValue
>
> A very simple hashing function could put this in some directory
> structure, so the file on disk may be:
>
> /var/cache/apache/00/89/www.cnn.com%2Findex.html%3Fthis%3Dthat+Accept-Encoding%3A+gzip+Cookie%3A+Special%3DSomeValue
>
> Should be pretty fast (?) if the urlencode was efficient.
>
> Brian Akins

Not that this wouldn't actually WORK under some circumstances ( It might )
but it would qualify as just a 'hack'. It wouldn't qualify as a good way
to perform RFC standard 'Vary:'.

> Brian Akins also wrote...
>
> > > BrowserMatch ".*MSIE [1-3]|MSIE [1-5].*Mac.*|^Mozilla/[1-4].*Nav"
> > > no-gzip
> > >
> > > and just "vary" on no-gzip (1 or 0), but this may be hard to do just
> > > using headers...
> >
> > It's not hard to do at all... question would be whether it's ever
> > the 'right' thing to do.
>
> If you know a lot about the data you can do this. In "reverse proxy"
> mode, you would.

Take your logic just a tiny step farther. If you know EVERYTHING about
the data... then you CERTAINLY can/would.

You have just hit on something that should probably be discussed further.

The whole reason "Vary:" was even created was so that COS ( Content
Origin Servers ) could tell downstream caches to come back upstream for a
'freshness' check for reasons OTHER than simple time/date based
'expiration' rules.

I am not certain but I believe it was actually the whole "User-Agent:"
deal that made it necessary. When it became obvious that different major
release browsers had completely different levels of HTTP support and the
HTML that might work for one would puke on another, then it became
necessary to have 'Multi-Variants' of the same response.

I am sure the 'scheme' was intended to ( and certainly will ) handle all
kinds of other situations ( Cookie values would be second, I guess ) but
IIRC there was no more pressing issue for 'Vary:' and 'Multiple Variants
of a request to the same URI' than to solve the emerging 'User-Agent:'
nightmare.

So that's all well and good. There really SHOULD be a way for any cache
to hold 2 different copies of the same non-expired page for both MSIE and
Netscape, when the only reason to do so is that the HTML that works for
one (still) might not work for the other.

But that leads back to YOUR idea ( concern )...

When does a response REALLY (actually) Vary and why should you have to
store tons and tons of responses all the time?

That's easy... when the entire response for the same URI differs in any
way from an earlier ( non-expired ) response to a request for the same
URI... only then does it 'actually Vary'.
If you MD5 and/or hard-CRC/checksum the actual BODY DATA of a response
and it does not differ one iota from another (earlier) non-expired
response to a request for the same URI... then those 2 'responses' DO NOT
VARY. It is only when the RESPONSE DATA itself is 'different' that it can
be said the responses truly 'Vary'.

So here is the deal...

Even if you get 25,000 different 'User-Agents' asking for the same URI...
there will most probably only be a small sub-set of actual RESPONSES
coming from the COS. It is only THAT sub-set of responses that needs to
be stored by a cache and 'associated' with the different ( Varying )
User-Agents.

So that doesn't mean a (smart) cache needs to store 25,000 variants of
the same response... It only needs to STORE responses that ACTUALLY VARY.

How the sub-sets of 'Varying' responses get 'associated' with the right
set(s) of 'Varying' header field(s) ( ie. User-Agent ) is something that
the 'Vary:' scheme lacks and was not considered in the design.

Topic for discussion? Kindling for a flame war? Not sure... but you raise
an interesting question.

> Brian Akins also wrote...
>
> > [EMAIL PROTECTED] wrote...
> >
> > That's why it (Vary) re
Re: Re: [PATCH] tweak worker MPM "recovery" after pthread_create failure in child
On Thu, 05 Aug 2004 08:44:34 -0400, Brian Akins <[EMAIL PROTECTED]> wrote:
> Jeff Trawick wrote:
>
> > I've been working with a user on RedHat {A|E}S 2.1 that is
> > intermittently getting one of these thread create failures as the web
> > server tries to create a new child process to handle increased load.
> > It really sucks that the entire server bails out.
>
> I think it's the stack size. Try "ulimit -s 1024" before starting.

So I try "ulimit -s" on my RHAS 2.1 box and it says 8192. So if this is
related to pthread_create failures, then I gather that:

a) default THREAD stack size is 8MB
b) Apache doesn't normally need nearly that much; 1MB should be sufficient

FWIW, I played with the Apache 2.1 directive ThreadStackSize (does
pthread_attr_setstacksize), and linuxthreads was the only platform where
it didn't seem to make a difference. It was easy to demonstrate that it
was respected on some other pthread platforms (AIX, Solaris, and HP-UX).

Thanks!
[PATCH] mod_disk_cache: Use binary header format
This patch fully fleshes out Brian's earlier patch and tries to optimize
the header on-disk format. The current disk format didn't make any sense.
It was also full of security holes (reading into 1034-byte char buffers).

Only the headers are still CRLF-delimited. We could go further and
replace ap_scan_script_header_err(), but I think there are other things
to sort out first. (I believe that the fixes to apr_file_gets() make
those calls not as important to optimize.)

Note this is even extensible - the format field will allow us to
hot-deploy new formats - on a mismatch, it'll overwrite the existing
file. -- justin

* modules/experimental/mod_disk_cache.c: Switch to a binary format for
  header files.

Index: modules/experimental/mod_disk_cache.c
===================================================================
RCS file: /home/cvspublic/httpd-2.0/modules/experimental/mod_disk_cache.c,v
retrieving revision 1.56
diff -u -r1.56 mod_disk_cache.c
--- modules/experimental/mod_disk_cache.c	5 Aug 2004 08:27:24 -0000	1.56
+++ modules/experimental/mod_disk_cache.c	5 Aug 2004 08:55:01 -0000
@@ -23,6 +23,32 @@
 #include <unistd.h> /* needed for unlink/link */
 #endif
 
+/* Our on-disk header format is:
+ *
+ * disk_cache_info_t
+ * entity name (dobj->name)
+ * r->headers_out (delimited by CRLF)
+ * CRLF
+ * r->headers_in (delimited by CRLF)
+ * CRLF
+ */
+#define DISK_FORMAT_VERSION 0
+typedef struct {
+    /* Indicates the format of the header struct stored on-disk. */
+    int format;
+    /* The HTTP status code returned for this response. */
+    int status;
+    /* The size of the entity name that follows. */
+    apr_size_t name_len;
+    /* The number of times we've cached this entity. */
+    apr_size_t entity_version;
+    /* Miscellaneous time values. */
+    apr_time_t date;
+    apr_time_t expire;
+    apr_time_t request_time;
+    apr_time_t response_time;
+} disk_cache_info_t;
+
 /*
  * disk_cache_object_t
  * Pointed to by cache_object_t::vobj
@@ -37,12 +63,13 @@
     char *datafile;          /* name of file where the data will go */
     char *hdrsfile;          /* name of file where the hdrs will go */
     char *name;
-    apr_time_t version;      /* update count of the file */
     apr_file_t *fd;          /* data file */
     apr_file_t *hfd;         /* headers file */
     apr_off_t file_size;     /* File size of the cached data file */
+    disk_cache_info_t disk_info;  /* Header information. */
 } disk_cache_object_t;
 
+
 /*
  * mod_disk_cache configuration
  */
@@ -187,104 +214,55 @@
  * and written transparent to clients of this module
  */
 static int file_cache_recall_mydata(apr_file_t *fd, cache_info *info,
-                                    disk_cache_object_t *dobj)
+                                    disk_cache_object_t *dobj, request_rec *r)
 {
     apr_status_t rv;
-    char urlbuff[1034];  /* XXX FIXME... THIS IS A POTENTIAL SECURITY HOLE */
-    int urllen = sizeof(urlbuff);
-    int offset=0;
+    char *urlbuff;
     char * temp;
+    disk_cache_info_t disk_info;
+    apr_size_t len;
 
     /* read the data from the cache file */
-    /* format
-     * date SP expire SP count CRLF
-     * dates are stored as a hex representation of apr_time_t (number of
-     * microseconds since 00:00:00 January 1, 1970 UTC)
-     */
-    rv = apr_file_gets(&urlbuff[0], urllen, fd);
+    len = sizeof(disk_cache_info_t);
+    rv = apr_file_read_full(fd, &disk_info, len, &len);
     if (rv != APR_SUCCESS) {
         return rv;
     }
 
-    if ((temp = strchr(&urlbuff[0], '\n')) != NULL) /* trim off new line character */
-        *temp = '\0';      /* overlay it with the null terminator */
-
-    if (!apr_date_checkmask(urlbuff, " ")) {
+    if (disk_info.format != DISK_FORMAT_VERSION) {
+        ap_log_error(APLOG_MARK, APLOG_ERR, 0, r->server,
+                     "cache_disk: URL %s had an on-disk version mismatch",
+                     r->uri);
         return APR_EGENERAL;
     }
 
-    info->date = ap_cache_hex2usec(urlbuff + offset);
-    offset += (sizeof(info->date)*2) + 1;
-    info->expire = ap_cache_hex2usec(urlbuff + offset);
-    offset += (sizeof(info->expire)*2) + 1;
-    dobj->version = ap_cache_hex2usec(urlbuff + offset);
-    offset += (sizeof(info->expire)*2) + 1;
-    info->request_time = ap_cache_hex2usec(urlbuff + offset);
-    offset += (sizeof(info->expire)*2) + 1;
-    info->response_time = ap_cache_hex2usec(urlbuff + offset);
+    /* Store it away so we can get it later. */
+    dobj->disk_info = disk_info;
 
-    /* check that we have the same URL */
-    rv = apr_file_gets(&urlbuff[0], urllen, fd);
+    info->date = disk_info.date;
+    info->expire = disk_info.expire;
+    info->request_time = disk_info.request_time;
+    info->response_time = disk_info.response_time;
+
+    /* Note that we could optimize this by conditionally doing the palloc
+     * depending upon the size. */
+    urlbuff = apr_palloc(r->po