Re: [PATCH] mod_disk cached fixed

2004-08-05 Thread Brian Akins
[EMAIL PROTECTED] wrote:
I thought you were referring to the scheme you proposed in an
earlier email where you WERE planning on doing just that...
Once again, I probably was just unclear or, frankly, said what I didn't 
mean :)

Take your logic just a tiny step farther.
If you know EVERYTHING about the data... then you CERTAINLY can/would.
In the situations in which we are speaking of currently, the "origin" 
and the "cache" may indeed be the same box, or one controlled by the 
same person.  In these situations, you should be able to "influence" the 
Vary'ing scheme.


When does a response REALLY (actually) Vary and
why should you have to store tons and tons of
responses all the time?
Like I said, if you know a lot about the content, then you know when to 
REALLY vary and when to "cheat."

If you MD5 and/or hard-CRC/checksum the actual BODY
DATA of a response and it does not differ one iota
from another (earlier) non-expired response to a
request for the same URI... then those 2 'responses'
DO NOT VARY.
True.  But you have to make the decision of which version to serve 
based purely on the client's request headers.

Even if you get 25,000 different 'User-Agents' asking
for the same URI... there will most probably only be
a small sub-set of actual RESPONSES coming from the
COS.

Correct.  This is why using SetEnvIf to set an environment variable is a 
better solution than varying on the "string value" of the User-Agent header.
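
As a rough illustration (the directives are real mod_setenvif ones, but the
pattern and the choice of variable are only an example, not a recommendation):

    # Collapse thousands of distinct User-Agent strings into one coarse
    # flag; a cache keyed on this flag needs at most two variants per URL
    # instead of one per User-Agent value.
    SetEnvIf User-Agent ".*MSIE [1-3]|MSIE [1-5].*Mac.*|^Mozilla/[1-4].*Nav" no-gzip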

Topic for discussion?
Kindling for a flame war?
Not sure... but you raise an interesting question.
Yes :)
Only the most recent versions of SQUID make any attempt
to support 'Vary:' at all. 

We know a few things about Squid.  And what you say is true.  It 
supports it "well enough" for what we use it for.  We serve 2 billion-plus 
hits a day using Squid.

Note: this does not imply that CNN or Time Warner in any way endorse 
squid or any mentioned product.


It should remain THEIR choice, no matter what.
Yes.  But in the case of caching your own content, as long as you leave 
the Vary headers intact, it shouldn't matter how you actually handle 
the Vary.

--
Brian Akins
Senior Systems Engineer
CNN Internet Technologies


Re: [PATCH] mod_disk cached fixed

2004-08-05 Thread TOKILEY

> Brian Akins wrote...
>
> >[EMAIL PROTECTED] wrote...
> >
> >
> > > Brian Akins wrote...
> > >
> > > Serving cached content:
> > >
> > > - lookup uri in cache (via md5?).
> > > - check varies - a list of headers to vary on
> > > - calculate new key (md5) based on uri and client's value of these headers
> > > - lookup new uri in cache
> > > - continue as normal
> >
> > Don't forget that you can't just 'MD5' a header from one response and
> > compare it to an 'MD5' value for same header field from another response.
> >
>
> This isn't what I meant.  I mean get the "first-level" key by the md5 of 
> the uri, not the headers.

Ok... fine... but when you wrote this...

"calculate new key (md5) based on uri AND client's value of these headers"

The AND is what got me worried.

I thought you were referring to the scheme you proposed in an
earlier email where you WERE planning on doing just that...

> Brian Akins wrote...
>
> I actually have somewhat of a solution:
>
> URL encode the uri and any vary elements:
> www.cnn.com/index.html?this=that
> Accept-Encoding: gzip
> Cookie: Special=SomeValue
>
> may become:
> 
> www.cnn.com%2Findex.html%3Fthis%3Dthat+Accept-Encoding%3A+gzip+Cookie%3A+Special%3DSomeValue
>
> A very simple hashing function could put this in some directory 
> structure, so the file on disk may be:
>
>/var/cache/apache/00/89/www.cnn.com%2Findex.html%3Fthis%3Dthat+Accept-Encoding%3A+gzip+Cookie%3A+Special%3DSomeValue
>
>Should be pretty fast (?) if the urlencode was efficient.
>
> Brian Akins

Not that this wouldn't actually WORK under some circumstances
( It might ) but it would qualify as just a 'hack'. It wouldn't
qualify as a good way to perform RFC standard 'Vary:'.
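
For contrast, here is a bare-bones sketch of the two-level md5 keying Brian
clarified above (first-level key from the URI alone, second-level key from
the URI plus the client's values of the Vary'd headers).  Illustrative only;
the function names are mine, not from any posted patch:

    #include <string.h>
    #include "apr_md5.h"        /* apr_md5(), APR_MD5_DIGESTSIZE (apr-util) */
    #include "apr_strings.h"    /* apr_pstrdup, apr_pstrcat, apr_strtok */
    #include "httpd.h"          /* request_rec, apr_table_get */

    /* Hex-encode the md5 of a string: 32 chars plus NUL. */
    static char *md5_hex(apr_pool_t *p, const char *data)
    {
        unsigned char digest[APR_MD5_DIGESTSIZE];
        char *hex = apr_palloc(p, 2 * APR_MD5_DIGESTSIZE + 1);
        int i;

        apr_md5(digest, data, strlen(data));
        for (i = 0; i < APR_MD5_DIGESTSIZE; i++) {
            apr_snprintf(hex + 2 * i, 3, "%02x", digest[i]);
        }
        return hex;
    }

    /* First-level key: md5_hex(r->pool, r->uri).  The first couple of hex
     * bytes can double as directory levels, e.g. /00/89/<key>.
     * Second-level key: URI plus the client's values of the headers named
     * in the Vary list stored with the first-level entry. */
    static char *second_level_key(request_rec *r, const char *vary)
    {
        char *buf  = apr_pstrdup(r->pool, r->uri);
        char *list = apr_pstrdup(r->pool, vary);
        char *name, *last;

        for (name = apr_strtok(list, ", ", &last);
             name != NULL;
             name = apr_strtok(NULL, ", ", &last)) {
            const char *val = apr_table_get(r->headers_in, name);
            buf = apr_pstrcat(r->pool, buf, "|", name, "=",
                              val ? val : "", NULL);
        }
        return md5_hex(r->pool, buf);
    }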

> Brian Akins also wrote...
> >
> > > BrowserMatch ".*MSIE [1-3]|MSIE [1-5].*Mac.*|^Mozilla/[1-4].*Nav" no-gzip
> > >
> > > and just "vary" on no-gzip (1 or 0), but this may be hard to do just
> > > using headers...
> >
> > It's not hard to do at all... question would be whether it's ever
> > the 'right' thing to do.
> >
>
> If you know a lot about the data you can do this.  In "reverse proxy" 
> mode, you would.

Take your logic just a tiny step farther.

If you know EVERYTHING about the data... then you CERTAINLY can/would.

You have just hit on something that should probably be discussed
further.

The whole reason "Vary:" was even created was so that COS ( Content
Origin Servers ) could tell downstream caches to come back upstream
for a 'freshness' check for reasons OTHER than simple time/date
based 'expiration' rules.

I am not certain but I believe it was actually the whole "User-Agent:"
deal that made it necessary. When it became obvious that different
major release browsers had completely different levels of HTTP
support, and the HTML that might work for one would puke on another,
it became necessary to have 'Multi-Variants' of the same
response. I am sure the 'scheme' was intended to ( and certainly
will ) handle all kinds of other situations ( Cookie values would 
be second, I guess ) but IIRC there was no more pressing issue 
for 'Vary:' and 'Multiple Variants of a request to the same URI' 
than to solve the emerging 'User-Agent:' nightmare.

So that's all well and good.

There really SHOULD be a way for any cache to hold 2 different
copies of the same non-expired page for both MSIE and Netscape,
when the only reason to do so is that the HTML that works for
one (still) might not work for the other.

But that leads back to YOUR idea ( concern )...

When does a response REALLY (actually) Vary and
why should you have to store tons and tons of 
responses all the time?

That's easy... when the entire response for the same
URI differs in any way from an earlier ( non-expired )
response to a request for the same URI... only then
does it 'actually Vary'.

If you MD5 and/or hard-CRC/checksum the actual BODY
DATA of a response and it does not differ one iota
from another (earlier) non-expired response to a
request for the same URI... then those 2 'responses'
DO NOT VARY.

It is only when the RESPONSE DATA itself is 'different'
that it can be said the responses truly 'Vary'.

So here is the deal...

Even if you get 25,000 different 'User-Agents' asking
for the same URI... there will most probably only be
a small sub-set of actual RESPONSES coming from the
COS. It is only THAT sub-set of responses that need
to be stored by a cache and 'associated' with the
different ( Varying ) User-Agents.

So that doesn't mean a (smart) cache needs to store
25,000 variants of the same response... It only needs
to STORE responses that ACTUALLY VARY.

How the sub-sets of 'Varying' responses get 'associated'
with the right set(s) of 'Varying' header field(s)
( i.e. User-Agent ) is something the 'Vary:' scheme
lacks; it simply was not considered in the design.
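
To make the missing 'association' piece concrete, here is a rough sketch of
one way a cache could do it (nothing like this exists in mod_cache today and
all names are made up): hash the response body, store each distinct body
exactly once, and keep a small index from the Vary'd request header values
to the body digest.

    #include "apr_md5.h"      /* apr_md5(), APR_MD5_DIGESTSIZE (apr-util) */
    #include "apr_hash.h"
    #include "apr_strings.h"

    typedef struct {
        apr_hash_t *variant_to_digest;  /* vary'd header values -> hex digest */
        apr_hash_t *digest_to_file;     /* hex digest -> one stored body file */
    } variant_index_t;

    /* Called once the full body has arrived from the origin server. */
    static const char *remember_variant(apr_pool_t *p, variant_index_t *idx,
                                        const char *variant_key,
                                        const char *body, apr_size_t body_len,
                                        const char *stored_file)
    {
        unsigned char d[APR_MD5_DIGESTSIZE];
        char *hex = apr_palloc(p, 2 * APR_MD5_DIGESTSIZE + 1);
        int i;

        apr_md5(d, body, body_len);
        for (i = 0; i < APR_MD5_DIGESTSIZE; i++) {
            apr_snprintf(hex + 2 * i, 3, "%02x", d[i]);
        }

        /* Store the body only if this digest has never been seen before:
         * 25,000 User-Agents that get byte-identical bodies share one file. */
        if (apr_hash_get(idx->digest_to_file, hex, APR_HASH_KEY_STRING) == NULL) {
            apr_hash_set(idx->digest_to_file, hex, APR_HASH_KEY_STRING,
                         stored_file);
        }
        /* Either way, point this variant at the shared body. */
        apr_hash_set(idx->variant_to_digest, variant_key, APR_HASH_KEY_STRING,
                     hex);
        return hex;
    }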

Topic for discussion?
Kindling for a flame war?
Not sure... but you raise an interesting question.

> Brian Akins also wrote...
> 
> > [EMAIL PROTECTED] wrote...
> >
> > That's why it (Vary) re

Re: Re: [PATCH] tweak worker MPM "recovery" after pthread_create failure in child

2004-08-05 Thread Jeff Trawick
On Thu, 05 Aug 2004 08:44:34 -0400, Brian Akins <[EMAIL PROTECTED]> wrote:
> Jeff Trawick wrote:
> 
> >I've been working with a user on RedHat {A|E}S 2.1 that is
> >intermittently getting one of these thread create failures as the web
> >server tries to create a new child process to handle increased load.
> >It really sucks that the entire server bails out.
> >
> I think it's the stack size.  Try "ulimit -s 1024" before starting.

So I try "ulimit -s" on my RHAS 2.1 box and it says 8192.  So if this
is related to pthread_create failures, then I gather that:

a) default THREAD stack size is 8MB
b) Apache doesn't normally need nearly that much; 1MB should be sufficient

FWIW, I played with the Apache 2.1 directive ThreadStackSize (does
pthread_attr_setstacksize), and linuxthreads was the only platform
where it didn't seem to make a difference.  It was easy to demonstrate
that it was respected on some other pthread platforms (AIX, Solaris,
and HP-UX).
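
For reference, a minimal standalone sketch of what the directive boils down
to (this is not httpd source; the sizes are simply the ones discussed above):

    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        (void)arg;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        pthread_attr_t attr;

        pthread_attr_init(&attr);
        /* Ask for a 1MB stack instead of the 8MB glibc/linuxthreads
         * default; with 25 threads per child that is roughly 25MB of
         * reserved address space instead of roughly 200MB. */
        pthread_attr_setstacksize(&attr, 1024 * 1024);

        if (pthread_create(&tid, &attr, worker, NULL) != 0) {
            perror("pthread_create");
            return 1;
        }
        pthread_join(tid, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }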

Thanks!


[PATCH] mod_disk_cache: Use binary header format

2004-08-05 Thread Justin Erenkrantz
This patch fully fleshes out Brian's earlier patch and tries to optimize
the header on-disk format.  The current disk format didn't make any sense.
It was also full of security holes (reading into a 1034-byte char buffer).  Only
the headers are still CRLF-delimited.  We could go further and replace
ap_scan_script_header_err(), but I think there are other things to sort out
first.  (I believe that the fixes to apr_file_gets() make those calls less
important to optimize.)
Note this is even extensible - the format field will allow us to hot-deploy
new formats - on a mismatch, it'll overwrite the existing file.  -- justin
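
For readers skimming the diff below, a tiny standalone sketch of the
read/write round trip the new format implies.  The write side is not in the
excerpt, and the struct, fields, and function names here are illustrative
rather than the patch's own:

    #include "apr_file_io.h"   /* apr_file_read_full / apr_file_write_full */
    #include "apr_time.h"

    #define FORMAT_VERSION 0

    typedef struct {
        int format;        /* bump on any layout change */
        int status;
        apr_time_t date;
        apr_time_t expire;
    } hdr_rec;

    /* One fixed-size binary write replaces the old hex-and-CRLF text line.
     * Raw struct I/O is only safe because reader and writer are the same
     * build (same padding/endianness) - hence the format field. */
    static apr_status_t write_hdr(apr_file_t *fd, const hdr_rec *h)
    {
        return apr_file_write_full(fd, h, sizeof(*h), NULL);
    }

    static apr_status_t read_hdr(apr_file_t *fd, hdr_rec *h)
    {
        apr_size_t len = sizeof(*h);
        apr_status_t rv = apr_file_read_full(fd, h, len, &len);

        if (rv != APR_SUCCESS) {
            return rv;
        }
        if (h->format != FORMAT_VERSION) {
            /* Stale on-disk layout: caller treats the entry as a miss
             * and overwrites the file. */
            return APR_EGENERAL;
        }
        return APR_SUCCESS;
    }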
* modules/experimental/mod_disk_cache.c: Switch to a binary format for 
header files.

Index: modules/experimental/mod_disk_cache.c
===================================================================
RCS file: /home/cvspublic/httpd-2.0/modules/experimental/mod_disk_cache.c,v
retrieving revision 1.56
diff -u -r1.56 mod_disk_cache.c
--- modules/experimental/mod_disk_cache.c   5 Aug 2004 08:27:24 -   1.56
+++ modules/experimental/mod_disk_cache.c   5 Aug 2004 08:55:01 -
@@ -23,6 +23,32 @@
#include <unistd.h> /* needed for unlink/link */
#endif
+/* Our on-disk header format is:
+ *
+ * disk_cache_info_t
+ * entity name (dobj->name)
+ * r->headers_out (delimited by CRLF)
+ * CRLF
+ * r->headers_in (delimited by CRLF)
+ * CRLF
+ */
+#define DISK_FORMAT_VERSION 0
+typedef struct {
+/* Indicates the format of the header struct stored on-disk. */
+int format;
+/* The HTTP status code returned for this response.  */
+int status;
+/* The size of the entity name that follows. */
+apr_size_t name_len;
+/* The number of times we've cached this entity. */
+apr_size_t entity_version;
+/* Miscellaneous time values. */
+apr_time_t date;
+apr_time_t expire;
+apr_time_t request_time;
+apr_time_t response_time;
+} disk_cache_info_t;
+
/*
 * disk_cache_object_t
 * Pointed to by cache_object_t::vobj
@@ -37,12 +63,13 @@
char *datafile;  /* name of file where the data will go */
char *hdrsfile;  /* name of file where the hdrs will go */
char *name;
-apr_time_t version;  /* update count of the file */
apr_file_t *fd;  /* data file */
apr_file_t *hfd; /* headers file */
apr_off_t file_size; /*  File size of the cached data file  */
+disk_cache_info_t disk_info; /* Header information. */
} disk_cache_object_t;
+
/*
 * mod_disk_cache configuration
 */
@@ -187,104 +214,55 @@
 * and written transparent to clients of this module
 */
static int file_cache_recall_mydata(apr_file_t *fd, cache_info *info,
-  disk_cache_object_t *dobj)
+disk_cache_object_t *dobj, request_rec *r)
{
apr_status_t rv;
-char urlbuff[1034]; /* XXX FIXME... THIS IS A POTENTIAL SECURITY HOLE */
-int urllen = sizeof(urlbuff);
-int offset=0;
+char *urlbuff;
char * temp;
+disk_cache_info_t disk_info;
+apr_size_t len;

/* read the data from the cache file */
-/* format
- * date SP expire SP count CRLF
- * dates are stored as a hex representation of apr_time_t (number of
- * microseconds since 00:00:00 January 1, 1970 UTC)
- */
-rv = apr_file_gets(&urlbuff[0], urllen, fd);
+len = sizeof(disk_cache_info_t);
+rv = apr_file_read_full(fd, &disk_info, len, &len);
if (rv != APR_SUCCESS) {
return rv;
}
-if ((temp = strchr(&urlbuff[0], '\n')) != NULL) /* trim off new line character */
-*temp = '\0';  /* overlay it with the null terminator */
-
-if (!apr_date_checkmask(urlbuff, "    ")) {
+if (disk_info.format != DISK_FORMAT_VERSION) {
+ap_log_error(APLOG_MARK, APLOG_ERR, 0, r->server,
+ "cache_disk: URL %s had a on-disk version mismatch",
+ r->uri);
return APR_EGENERAL;
}

-info->date = ap_cache_hex2usec(urlbuff + offset);
-offset += (sizeof(info->date)*2) + 1;
-info->expire = ap_cache_hex2usec(urlbuff + offset);
-offset += (sizeof(info->expire)*2) + 1;
-dobj->version = ap_cache_hex2usec(urlbuff + offset);
-offset += (sizeof(info->expire)*2) + 1;
-info->request_time = ap_cache_hex2usec(urlbuff + offset);
-offset += (sizeof(info->expire)*2) + 1;
-info->response_time = ap_cache_hex2usec(urlbuff + offset);
+/* Store it away so we can get it later. */
+dobj->disk_info = disk_info;
-/* check that we have the same URL */
-rv = apr_file_gets(&urlbuff[0], urllen, fd);
+info->date = disk_info.date;
+info->expire = disk_info.expire;
+info->request_time = disk_info.request_time;
+info->response_time = disk_info.response_time;
+
+/* Note that we could optimize this by conditionally doing the palloc
+ * depending upon the size. */
+urlbuff = apr_palloc(r->po