Re: mod_proxy caching documentation

2000-11-16 Thread Joshua Chamas

Dan McCormick wrote:
 
 Hi,
 
 After struggling with trying to figure out mod_proxy's caching algorithm
 and noting from the list archives that others had, too -- and due to
 the dearth of existing documentation on the subject -- I came up with
 some documentation below by sifting through the source code.  Most of it
 isn't explicitly mod_perl-related, but I hope those trying to set it up

Thanks for the read.  Very enlightening.  I'm guessing
the dir levels matter because they let the files be
spread over that many more directories, so there isn't a
large directory hashing penalty on a HUGE number of files.
5 is probably a bit much though if it really creates 4-5
directories for each file it stores, and if you are using
this only for a proxy in reverse mode for mod_perl, it's likely
you could get away with 2-3 levels.

I think it would be interesting if you chronicled the capacity 
improvements to your site using the mod_proxy server like this.  
I don't know how well mod_proxy does this caching from a performance
perspective, and it might be nice to see some numbers that
one could later compare with some of the commercial caching
products.

--Joshua

 will find it useful.  Included at the end is a Perl script to determine
 the filename that mod_proxy uses to cache files, which is helpful in
 manually cleaning up the cache.  If anyone has comments or
 suggestions, please let me know.
 
 Thanks,
 Dan
 
 
 
 Setting up Apache with mod_proxy to cache content from a mod_perl server
 
 The documentation for mod_proxy can be found at
 http://httpd.apache.org/docs/mod/mod_proxy.html.  Unfortunately, aside
 from the configuration parameters, not much detail is provided on how to
 set up mod_proxy to cache pages from a downstream server.  This
 explanation hopes to fill that void.  Most of its content was derived by
 going through the mod_proxy.c, proxy_cache.c, and proxy_util.c source
 files and comments in the src/modules/proxy directory of the Apache
 1.3.12 distribution.
 
 * The Short Story
 
 In short, mod_proxy will cache all responses that contain a Last-Modified
 header and an Expires header.  You can insert these headers in your
 mod_perl scripts with something like this:
 
 use Apache::File ();
 use HTTP::Date;
 
 # see Eagle book p. 493 for explanation
 $r->set_last_modified((stat $r->finfo)[9]);
 # expires in one day
 $r->header_out('Expires', HTTP::Date::time2str(time + 24*60*60));
 
 The page will live in the cache until the current time passes the time
 defined by the Expires header or the time since the file was cached
 exceeds the CacheMaxExpire parameter as set in the server config file.
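
[Editor's sketch] The expiry rule just described can be written as a small predicate. This is only an illustration of the rule as stated in this post, not code taken from mod_proxy; the function name and its arguments are hypothetical, and CacheMaxExpire is assumed to have been converted from hours to seconds by the caller.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 once a cached page should be discarded.
# $cached_at and $expires are epoch seconds; $max_expire is the
# CacheMaxExpire setting converted to seconds (an assumption of this sketch).
sub cache_entry_expired {
    my ($now, $cached_at, $expires, $max_expire) = @_;
    return 1 if $now > $expires;                    # Expires time has passed
    return 1 if $now - $cached_at > $max_expire;    # held longer than CacheMaxExpire
    return 0;
}

my $cached_at = 1_000_000;
my $expires   = $cached_at + 48 * 3600;   # the page asks for two days
my $max       = 24 * 3600;                # CacheMaxExpire of 24 hours

# One hour after caching, then 25 hours after caching.
print cache_entry_expired($cached_at + 3600,      $cached_at, $expires, $max) ? "stale\n" : "fresh\n";
print cache_entry_expired($cached_at + 25 * 3600, $cached_at, $expires, $max) ? "stale\n" : "fresh\n";
```

In the second call the Expires time is still in the future, but the page has been held longer than the cap, so whichever limit is reached first wins.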
 
 * The Long Story
 
 To understand how the caching proxy server works, let's trace the flow
 of two simple HTTP exchanges for the same file, from the browser request
 to the returned page.
 
 - The browser makes a request to the proxy server like this:
 
 GET /index.html HTTP/1.0
 
 - The proxy server takes the URL and converts it to a filename on your
 filesystem.  This filename has no resemblance to the actual URL.
 Instead, it is an MD5 hash of the fully qualified URL of the document
 (e.g. http://www.myserver.com:80/mypage.html), broken up into a number
 of directory levels, as defined by the CacheDirLevels
 parameter in the config file.  (WHY DOES IT MATTER HOW MANY DIR LEVELS
 ARE IN THE CACHE?)  Each of these directories will have a certain number
 of characters in its name, as defined by the CacheDirLength parameter in
 the config file.  The directories will live under CacheRoot, also
 defined in the config file.  For example, /index.html might be converted
 to /proxy_cache/m/EYRopVKBHMrHd2VF6WXOQ (with CacheDirLevels and
 CacheDirLength set to 1 and CacheRoot set to /proxy_cache).
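
[Editor's sketch] To make the directory layout concrete, here is a rough illustration of how an already-encoded hash name could be spread over directory levels. It does not compute mod_proxy's hash; the encoded name is simply taken from the example above, and the function name cache_path is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: lay an already-encoded hash name out under CacheRoot.
# $name stands in for the encoded MD5 that mod_proxy derives from the URL;
# computing that encoding is deliberately not attempted here.
sub cache_path {
    my ($root, $name, $levels, $length) = @_;
    my @dirs;
    for (1 .. $levels) {
        # peel $length characters off the front of the name for each level
        push @dirs, substr($name, 0, $length, '');
    }
    return join '/', $root, @dirs, $name;
}

# CacheDirLevels 1, CacheDirLength 1, as in the example above
print cache_path('/proxy_cache', 'mEYRopVKBHMrHd2VF6WXOQ', 1, 1), "\n";
# CacheDirLevels 3, CacheDirLength 1
print cache_path('/proxy_cache', 'mEYRopVKBHMrHd2VF6WXOQ', 3, 1), "\n";
```

With more levels, the same name is spread across more nested directories, which keeps any single directory from holding every cached file.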
 
 - For this example, we'll assume that at this point the cached file does
 not exist.  The proxy server therefore forwards the request to
 the mod_perl server and gets a response back.  The response will then be
 cached UNLESS any of the following conditions are true
 (ap_proxy_cache_update):
  - The HTTP status returned by the mod_perl server is not one of OK,
 HTTP_MOVED_PERMANENTLY, or HTTP_NOT_MODIFIED
  - The response does not contain an Expires header
  - The response contains an Expires header that Apache can't parse
  - The HTTP status is OK but there's not a Last-Modified header
  - The mod_perl server sent only an HTTP header
  - The mod_perl server sent an Authorization field in the header
 (Furthermore, if any of the above conditions are met, any existing
 cached file will be deleted.)
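
[Editor's sketch] The conditions above can be read as a single yes/no test. The following mirrors the list exactly as written in this post rather than the Apache source; the should_cache function and the layout of its argument hash are invented for illustration. The expires_epoch field is assumed to hold the parsed Expires time, or undef if parsing failed (for example, the result of HTTP::Date::str2time).

```perl
use strict;
use warnings;

# Hypothetical predicate mirroring the list above. %resp is an invented
# structure, not anything mod_proxy exposes:
#   status        numeric HTTP status
#   headers       hash ref of header name => value
#   expires_epoch parsed Expires time, undef if it could not be parsed
#   header_only   true if only an HTTP header was sent
sub should_cache {
    my (%resp) = @_;
    my %cacheable = map { $_ => 1 } (200, 301, 304);  # OK, MOVED_PERMANENTLY, NOT_MODIFIED
    my $h = $resp{headers} || {};

    return 0 unless $cacheable{ $resp{status} };
    return 0 unless defined $h->{'Expires'};          # no Expires header
    return 0 unless defined $resp{expires_epoch};     # Expires present but unparseable
    return 0 if $resp{status} == 200 && !defined $h->{'Last-Modified'};
    return 0 if $resp{header_only};
    return 0 if defined $h->{'Authorization'};
    return 1;
}

my %good = (
    status        => 200,
    expires_epoch => 974_419_200,
    headers       => {
        'Expires'       => 'Fri, 17 Nov 2000 00:00:00 GMT',
        'Last-Modified' => 'Wed, 15 Nov 2000 00:00:00 GMT',
    },
);
print should_cache(%good) ? "cache\n" : "skip\n";
print should_cache(%good, status => 404) ? "cache\n" : "skip\n";
```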
 
 - If the server decides to cache the file, it will store the file
 exactly as it was received from the mod_perl server, with the addition
 of a one-line header at the start of the file.  This header contains the
 following information in the following format:
 <current time> <last modified time> <expiration time> <"version"> <content length>
 
 All times are stored as hex seconds since 1970 and 

Re: mod_proxy caching documentation

2000-11-16 Thread barries

On Thu, Nov 16, 2000 at 01:37:41AM -0500, Dan McCormick wrote:
 
 I came up with  some documentation below by sifting through the
 source code.

Excellent, thanks!

If a malformed Expires: prevents mod_proxy from caching a response (

 The response will then be
 cached UNLESS any of the following conditions are true
 (ap_proxy_cache_update):
[snip]
  - The response contains an Expires header that Apache can't parse

), why do they go to some lengths to make up for a malformed one (

 if the Expires time cannot be parsed and
 a Last Modified time exists from the previous step, then the Expires
 time is set to "now + min((date - lastmod) * factor, maxexpire)" (as
 noted in the source code comments)

)?  I'm assuming that it can because that's a bit of extra logic that
wouldn't need to be there otherwise.  Or maybe it's leftover code that
never fires.
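
[Editor's sketch] For readers following the quoted formula, here is a small worked example. It only evaluates "now + min((date - lastmod) * factor, maxexpire)" as quoted; the function name is hypothetical, and it is an assumption of this sketch that factor and maxexpire correspond to the CacheLastModifiedFactor and CacheMaxExpire settings (the latter converted to seconds).

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper for the fallback formula quoted above.
# All times are epoch seconds; $factor and $max_expire are assumed to come
# from the proxy configuration.
sub estimated_expiry {
    my ($now, $date, $lastmod, $factor, $max_expire) = @_;
    return $now + min(($date - $lastmod) * $factor, $max_expire);
}

my $now = 1_000_000;
my $day = 86_400;

# Last modified one day ago, factor 0.5: lifetime is half a day.
print estimated_expiry($now, $now, $now - $day, 0.5, $day) - $now, "\n";
# Last modified ten days ago, factor 0.5: five days exceeds the cap, so one day.
print estimated_expiry($now, $now, $now - 10 * $day, 0.5, $day) - $now, "\n";
```

The effect is that recently changed documents get short lifetimes, older ones get longer lifetimes, and nothing is held past the configured maximum.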

I thought (not that I remember why) that it didn't need an Expires: header
and that it would make up a value of its own based on the .conf settings.

- Barrie