Dan McCormick wrote:
Hi,
After struggling with trying to figure out mod_proxy's caching algorithm
and noting from the list archives that others had, too -- and due to
the dearth of existing documentation on the subject -- I came up with
some documentation below by sifting through the source code. Most of it
isn't explicitly mod_perl-related, but I hope those trying to set it up
Thanks for the read. Very enlightening. I'm guessing
the dir levels matter because they let the files be
spread over that many more directories, so there isn't a
large directory hashing penalty on a HUGE number of files.
5 is probably a bit much though if it really creates 4-5
directories for each file it stores, and if you are using
this only for a proxy in reverse mode for mod_perl, it's likely
you could get away with 2-3 levels.
I think it would be interesting if you chronicled the capacity
improvements to your site using the mod_proxy server like this.
I don't know how well mod_proxy does this caching from a performance
perspective, and it might be nice to see some numbers that
one could later compare with some of the commercial caching
products.
--Joshua
will find it useful. Included at the end is a Perl script to determine
the filename that mod_proxy uses to cache files, which is helpful in
manually cleaning up the cache. If anyone has comments or
suggestions, please let me know.
Thanks,
Dan
Setting up Apache with mod_proxy to cache content from a mod_perl server
The documentation for mod_proxy can be found at
http://httpd.apache.org/docs/mod/mod_proxy.html. Unfortunately, aside
from the configuration parameters, not much detail is provided on how to
set up mod_proxy to cache pages from a downstream server. This
explanation hopes to fill that void. Most of its content was derived by
going through the mod_proxy.c, proxy_cache.c, and proxy_util.c source
files and comments in the src/modules/proxy directory of the Apache
1.3.12 distribution.
* The Short Story
In short, mod_proxy will cache all responses that contain a Last-Modified
header and an Expires header. You can add these headers in your mod_perl
scripts with something like this:
use Apache::File ();
use HTTP::Date;
$r->set_last_modified((stat $r->finfo)[9]); # see Eagle book p. 493 for explanation
$r->header_out('Expires', HTTP::Date::time2str(time + 24*60*60)); # expires in one day
The page will live in the cache until the current time passes the time
defined by the Expires header or the time since the file was cached
exceeds the CacheMaxExpire parameter as set in the server config file.
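The expiry rule just described can be sketched as a small Perl function. This only restates the rule above and is not mod_proxy's own code; the function name and arguments are made up for illustration, and all values are assumed to be in seconds since 1970 (CacheMaxExpire is given in hours in the config file, so it would need converting first).

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 while a cached page is still usable,
# 0 once either limit from the text has been passed.
#   $now        - current time (epoch seconds)
#   $cached_at  - time the file was cached (epoch seconds)
#   $expires_at - time from the Expires header (epoch seconds)
#   $max_expire - CacheMaxExpire, already converted to seconds
sub is_still_cached {
    my ($now, $cached_at, $expires_at, $max_expire) = @_;
    return 0 if $now > $expires_at;               # past the Expires time
    return 0 if $now - $cached_at > $max_expire;  # older than CacheMaxExpire
    return 1;
}
```

For example, with a one-day CacheMaxExpire, is_still_cached(time, $cached_at, $expires_at, 24*60*60) stops returning 1 a day after caching even if the Expires header points further into the future.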
* The Long Story
To understand how the caching proxy server works, let's trace the flow
of two simple HTTP exchanges for the same file, from the browser request
to the returned page.
- The browser makes a request to the proxy server like this:
GET /index.html HTTP/1.0
- The proxy server takes the URL and converts it to a filename on your
filesystem. This filename bears no resemblance to the actual URL.
Instead, it is an MD5 hash of the fully qualified URL of the document
(e.g. http://www.myserver.com:80/mypage.html), split
into a number of directory levels, as defined by the CacheDirLevels
parameter in the config file. (WHY DOES IT MATTER HOW MANY DIR LEVELS
ARE IN THE CACHE?) Each of these directories will have a certain number
of characters in its name, as defined by the CacheDirLength parameter in
the config file. The directories will live under CacheRoot, also
defined in the config file. For example, /index.html might be converted
to /proxy_cache/m/EYRopVKBHMrHd2VF6WXOQ (with CacheDirLevels and
CacheDirLength set to 1 and CacheRoot set to /proxy_cache).
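To make the directory splitting concrete, here is a minimal Perl sketch. It assumes you already have the encoded hash string (the 22 characters "mEYRopVKBHMrHd2VF6WXOQ" from the example above); computing that string from the URL is not shown here. The function name cache_path is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split an already-encoded hash into
# CacheDirLevels directories of CacheDirLength characters each,
# and place the result under CacheRoot.
sub cache_path {
    my ($cache_root, $hash, $dir_levels, $dir_length) = @_;
    my @dirs;
    my $rest = $hash;
    for (1 .. $dir_levels) {
        # 4-argument substr returns the prefix and removes it from $rest
        push @dirs, substr($rest, 0, $dir_length, '');
    }
    return join '/', $cache_root, @dirs, $rest;
}

# CacheRoot /proxy_cache, CacheDirLevels 1, CacheDirLength 1
print cache_path('/proxy_cache', 'mEYRopVKBHMrHd2VF6WXOQ', 1, 1), "\n";
```

With the settings from the example this prints /proxy_cache/m/EYRopVKBHMrHd2VF6WXOQ. Raising CacheDirLevels to 2 would give /proxy_cache/m/E/YRopVKBHMrHd2VF6WXOQ, so each extra level spreads the files over more directories.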
- For this example, we'll assume that at this point the cached file does
not exist. The proxy server therefore forwards the request to
the mod_perl server and gets a response back. The response will then be
cached UNLESS any of the following conditions is true
(ap_proxy_cache_update):
- The HTTP status returned by the mod_perl server is not one of OK,
HTTP_MOVED_PERMANENTLY, or HTTP_NOT_MODIFIED
- The response does not contain an Expires header
- The response contains an Expires header that Apache can't parse
- The HTTP status is OK but there's not a Last-Modified header
- The mod_perl server sent only an HTTP header
- The mod_perl server sent an Authorization field in the header
(Furthermore, if any of the above conditions are met, any existing
cached file will be deleted.)
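The conditions above can be written out as a Perl sketch, which may be handy for checking your own responses before blaming the proxy. This is only a paraphrase of the list, not the code in ap_proxy_cache_update. The response structure (status, headers, header_only) is hypothetical, and HTTP::Date's str2time stands in for Apache's own date parsing, which may accept a different set of formats.

```perl
use strict;
use warnings;
use HTTP::Date ();

# Hypothetical response structure:
#   status      => numeric HTTP status
#   headers     => { 'Header-Name' => 'value', ... }
#   header_only => true if only an HTTP header was sent
sub is_cacheable {
    my ($resp) = @_;
    my $h = $resp->{headers} || {};

    # Status must be OK (200), Moved Permanently (301) or Not Modified (304)
    return 0 unless grep { $resp->{status} == $_ } 200, 301, 304;
    # An Expires header must be present ...
    return 0 unless defined $h->{'Expires'};
    # ... and parseable (str2time returns undef if it cannot parse)
    return 0 unless defined HTTP::Date::str2time($h->{'Expires'});
    # A 200 response also needs Last-Modified
    return 0 if $resp->{status} == 200 && !defined $h->{'Last-Modified'};
    # Header-only responses are not cached
    return 0 if $resp->{header_only};
    # Neither are responses carrying an Authorization field
    return 0 if defined $h->{'Authorization'};
    return 1;
}
```

A response for which is_cacheable returns 0 would, per the note above, also cause any existing cached copy to be deleted.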
- If the server decides to cache the file, it will store the file
exactly as it was received from the mod_perl server, with the addition
of a one-line header at the start of the file. This header contains the
following information in the following format:
<current time> <last modified time> <expiration time> <"version"> <content length>
All times are stored as hex seconds since 1970 and