Ruediger Pluem wrote:
What information do your cookies contain? Are these session cookies that
are individual to each client? In this case the usage of mod_disk_cache
with Vary Cookies set would be bad. As these responses would be individual
you couldn't reuse the results anyway for other clients, so it would be
the best to leave caching to the individual client caches (e.g. browser caches).
If your cookies are like BACKGROUND=blue for some users and BACKGROUND=red
for other users you should think of incorporating these differences into
the URLs instead of into varying responses.

I use two cookies currently - one for user logins and one for options. They are independent - people browsing the site may have either, or both, or neither set.

I need to cache all dynamically generated content so that the server can cope with slashdottings and links from other popular sites where lots of people all click on the same link at the same time ("click storms"). Such links could go to any page on the site, and so I really need to cache almost everything from mod_perl - with the exception of areas of the site which are obviously user-specific, such as edit forms, users' personal pages and so on. Those are no-cache.

I am very careful about setting expiration times, since with it being a dynamic site and all, you don't want too many stale pages. So many of the indexes (e.g. list of latest journal updates) have an expiration of only 1-3 minutes, while other journal pages have expiration of 12 hours or more.

I keep a 'version' field as part of the database records for most content on the site, which is incremented whenever an object is edited. Then when someone edits a journal, I include a special 'v=xxx' parameter in subsequent links to pages on that journal, to differentiate it from earlier versions. So the links from the (fast expiring) index pages such as forums or journals index will quickly have the new link with the new version. This allows me to have extensively cached content while still having people see new edits quickly. Thus the cache is fairly high turnover.
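
To illustrate the versioned-link idea, here is a minimal sketch in Python (the site itself runs on mod_perl, so this is not the actual code). The function name versioned_url and the example paths are hypothetical; only the 'v' parameter name comes from the description above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def versioned_url(url, version):
    """Return url with its 'v' query parameter set to the given version.

    Any existing 'v' parameter is replaced, so an edited object gets a new
    URL and therefore a new cache entry, while old entries simply expire.
    """
    parts = urlsplit(url)
    # Keep every query parameter except a stale 'v'.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "v"]
    query.append(("v", str(version)))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(versioned_url("/journal/page?id=12", 7))      # /journal/page?id=12&v=7
    print(versioned_url("/journal/page?id=12&v=3", 4))  # /journal/page?id=12&v=4
```

Because the version is part of the URL, the cache never has to be told about an edit: the fast-expiring index pages pick up the new link, and the old cached page is just never requested again.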

mod_disk_cache works very well; the only issue is keeping the cache size under control without iowait becoming noticeable as a result. I have found that a combination of changes makes the server very comfortable: keeping the limit down to 100M rather than 1000M, setting CacheDirLevels to 2 rather than 3, clearing out the orphaned .header files, and running htcacheclean and my header pruning script every 10 minutes. With these in place, the iowait drops to unnoticeable levels.

All the app-level code here was developed by me. This is a community website for bicycle touring journals. It currently sees somewhere north of 100,000 page requests per day, according to analog (and that's not including googlebot, which is on there constantly). I am very interested in configuring the site to be able to run efficiently on one reasonably well-spec'd server. Caching dynamic content is a major part of being able to scale well to cope with click storms.

Regarding the performance you should take a look at the following:

1. Use a separate filesystem for the cache.
2. Ensure that it is mounted with noatime option.
3. Check if you are using the right type of filesystem for this job. If the
   size of the individual cache files is rather small reiserfs can be much
   faster than ext3 if I remember correctly.

I currently use ext2 with noatime for the main filesystem (including cache). I went to ext2 from ext3 because ext3 has extra overhead related to keeping the journal (I believe that is the big difference between the two these days). Though I do not have numbers, I do seem to have seen disk performance increase since going back to ext2. I'm not sure if you can install dir_index with ext2 without turning it into ext3 in the process, but in any case I don't have dir_index enabled currently.

I was aware of the potential for using other filesystems for the cache, and had thought about reiserfs as a possibility. However after I wrote to the httpd users list a few weeks back asking about this very issue, I got zero responses. I then went to the squid group and asked there too, and similarly got zero useful responses. I agree that reiserfs might handle many small files better, but I am wary of using that since the trial of Hans Reiser - it kind of calls the future of his tool into question, unfortunately.

2. Why does htcacheclean not keep the cache at the stated size limit? If
you say -l100M and then do a du and it says 200M, then that is
counter-intuitive, and actually wrong in real terms. It gets worse with
the larger caches - when I had 3 levels and cookie Vary headers on, the
limit for htcacheclean was 1000M, but the cache would grow to 3GB and up.

Again, this is an issue with the documentation. In fact htcacheclean does
not limit the size of the cache at all. It can grow indefinitely.
It only ensures that the size of the cache is being reduced back at least
to the given limit after it ran. The size of the cache is defined as the
sum of all filesizes in the cache. It does not consider the disk usage of
these files which can be larger and it also doesn't take the sizes of the
directories into account. I am not sure if a du like measurement of the
cache size would be implementable in a platform independent way, but I
may be wrong here.
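
The gap between the two measurements described above can be seen with a short sketch. This is Python for illustration only, not how htcacheclean is implemented; it assumes a Unix-like system where st_blocks is reported in 512-byte units, and the function name cache_sizes is hypothetical.

```python
import os


def cache_sizes(root):
    """Return (apparent_bytes, on_disk_bytes) for the tree under root.

    apparent_bytes is the sum of the file sizes only, which is the
    definition of cache size described above. on_disk_bytes is a rough
    du-like figure that counts allocated blocks for files and directories.
    """
    apparent = 0
    on_disk = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        # Directories occupy blocks too, but are not part of the file-size sum.
        on_disk += os.lstat(dirpath).st_blocks * 512
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            apparent += st.st_size
            on_disk += st.st_blocks * 512
    return apparent, on_disk


if __name__ == "__main__":
    apparent, on_disk = cache_sizes("/var/cache/www")
    print("sum of file sizes: %d bytes" % apparent)
    print("allocated on disk: %d bytes" % on_disk)
```

With many small files and deep directory levels, the second number can be several times the first, which would be consistent with du reporting 200M against a 100M limit.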

Ok, that's fine. You're right, it sounds like a documentation issue.

This seems to be a bug. Can you please check whether the following patch fixes this?

I applied the patch and rebuilt httpd_proxy successfully. The new htcacheclean runs ok, but still seems to leave behind the orphan .header files. At least, I tried running htcacheclean in single run mode, thus:

htcacheclean -t -p/var/cache/www -l100M

Then I ran my prune_cache_headers Perl script, and it still found a bunch of orphaned .header files to delete. So the patch doesn't appear to have fixed the issue. I did confirm that the patch was applied.
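
For readers who want to check their own cache, here is a minimal sketch of an orphan check. It is written in Python and is not the prune_cache_headers script itself. It rests on two assumptions about the on-disk layout that should be verified against your httpd version: that an ordinary entry has a .data file next to its .header file, and that a Vary entry has a directory named after the .header file with ".vary" appended. The function name find_orphan_headers is hypothetical, and the sketch only lists candidates, it does not delete anything.

```python
import os


def find_orphan_headers(cache_root):
    """List .header files with neither a sibling .data file nor a .vary dir.

    Assumed layout (verify before relying on it):
      foo.header + foo.data      -> ordinary cached entry
      foo.header + foo.header.vary/ -> entry with Vary variants
      foo.header alone           -> treated here as an orphan
    """
    orphans = []
    for dirpath, _dirnames, filenames in os.walk(cache_root):
        for name in filenames:
            if not name.endswith(".header"):
                continue
            header = os.path.join(dirpath, name)
            data = header[:-len(".header")] + ".data"
            vary_dir = header + ".vary"
            if not os.path.exists(data) and not os.path.isdir(vary_dir):
                orphans.append(header)
    return sorted(orphans)


if __name__ == "__main__":
    for path in find_orphan_headers("/var/cache/www"):
        print(path)
```

Printing the list first and deleting in a separate step makes it easy to spot-check a few entries by hand before anything is removed.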

4. Will I be causing any potential problems for Apache by my deleting
the leftover .header files myself (ones which have no corresponding
.vary subdir)? Does that cause apache or htcacheclean to have potential
issues if you do this while they are running? If they are junk then I
can't see it being a problem, but it's unclear currently if they are
actually used or not.

IMHO not. The patch above does the same.

Great, thanks - good to know.

Thanks for your help!


