I am using Apache 2.2.9 on Linux AMD64, built from source. There is one server running two builds of Apache - a lightweight front-end caching reverse proxy configuration using mod_disk_cache, and a heavyweight mod_perl back end. I use caching to relieve load on the server when many people request the same page at once. The website is dynamic and contains millions of page permutations. Thus the cache has a tendency to get fairly large, unless it is pruned. So I have been trying to use htcacheclean to achieve this. There have been some issues, which I will outline below.

First, I found that htcacheclean was not able to keep up with pruning the cache. It just kept growing. I initially ran htcacheclean in daemon mode, thus:

htcacheclean -i -t -n -d60 -p/var/cache/www -l1000M

CacheDirLevels was 3 and CacheDirLength 1.

The cache would just keep getting bigger, to multiple GB. Additionally, even doing a du on the cache could take hours to complete.

I also noticed that iowait would spike when I tried running htcacheclean in non-daemon mode. It would not keep up at all using the -n ("nice") option; when I took that off, the iowait would go through the roof and the process would take hours to complete. This was on a quad core AMD64 server with 4 x 10k SCSI drives in hardware RAID0.

Upon investigation, I discovered that the cache was a lot deeper than I expected. In addition to the three levels specified in CacheDirLevels, there were then additional levels of subdirectories beneath ".vary" subdirs. For each .header file, there was a .vary subdir with three levels of directory below that. Simply traversing this tree with du could take a long time - hours sometimes, depending on how long the server had been running without a cache clear.

I discovered that the .vary subdirs were caused by my configuration, which was introducing a Vary HTTP header. This came from two sources. The first was mod_deflate, which I found out from this helpful page:

http://www.digitalsanctuary.com/tech-blog/general/apache-mod_deflate-and-mod_cache-issues.html

So I disabled mod_deflate, since it seemed to be producing a huge number of cache entries for each file - a different one for every browser. But after disabling mod_deflate, the .vary subdirs were still there. I also had this line in my config:

Header add Vary "Cookie"

This is necessary because users on my site set options for how the site is displayed. When I tried disabling this cookie Vary header, the number of directories went down substantially, to the expected three levels. The cache structure was much simpler, and it seemed that htcacheclean could keep up with it. However, the site was broken, since the same page would be cached only once regardless of each user's options. So someone who had "no ads" or "no pics" set would request a page that someone else had recently requested with different options, and would get that other person's version. Not good. So I had to switch the Vary header for cookies back on, so that pages would be differentiated in the cache by cookie. But now I was back to square one: six effective levels of subdirectory, which htcacheclean could not keep up with.

After some thought, I ended up changing CacheDirLevels to 2, to try to reduce the depth of the tree. Now I had fewer subdirs, but more files in each one.

Also, the size of the cache as reported by du always seems to be much higher than the limit given to htcacheclean. I lowered the limit to 100M, but the cache is still regularly at 180MB or 200MB. This seems counter-intuitive: htcacheclean doesn't appear to take the true size of the cache into account (i.e. including all the subdirs, which also take up space and presumably cause the discrepancy).
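One way to check the guess about the discrepancy is to walk the cache and compare two totals: the sum of the file sizes, and the disk space actually allocated to files and directories together, which is roughly what du reports. The following is only a minimal Python sketch. The function name cache_usage is made up for illustration, and whether htcacheclean's -l limit is compared against something closer to the first total is an assumption on my part, not something I have confirmed.

```python
import os

def cache_usage(root):
    """Return (file_bytes, allocated_bytes, dir_count) for the tree at root.

    file_bytes      sum of apparent file sizes (st_size)
    allocated_bytes blocks allocated to files AND directories, which is
                    roughly what du reports (st_blocks is in 512-byte units)
    dir_count       number of directories, including root itself
    """
    file_bytes = 0
    allocated = os.lstat(root).st_blocks * 512
    dir_count = 1
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # entry vanished while walking a live cache
            allocated += st.st_blocks * 512
            dir_count += 1
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue
            file_bytes += st.st_size
            allocated += st.st_blocks * 512
    return file_bytes, allocated, dir_count
```

Calling cache_usage("/var/cache/www") and comparing the first two numbers would show how much of the du figure comes from directories and block rounding rather than from cached content.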

I also noticed something else: htcacheclean was leaving behind .header files. When it cleaned the .vary subdirs, it seemed to leave behind the corresponding .header files. These would accumulate, causing the iowait to gradually increase, presumably due to the size of the directories. I would rotate (clear) the cache manually at midnight. The behavior I would see (via the munin monitoring tool) was that iowait would remain at zero for about 12 hours, and then gradually become visible as the .header files accumulated.

So I wrote a Perl script which goes through the cache looking for .header files and, for each one found, checks whether a corresponding .vary subdir exists. If not, the .header file is deleted. I then run another script to prune empty subdirectories. Currently I run this combination every 10 minutes: first a non-daemon invocation of htcacheclean, followed by the header prune script, followed by the empty subdirs pruning script. This seems to keep the cache small, and iowait is not noticeable any more, since the "junk" .header files are now disposed of regularly.
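For anyone who wants to try the same approach, here is a rough Python sketch of the two pruning steps. It is not the actual Perl script, and it makes several assumptions that should be checked against a real cache first: that the vary directory for "X.header" is named "X.header.vary", that a .header with a .data file beside it is a live entry and must be left alone (this protects the real entries at the leaves of the .vary trees), and that skipping files younger than min_age seconds is enough to avoid racing with Apache while it writes a new entry. All function and parameter names are made up for illustration, and dry_run defaults to True so nothing is deleted by accident.

```python
import os
import time

HEADER_SUFFIX = ".header"
DATA_SUFFIX = ".data"
VARY_SUFFIX = ".vary"  # assumption: vary dir is "<name>.header.vary"

def find_orphan_headers(root):
    """List .header files with neither a .vary dir nor a .data file."""
    orphans = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(HEADER_SUFFIX):
                continue
            path = os.path.join(dirpath, name)
            data = path[:-len(HEADER_SUFFIX)] + DATA_SUFFIX
            if os.path.isdir(path + VARY_SUFFIX) or os.path.exists(data):
                continue  # still in use: has variants or a cached body
            orphans.append(path)
    return orphans

def prune_orphan_headers(root, min_age=60, dry_run=True):
    """Delete (or, in a dry run, just report) old orphaned .header files."""
    now = time.time()
    pruned = []
    for path in find_orphan_headers(root):
        try:
            if now - os.lstat(path).st_mtime < min_age:
                continue  # too fresh, Apache may still be writing it
            if not dry_run:
                os.remove(path)
        except OSError:
            continue  # already removed by htcacheclean or Apache
        pruned.append(path)
    return pruned

def prune_empty_dirs(root):
    """Remove empty directories below root, deepest first."""
    removed = 0
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        if dirpath == root:
            continue
        try:
            os.rmdir(dirpath)  # fails harmlessly if not empty
            removed += 1
        except OSError:
            pass
    return removed
```

Run after a non-daemon htcacheclean pass, prune_orphan_headers("/var/cache/www", dry_run=False) followed by prune_empty_dirs("/var/cache/www") would correspond to the second and third steps of the 10-minute job described above.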

However, I'm not sure why I need to run this kind of hacked up bespoke version of cache management, when htcacheclean should surely be capable of doing the job itself.

All of this brings up a few questions:

1. Why does mod_disk_cache generate six levels of subdirectory when CacheDirLevels is clearly set to 3? I realize what it's trying to do (each page might have many variations, and those variations must be differentiated by subdir), but the additional levels cause an exponential increase in the number of directories that must be traversed. It seems absurd when this causes trouble for a relatively well-specced server. Since starting this investigation, I have moved to a completely new server, a 4 core Xeon 2.33GHz, with 8 x 10k Raptor SATA drives in hardware RAID10 configuration. The performance is excellent, but when I tried using mod_disk_cache with CacheDirLevels at 3 and cookie Vary headers on, it still could not keep up with pruning. Even simply traversing this kind of structure with du is clearly not scalable. Could we not have the three main levels of directory, but then have a different setting for the number of subdirs below the .vary dirs? Usually there is just one file at the leaf of the .vary subdirs, so having three additional levels seems like overkill. We should be able to tune the subdir levels to keep the cache as shallow as makes sense.

2. Why does htcacheclean not keep the cache at the stated size limit? If you say -l100M and then do a du and it says 200M, then that is counter-intuitive, and actually wrong in real terms. It gets worse with the larger caches - when I had 3 levels and cookie Vary headers on, the limit for htcacheclean was 1000M, but the cache would grow to 3GB and up.

3. Why are .header files left over by htcacheclean when it has deleted the .vary subdirectory? Is this something like a memory leak, but with files? I would have thought that if the cached content (.data) file has gone away, there is no reason to keep the .header file around. It clogs up the cache directory and makes traversing the tree more work. If it's kept for 304 "unchanged" responses then I can understand that, but then why do these files still seem to pile up even after the related page would have clearly expired anyway? Surely it would be better to just delete them when the .vary subdir is deleted. In any case, I didn't notice the .header files being left over when the Vary header was disabled, so I think this might be a straightforward "leak" when using Vary.

4. Will I be causing any potential problems for Apache by deleting the leftover .header files myself (the ones which have no corresponding .vary subdir)? Could doing this while Apache or htcacheclean are running cause issues for either of them? If the files are junk then I can't see it being a problem, but it's currently unclear whether they are actually used or not.
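To put rough numbers on the directory growth raised in question 1, here is a small back-of-the-envelope calculation in Python. It assumes that each directory name character is drawn from a 64-character set (so CacheDirLength 1 gives at most 64 subdirs per level), and that each cached URL with a Vary header gets one .vary dir plus CacheDirLevels nested directories per variant, with no sharing between variants. Both are assumptions based on what I see on disk, and the figure of one million URLs is only illustrative.

```python
def max_hash_dirs(levels, length, charset=64):
    """Upper bound on directories in the main CacheDirLevels tree."""
    per_level = charset ** length  # possible names at each level
    return sum(per_level ** depth for depth in range(1, levels + 1))

def vary_dirs(urls, variants, levels):
    """Extra directories created below the .vary dirs (upper bound).

    Each URL gets one .vary directory, and each variant gets its own
    chain of `levels` nested directories underneath it.
    """
    return urls * (1 + variants * levels)

print(max_hash_dirs(3, 1))         # 266304
print(max_hash_dirs(2, 1))         # 4160
print(vary_dirs(1_000_000, 1, 3))  # 4000000
```

If these assumptions hold, the main tree is capped at a few hundred thousand directories, while the directories under .vary grow with the number of cached pages, which would explain why traversal time keeps climbing as the cache fills.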

I wasn't sure if I should post this on the dev list, since it seems to be more directed at the developers than other users. But the list guidelines said that "Configuration and support questions should be addressed to a user support group", and this seems to be that, so I'll post it here first.

Thanks for any insights or feedback.

Neil
