On Sat 06 Sep 2003, Andrew M. Bishop wrote:
> 
> When I tried squid (a long time ago, it might have changed now) there
> were no directories per host, instead there was an enormous set of
> pre-created directories.  These directories were then filled up at
> random as new files were cached so that there are equal probabilities
> of a new file going into any directory.

When I do configure squid (usually for sites that have a fast broadband
connection), I change the defaults to 32 x 32 directories instead of the
default 16 x 256 (or whatever it is); that makes it all a little bit
more acceptable :-)
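For reference, that layout is set with squid's cache_dir directive, where
the last two numbers are the first- and second-level directory counts.
The path and cache size below are just placeholders, not a recommendation:

```
# cache_dir <type> <path> <Mbytes> <L1> <L2>
# squid's classic default is 16 first-level x 256 second-level dirs;
# this changes it to 32 x 32 as described above.
cache_dir ufs /var/spool/squid 10000 32 32
```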

> Like you say the WWWOFFLE cache is different because there is a
> directory for each site that contains all the files for that site.

Personally, I prefer WWWOFFLE's way.

> The problem with the change that you are suggesting is that there is a
> wide variation in the number of files that are stored in any
> directory.

You could of course contemplate making x subdirectories below each
host's directory, but whether that's actually worth the effort...
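If one did go that route, the usual trick is to hash the cache filename
into one of x buckets, squid-style. A minimal sketch of that idea in
Python; this is purely illustrative and not WWWOFFLE's actual on-disk
scheme, and the bucket count of 32 is just the example from above:

```python
import hashlib

def subdir_for(filename, x=32):
    """Pick one of x subdirectories for a cache file by hashing
    its name, so files spread roughly evenly across buckets.
    Illustration only -- not how WWWOFFLE actually stores files."""
    h = hashlib.md5(filename.encode("utf-8")).digest()
    return "%02d" % (h[0] % x)
```

The same filename always lands in the same bucket, so lookups stay a
single hash away rather than a directory scan.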


> For example, I currently have 808 directories for different hosts that

For the statistics: I have 784 dirs, with 765 having less than 256
files. Those that have more are online photo sites (those thumbnails
add up), images.google.com for example, and an ads site (that of
course is a candidate for the "DontFetch" section :-). And
www.hyperpro.com (I was looking for a replacement shock absorber for my
motorcycle; apparently that site uses way too many distinct images in
its layout).

I have two directories with zero files (?), 210 (26.7%) with one URL,
509 (64.9%) with five or less, 584 (74.4%).  The distribution seems
pretty constant :-)

On reasonable hardware these days, a couple of thousand files in a
directory doesn't pose any real problem (at least, using a filesystem
designed in the last ten years or so). When I do
time find /var/cache/wwwoffle/http -name bla
(that's almost 40000 files altogether) I get:
real    0m0.169s
user    0m0.040s
sys     0m0.050s
Note that this is while there's an updatedb going on in the background
at the same time (which starts at 00:15) so the disks are being hammered
and the spooldir wasn't cached.

This is on a Linux ext3 filesystem. Reiserfs should give pretty constant
performance regardless of the number of files in a single directory;
that was one of the design objectives of reiserfs.

If you're really seeing a slowdown due to the number of files in one
directory, I recommend you think your configuration through a bit...
Splitting the cache up into multiple (sub)directories probably will not
help as much as you'd expect. Once the directory for one host is cached
(as will happen when accessing the page of one host) further accesses to
the cache files should be extremely fast.

In short, you need to supply more concrete data about your perceived
problem...


Paul Slootman
