On Sat 06 Sep 2003, Andrew M. Bishop wrote:
>
> When I tried squid (a long time ago, it might have changed now) there
> were no directories per host, instead there was an enormous set of
> pre-created directories. These directories were then filled up at
> random as new files were cached so that there are equal probabilities
> of a new file going into any directory.
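As an aside, the placement scheme described above can be sketched roughly like this: hash the URL and map the hash onto a fixed first-level x second-level grid of pre-created directories (16 x 256 here). This is just an illustration of the idea, not squid's actual hash function; the URL and grid sizes are made up for the example.

```shell
#!/bin/sh
# Sketch: map a URL onto one of 16 x 256 pre-created cache directories.
# NOT squid's real algorithm -- cksum stands in for its internal hash.
url="http://www.example.com/index.html"
h=$(printf '%s' "$url" | cksum | cut -d' ' -f1)   # CRC of the URL
d1=$(( h % 16 ))          # first-level directory, 0..15
d2=$(( (h / 16) % 256 ))  # second-level directory, 0..255
printf '%02X/%02X\n' "$d1" "$d2"                  # e.g. squid-style 0A/3F
```

The point being that files land in directories with (roughly) equal probability, with no relation to the host they came from.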
When I configure squid (usually for sites that have a fast broadband
connection), I change the defaults to 32 x 32 directories instead of
the default 16 x 256 (or whatever it is); that makes it all a little
bit more acceptable :-)

> Like you say the WWWOFFLE cache is different because there is a
> directory for each site that contains all the files for that site.

Personally, I prefer WWWOFFLE's way.

> The problem with the change that you are suggesting is that there is a
> wide variation in the number of files that are stored in any
> directory.

You could of course contemplate making x subdirectories below each
host's directory, but whether that's actually worth the effort...

> For example, I currently have 808 directories for different hosts that

For the statistics: I have 784 dirs, with 765 having fewer than 256
files. Those that have more have been online photo sites (them
thumbnails add up), images.google.com for example, and an ads site
(which of course is a candidate for the "DontFetch" section :-). And
www.hyperpro.com (I was looking for a replacement shock absorber for my
motorcycle; apparently that site uses way too many distinct images in
its layout).

I have two directories with zero files (?), 210 (26.7%) with one URL,
509 (64.9%) with five or less, 584 (74.4%). The distribution seems
pretty constant :-)

On reasonable hardware these days, a couple of thousand files in a
directory doesn't pose any real problem (at least, using a filesystem
designed in the last ten years or so). When I do

    time find /var/cache/wwwoffle/http -name bla

(that's almost 40000 files altogether) I get:

    real    0m0.169s
    user    0m0.040s
    sys     0m0.050s

Note that this was while an updatedb was going on in the background at
the same time (it starts at 00:15), so the disks were being hammered
and the spool dir wasn't cached. This is on a Linux ext3 filesystem.
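For anyone who wants to gather the same per-host statistics from their own spool, something like the following does the trick. The spool path is the usual WWWOFFLE default; adjust it to your setup.

```shell
#!/bin/sh
# Count cache files per host directory and show the fullest directories.
# /var/cache/wwwoffle/http is the default spool location; adjust as needed.
spool=/var/cache/wwwoffle/http
for d in "$spool"/*/; do
    n=$(find "$d" -maxdepth 1 -type f | wc -l)
    printf '%5d %s\n' "$n" "$d"
done | sort -rn | head
```

Piping the output through sort/uniq instead of head gives the full distribution (how many directories hold 0, 1, 5, ... files).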
Reiserfs should give pretty constant performance regardless of the
number of files in a single directory; that was one of the objectives
of reiserfs.

If you're really seeing a slowdown due to the number of files in one
directory, I recommend you think your configuration through a bit.
Splitting the cache up into multiple (sub)directories probably will not
help as much as you'd expect: once the directory for one host is cached
(as will happen when accessing a page from that host), further accesses
to the cache files should be extremely fast.

In short, you need to supply more concrete data about your perceived
problem...

Paul Slootman
