Andy Rabagliati <[EMAIL PROTECTED]> writes:

> On Sat, 06 Sep 2003, Andrew M. Bishop wrote:
>
> > Andy Rabagliati <[EMAIL PROTECTED]> writes:
> >
> > > At the moment a site's files, and URL unhash, are kept in a flat
> > > directory /var/spool/wwwoffle/http/www.domain.com/*.
> > >
> > > These directories can get really big, and can take a significant
> > > time to open.
> > >
> > > Can you hash these into subdirectories please, like squid and co.?
> >
> > For example, I currently have 808 directories for different hosts
> > that I have cached files from.  Of this total there are 195
> > directories that have only a single URL stored in them (24.4%).
> > Directories with 5 URLs or fewer make up 47.3% of them, 10 URLs or
> > fewer is 59.8% and 20 URLs or fewer is 75.5%.  I don't know what
> > you would consider a large directory that would be slow, but I
> > would guess 256 files is OK.  This would be 128 URLs, which is
> > 97.4% of the directories.  This means that there are fewer than 3%
> > of the directories that would benefit from this change.
> >
> > What size directories do you have that cause problems?
>
> Hmmm ..
>
> Out of 2552 site directories,
>
> % cd /var/spool/wwwoffle/http ; for f in * ; do echo -n $f ; ls $f | wc -l ; done |
>   sort --key=2 | tail -18
>
> us.news1.yimg.com         198
> slashdot.org              220
> www.microsoft.com         220
> www.sfgate.com            222
> www.google.com            236
> www.alsangels.com         288
> images.slashdot.org       336
> images-aud.slashdot.org   366
> www.iol.co.za             374
> ad.za.doubleclick.net     388
> ar.atwola.com             498
> adsrv.iol.co.za           514
> ads.osdn.com              518
> news.google.com           682
> www.csmonitor.com        1200
> a.coza.com               1720
> news.bbc.co.uk           2390
> allafrica.com            4608

You seem to have about the same spread of directory sizes as I did and
as others have reported (noting that you counted files, whereas I
counted URLs).  There are only a few directories with many files that
would benefit from the change to multiple sub-directories.

Typically the number of sub-directories that you would want to create
is the square root of the number of files for the host (so that the
number of sub-directories is about equal to the number of files in
each one).  With such a wide variation in the number of files per
directory it would be difficult to choose one number that works well
everywhere.  If a fixed number of sub-directories is created every
time that a new directory is created for a host then the hosts with
few URLs would lose out.  Finding a URL would then require two levels
of directory search: one to find the sub-directory and one to find the
file within it.

Also, every time that you examine the contents of a directory you
change its access time, which means a write to the disk.  With
sub-directories for each host you will need to make more writes to the
disk for each URL that you look up; twice as many, in fact.

You could create a sub-directory only when you need to write a URL
into one that doesn't exist yet.  That avoids the problem of lots of
empty directories, but it adds extra complication.

You could create the sub-directories only when the host directory
reaches a critical size.  But then you make life worse, since you need
to handle both the case where there are sub-directories and the case,
like now, where there are not.
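To make the two-level scheme concrete, here is roughly how a
squid-style fixed fan-out could look, sketched in shell.  This is
purely illustrative: the 16 buckets, the use of md5sum and the path
layout are my assumptions, not anything WWWOFFLE actually does.

  # Illustrative sketch only: pick one of 16 fixed buckets under the
  # host directory by hashing the whole URL.  Bucket count and hash
  # choice are assumptions, not WWWOFFLE's real naming scheme.
  url="http://www.iol.co.za/news/index.html"
  host=$(echo "$url" | sed -e 's|^[a-z]*://||' -e 's|/.*||')
  bucket=$(echo -n "$url" | md5sum | cut -c1)    # first hex digit, 0-f
  echo "/var/spool/wwwoffle/http/$host/$bucket/"

With a fixed fan-out like this, every host pays for the extra
directory level, which is exactly the drawback described above for
hosts with only a handful of URLs.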
> % ls -l /var/spool/wwwoffle/http | sort --key=5 -n | tail -6
>
> drwxrwxr-x   2 apache  uucp   73728 Sep  7 12:09 news.google.com
> drwxrwxr-x   2 apache  uucp   77824 Sep  7 12:11 news.bbc.co.uk
> drwxrwxr-x   2 apache  uucp   77824 Sep  7 12:19 adsrv.iol.co.za
> drwxrwxr-x   2 apache  uucp  139264 Sep  7 12:28 www.mg.co.za
> drwxrwxr-x   2 apache  uucp  151552 Sep  7 12:25 allafrica.com
> drwxrwxr-x   2 apache  uucp  552960 Sep  7 12:19 www.iol.co.za
>
> This gives the (ext3) directory size, which I believe is a function
> of the largest number of files that have ever been there - so it is
> likely that there have been many more files in some directories.

It does look like there might have been some much bigger directories
at some point.  It is possible that these huge directory sizes are
causing part of the slow-down even if there is no longer a huge number
of files.  Unless the data in the directory itself is compacted when
things are deleted, the kernel must search through the whole 552 kB to
check whether a file exists.  With its 374 files, the www.iol.co.za
directory probably only needs to be about 12 kB (32 bytes per file)
rather than 552 kB.

You could try reducing the size of the directories by doing:

  mv www.iol.co.za www.iol.co.za.bak
  mkdir www.iol.co.za
  mv www.iol.co.za.bak/* www.iol.co.za
  rmdir www.iol.co.za.bak
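If many directories are oversized you can apply the same trick in a
loop.  An untested sketch follows; the 64 kB threshold is an arbitrary
assumption of mine, and you should stop wwwoffled first so that
nothing writes to the cache while a directory is being rebuilt.

  # Rebuild every host directory whose own directory entry is over
  # 64 kB (an arbitrary threshold).  Run this with wwwoffled stopped.
  cd /var/spool/wwwoffle/http
  for d in */ ; do
      d=${d%/}
      size=$(ls -ld "$d" | awk '{print $5}')   # directory size in bytes
      if [ "$size" -gt 65536 ] ; then
          mv "$d" "$d.bak" &&
          mkdir "$d" &&
          mv "$d.bak"/* "$d" &&
          rmdir "$d.bak"
      fi
  done

Note that the recreated directories will be owned by whoever runs the
loop, so you may need to chown them back to apache:uucp afterwards.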
> What we like to do, however, is to do a recursive wget of an entire
> site (like www.enature.com) over a week or so (in pieces, overnight)
> and then direct the class to that site.
>
> I have long expiry times, and some complete sites.

This sort of usage of WWWOFFLE will tend to bring out its worst
features.

> > What are you doing when you notice that there is the time delay
> > (is it creating the host index, or opening a URL from the host, or
> > something else)?
>
> Opening a URL from a browser.  I hide all the index creation stuff
> from the casual user - and run the important ones from a cronjob in
> the morning and present it as a static page.

I think it could be the huge directory entries that are causing the
problem.  I don't know for sure, but I would imagine that it is the
size of the directory rather than the number of files that is the key
factor.

> > What filesystem are you using?  reiserfs has all sorts of features
> > to speed up directory accesses; perhaps this would help?
>
> I use ext3.
>
> My /var partition is separate - for Maildir mail and wwwoffle, so it
> would be possible to use reiserfs for this.

I have another suggestion as well that I just found by searching my
WWWOFFLE cache.  The latest Linux kernels have better ext3 directory
handling, as described at http://lwn.net/Articles/39901/ :

: EXT3.
: ~~~~~
: - The ext3 filesystem has gained indexed directory support, which
:   offers considerable performance gains when used on filesystems
:   with directories containing large numbers of files.
: - In order to use the htree feature, you need at least version 1.32
:   of e2fsprogs.
: - Existing filesystems can be converted using the command
:
:       tune2fs -O dir_index /dev/hdXXX
:
: - The latest e2fsprogs can be found at
:   http://prdownloads.sourceforge.net/e2fsprogs
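You can check whether the feature is already enabled on the partition
like this (with /dev/hdXXX standing for your real /var device, as in
the quoted instructions):

  # Look for "dir_index" in the feature list of the filesystem.
  tune2fs -l /dev/hdXXX | grep 'features'

As far as I know, setting the flag does not re-index directories that
already exist; they only gain the htree index when they are recreated
(for example with the mv/mkdir trick above) or after running e2fsck
with the -D option.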
--
Andrew.
----------------------------------------------------------------------
Andrew M. Bishop                      [EMAIL PROTECTED]
                                      http://www.gedanken.demon.co.uk/
WWWOFFLE users page: http://www.gedanken.demon.co.uk/wwwoffle/version-2.7/user.html