On Sun, 07 Sep 2003, Andrew M. Bishop wrote:

> Andy Rabagliati <[EMAIL PROTECTED]> writes:
>
> > On Sat, 06 Sep 2003, Andrew M. Bishop wrote:
> >
> > > Andy Rabagliati <[EMAIL PROTECTED]> writes:
> > >
> > > > Can you hash these into subdirectories please, like squid and co.?
> > >
> > > What size directories do you have that cause problems?
> >
> > Hmmm ..
> >
> > Out of 2552 site directories,
> >
> > % cd /var/spool/wwwoffle/http ; for f in * ; do echo -n $f ; ls $f | wc -l ; done
> >   | sort --key=2 | tail -18
> >
> > news.google.com      682
> > www.csmonitor.com   1200
> > a.coza.com          1720
> > news.bbc.co.uk      2390
> > allafrica.com       4608
> >
> > [ valid points re: problems with subdirectories ]
> >
> > % ls -l /var/spool/wwwoffle/http | sort --key=5 -n | tail -6
> >
> > drwxrwxr-x  2 apache uucp  73728 Sep  7 12:09 news.google.com
> > drwxrwxr-x  2 apache uucp  77824 Sep  7 12:11 news.bbc.co.uk
> > drwxrwxr-x  2 apache uucp  77824 Sep  7 12:19 adsrv.iol.co.za
> > drwxrwxr-x  2 apache uucp 139264 Sep  7 12:28 www.mg.co.za
> > drwxrwxr-x  2 apache uucp 151552 Sep  7 12:25 allafrica.com
> > drwxrwxr-x  2 apache uucp 552960 Sep  7 12:19 www.iol.co.za
> >
> > This gives the (ext3) directory size, which I believe is a function of the
> > largest number of files that have ever been there - so it is likely that
> > there have been many more files in some directories.
>
> It does look like there might have been some much bigger directories
> at some point.  It is possible that these huge directory sizes are
> causing part of the slow down even if there are not a huge number of
> files.  Unless the data in the directory itself is compacted when
> things are deleted it will need to search through the whole 552 kB to
> check if a file exists.  This directory probably only needs to be
> about 12 kB (32 bytes per file) rather than 552 kB.
>
> You could try reducing the size of the directories by doing:
>
>   mv www.iol.co.za www.iol.co.za.bak
>   mkdir www.iol.co.za
>   mv www.iol.co.za.bak/* www.iol.co.za
>   rmdir www.iol.co.za.bak
Indeed, I am aware of that.  I could even do it with a cronjob .. but it
seems annoying to have to do this.  There is no way I am doing it by hand ..

> > What we like to do, however, is to do a recursive wget of an entire site
> > (like www.enature.com) over a week or so (in pieces, overnight) and then
> > direct the class to that site.
> >
> > I have long expiry times, and some complete sites.
>
> This sort of usage of WWWOFFLE will tend to bring out the worst features.

Indeed ...

> > > What are you doing when you notice that there is the time delay (is it
> > > creating the host index, or opening a URL from the host or something
> > > else)?
> >
> > Opening a URL from a browser.  I hide all the index creation stuff from
> > the casual user - and run the important ones from a cronjob in the
> > morning and present it as a static page.
>
> I think it could be the huge directory entries that are causing the
> problem.  I don't know for sure, but I would imagine that it is the
> size of the directory rather than the number of files that would be
> the key feature.
>
> > > What filesystem are you using, reiserfs has all sorts of features to
> > > speed up directory accesses, perhaps this would help?
> >
> > I use ext3.
> >
> > My /var partition is separate - for Maildir mail and wwwoffle, so it
> > would be possible to use reiserfs for this.
>
> I have another suggestion as well that I just found by searching my
> WWWOFFLE cache.  The latest Linux kernels have a better ext3 directory
> as described at http://lwn.net/Articles/39901/
>
> : EXT3.
> : ~~~~~
> : - The ext3 filesystem has gained indexed directory support, which offers
> :   considerable performance gains when used on filesystems with directories
> :   containing large numbers of files.
> : - In order to use the htree feature, you need at least version 1.32 of
> :   e2fsprogs.
> : - Existing filesystems can be converted using the command
> :
> :     tune2fs -O dir_index /dev/hdXXX
> :
> : - The latest e2fsprogs can be found at
> :   http://prdownloads.sourceforge.net/e2fsprogs

Red Hat does not have this in their updates yet - they are at 1.27 for
RH8.0.  I will get a new version and try it out.

Someone else mentioned using db4 to hold URLs, and presumably paths could
also be held there - allowing a wwwoffle purge sweep or something to flag
selected directories as candidates for subdir hashing.

However, the reason I like wwwoffle is that I can tar up the cache in one
place and untar it on top of another, and it 'just works'.  But presumably
some simple DB operation could be done to synchronise the DBs.

I also seem to remember some special mount options that were used for NNTP
news spools - aha:

    noatime
        Do not update inode access times on this file system (e.g., for
        faster access on the news spool to speed up news servers).

Maybe this would improve matters - I use IDE disk drives.  Would this cause
problems?

Thanks for all your help, for doing some of my homework for me, and for
your very useful web cache ..

Cheers, Andy!
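P.S.  In case it is useful to anyone else on the list, this is roughly the
cron job I had in mind for the rebuild trick above.  It is an untested
sketch: it assumes the spool lives in /var/spool/wwwoffle/http, that
wwwoffled is not writing to the cache while it runs, and that 64 kB is a
sensible "this directory file has grown too big" threshold.

    #!/bin/sh
    # Rebuild any wwwoffle host directory whose directory file has grown
    # past 64 kB, so ext3 starts over with a compact directory.
    cd /var/spool/wwwoffle/http || exit 1
    for d in */ ; do
        d=${d%/}
        # size in bytes of the directory file itself (field 5 of ls -ld)
        size=`ls -ld "$d" | awk '{print $5}'`
        [ "$size" -gt 65536 ] || continue
        mv "$d" "${d}.bak" && mkdir "$d" || continue
        # match the ownership and permissions seen in the listing above
        chown apache:uucp "$d" && chmod 775 "$d"
        # for very full directories the glob may hit the argument-length
        # limit, in which case move the files back with find/xargs instead
        mv "${d}.bak"/* "$d"/ && rmdir "${d}.bak"
    done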

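P.P.S.  For completeness, the other two experiments would look something
like this.  /dev/hda7 is only a stand-in for whatever device /var really
lives on here, and the last step assumes the newer e2fsprogs ships an
e2fsck that supports -D; as far as I understand the htree notes, dir_index
on its own only indexes directories created after it is enabled, so the
existing ones need that extra pass.

    # /etc/fstab - add noatime to the /var line (device name is a placeholder)
    /dev/hda7   /var   ext3   defaults,noatime   1 2

    # or switch it on without a reboot
    mount -o remount,noatime /var

    # after upgrading e2fsprogs: turn on htree and check that it stuck
    tune2fs -O dir_index /dev/hda7
    tune2fs -l /dev/hda7 | grep -i features

    # then, with /var unmounted (or from a rescue disk), re-index the
    # existing directories
    e2fsck -fD /dev/hda7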