On Sun, 07 Sep 2003, Andrew M. Bishop wrote:

> Andy Rabagliati <[EMAIL PROTECTED]> writes:
> 
> > On Sat, 06 Sep 2003, Andrew M. Bishop wrote:
> > 
> > > Andy Rabagliati <[EMAIL PROTECTED]> writes:
> > > 
> > > > Can you hash these into subdirectories please, like squid and co.?
> > > 
> > > What size directories do you have that cause problems?
> > 
> > Hmmm ..
> > 
> > Out of 2552 site directories,
> > 
> > % cd /var/spool/wwwoffle/http ; for f in * ; do echo -n "$f " ; ls $f | wc -l ; done | sort --key=2 -n | tail -18
> > 
> >     news.google.com    682
> >     www.csmonitor.com   1200
> >     a.coza.com   1720
> >     news.bbc.co.uk   2390
> >     allafrica.com   4608
> 
> [ valid points re: problems with subdirectories ]
> 
> 
> > % ls -l /var/spool/wwwoffle/http | sort --key=5 -n | tail -6 
> > 
> >     drwxrwxr-x    2 apache   uucp        73728 Sep  7 12:09 news.google.com
> >     drwxrwxr-x    2 apache   uucp        77824 Sep  7 12:11 news.bbc.co.uk
> >     drwxrwxr-x    2 apache   uucp        77824 Sep  7 12:19 adsrv.iol.co.za
> >     drwxrwxr-x    2 apache   uucp       139264 Sep  7 12:28 www.mg.co.za
> >     drwxrwxr-x    2 apache   uucp       151552 Sep  7 12:25 allafrica.com
> >     drwxrwxr-x    2 apache   uucp       552960 Sep  7 12:19 www.iol.co.za
> > 
> > This gives the (ext3) directory size, which I believe is a function of the
> > largest number of files that have ever been there - so it is likely that
> > there have been many more files in some directories.
> 
> It does look like there might have been some much bigger directories
> at some point.  It is possible that these huge directory sizes are
> causing part of the slowdown even if there are not a huge number of
> files.  Unless the data in the directory itself is compacted when
> entries are deleted, the kernel needs to search through the whole
> 552 kB to check whether a file exists.  This directory probably only
> needs to be about 12 kB (32 bytes per file) rather than 552 kB.
> 
> You could try reducing the size of the directories by doing:
> 
> mv www.iol.co.za www.iol.co.za.bak
> mkdir www.iol.co.za
> mv www.iol.co.za.bak/* www.iol.co.za
> rmdir www.iol.co.za.bak

Indeed, I am aware of that.

I could even do it with a cronjob .. but it seems annoying to have to do this.
There is no way I am doing it by hand ..
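If it ever did have to be automated, a cron-driven sweep might look roughly like the sketch below. This is only a sketch: the 64 kB threshold is arbitrary, the spool path is the one from the listing above, and wwwoffled should be stopped (or the cache otherwise quiesced) before it runs. Note also that `*` skips dotfiles and that a very full directory can exceed the argument-list limit for mv, and the recreated directory's ownership/permissions would need restoring (apache:uucp in the listings above).

```shell
#!/bin/sh
# Sketch: recreate any cache directory whose directory file has grown past
# a threshold.  ext3 never shrinks a directory in place, so the
# mv/mkdir/mv/rmdir dance forces a fresh, minimally sized directory.
SPOOL=${SPOOL:-/var/spool/wwwoffle/http}
LIMIT=${LIMIT:-65536}    # arbitrary 64 kB threshold

compact_dir () {
    # Caveats: '*' does not match dotfiles, and mv may hit the argv
    # limit on a huge directory; ownership of the new dir is the caller's.
    mv "$1" "$1.bak" &&
    mkdir "$1" &&
    mv "$1.bak"/* "$1"/ &&
    rmdir "$1.bak"
}

for d in "$SPOOL"/*; do
    [ -d "$d" ] || continue
    # Field 5 of 'ls -ld' is the size of the directory file itself
    size=$(ls -ld "$d" | awk '{print $5}')
    if [ "$size" -gt "$LIMIT" ]; then
        compact_dir "$d"
    fi
done
```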

> > What we like to do, however, is to do a recursive wget of an entire site
> > (like www.enature.com) over a week or so (in pieces, overnight) and then
> > direct the class to that site.
> > 
> > I have long expiry times, and some complete sites.
> 
> This sort of usage of WWWOFFLE will tend to bring out the worst features.

Indeed ...
 
> > > What are you doing when you notice that there is the time delay (is it
> > > creating the host index, or opening a URL from the host or something
> > > else)?
> > 
> > Opening a URL from a browser. I hide all the index creation stuff from
> > the casual user - and run the important ones from a cronjob in the
> > morning and present it as a static page.
> 
> I think it could be the huge directory entries that are causing the
> problem.  I don't know for sure, but I would imagine that it is the
> size of the directory rather than the number of files that would be
> the key feature.
> 
> 
> > > What filesystem are you using, reiserfs has all sorts of features to
> > > speed up directory accesses, perhaps this would help?
> > 
> > I use ext3.
> > 
> > My /var partition is separate - for Maildir mail and wwwoffle, so it
> > would be possible to use reiserfs for this.
> 
> I have another suggestion as well that I just found by searching my
> WWWOFFLE cache.  The latest Linux kernels have better ext3 directory
> indexing, as described at http://lwn.net/Articles/39901/
> 
> : EXT3.
> : ~~~~~
> : - The ext3 filesystem has gained indexed directory support, which offers
> :   considerable performance gains when used on filesystems with directories
> :   containing large numbers of files.
> : - In order to use the htree feature, you need at least version 1.32 of
> :   e2fsprogs.
> : - Existing filesystems can be converted using the command
> : 
> :     tune2fs -O dir_index /dev/hdXXX
> : 
> : - The latest e2fsprogs can be found at
> :   http://prdownloads.sourceforge.net/e2fsprogs

Red Hat does not have this in their updates yet - they are at 1.27 for RH 8.0.

I will get a new version, and try it out.
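For what it's worth, the conversion would presumably go something like the commands below - a sketch only, with /dev/hda5 as a placeholder for the /var partition's device. One detail the LWN excerpt doesn't spell out: `tune2fs -O dir_index` only affects directories created afterwards; rebuilding the existing ones takes an off-line e2fsck pass with `-D` (optimise directories), assuming the new e2fsprogs' e2fsck supports it.

```shell
# Check the installed e2fsprogs version first -- 1.32 or later is needed
rpm -q e2fsprogs

# Turn on hashed (htree) directory indexes; /dev/hda5 is a placeholder
tune2fs -O dir_index /dev/hda5

# Only newly created directories get indexed from here on.  To rebuild
# the existing ones, run a forced fsck with directory optimisation
# while the filesystem is unmounted:
umount /var
e2fsck -fD /dev/hda5
mount /var
```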

Someone else mentioned using db4 to hold URLs, and presumably paths could
also be held there - allowing a wwwoffle purge sweep or something to flag
selected directories as candidates for subdir hashing.

However, the reason I like wwwoffle is that I can tar up the cache in one
place, untar it on top of another, and it 'just works'. But presumably some
simple DB operation could be done to synchronise the DBs.

I also seem to remember some special mount options that were used for
NNTP newsspools - aha -

  noatime
       Do not update inode access times on this file system (e.g., for
       faster access on the news spool to speed up news servers).

Maybe this would improve matters - I use IDE disk drives. Would this
cause problems?
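Concretely, it would just be an fstab change for the separate /var partition - the device name below is a guess, and it's worth checking first that nothing (WWWOFFLE's purge included) relies on access times:

```shell
# /etc/fstab -- hypothetical entry; /dev/hda5 is a placeholder device
/dev/hda5   /var   ext3   defaults,noatime   1 2
```

It can be applied without a reboot via `mount -o remount,noatime /var`.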

Thanks for all your help, and doing some of my homework for me,
and your very useful web cache ..

Cheers,     Andy!
