Andy Rabagliati <[EMAIL PROTECTED]> writes:

> On Sat, 06 Sep 2003, Andrew M. Bishop wrote:
> 
> > Andy Rabagliati <[EMAIL PROTECTED]> writes:
> > 
> > > At the moment a site's files, and URL unhash, are kept in a flat
> > > directory /var/spool/wwwoffle/http/www.domain.com/*.
> > > 
> > > These directories can get really big, and can take a significant
> > > time to open.
> > > 
> > > Can you hash these into subdirectories please, like squid and co.?
> > 
> > For example, I currently have 808 directories for different hosts that
> > I have cached files from.  Of this total there are 195 directories
> > that have only a single URL stored in them (24.4%).  Directories with
> > 5 URLs or fewer make up 47.3% of them, 10 URLs or fewer is 59.8% and
> > 20 URLs or fewer is 75.5%.  I don't know what you would consider as a
> > large directory that would be slow, but I would guess 256 files is OK.
> > This would be 128 URLs which is 97.4% of the directories.  This means
> > that there are fewer than 3% of the directories that would benefit
> > from this change.

> > What size directories do you have that cause problems?
> 
> Hmmm ..
> 
> Out of 2552 site directories,
> 
> % cd /var/spool/wwwoffle/http ; for f in * ; do echo -n $f ; ls $f | wc -l ; done | 
> sort --key=2 | tail -18
> 
>       us.news1.yimg.com    198
>       slashdot.org    220
>       www.microsoft.com    220
>       www.sfgate.com    222
>       www.google.com    236
>       www.alsangels.com    288
>       images.slashdot.org    336
>       images-aud.slashdot.org    366
>       www.iol.co.za    374
>       ad.za.doubleclick.net    388
>       ar.atwola.com    498
>       adsrv.iol.co.za    514
>       ads.osdn.com    518
>       news.google.com    682
>       www.csmonitor.com   1200
>       a.coza.com   1720
>       news.bbc.co.uk   2390
>       allafrica.com   4608

You seem to have about the same spread of directory sizes as I and
others do (noting that you counted files rather than URLs as I did).
There are only a few directories with many files that would benefit
from the change to multiple sub-directories.

Typically the number of sub-directories that you would want to create
would be equal to the square root of the number of files for the host
(so that the number of sub-directories is about equal to the number of
files in each).  With such a wide variation in the number of files per
directory it would be difficult to choose a single number that works
well everywhere.
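As a sketch of that square-root heuristic (the function name is
illustrative, and awk is assumed to be available):

```shell
# suggested sub-directory count ~= round(sqrt(number of files))
suggest_subdirs() {
  awk -v n="$1" 'BEGIN { printf "%d\n", sqrt(n) + 0.5 }'
}
suggest_subdirs 4608   # allafrica.com's count above -> prints 68
```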

If a fixed number of sub-directories is created every time that a new
directory is created for a host, then the hosts with few URLs would
lose out.  Finding a URL would then require two levels of directory
search: one to find the sub-directory and one to find the file in it.

Also, every time that you examine the contents of a directory you
change its access time, which requires a write to the disk.  With
sub-directories for each host you would need more writes to the disk
for each URL that you look up, twice as many in fact.

You could create a sub-directory only when you need to write a URL
into one that doesn't exist yet.  That avoids the problem of lots of
empty directories, but it adds extra complication.

You could also create the sub-directories only when the host directory
reaches a critical size.  But then you make life worse, since you need
to handle both the case where there are sub-directories and the case,
like now, where there are not.
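For illustration, a fixed-fanout scheme like squid's could be sketched
as below.  The function name, the fanout of 16, and the use of cksum
are all assumptions for the sketch, not WWWOFFLE's (or squid's) actual
layout:

```shell
# map a cache filename to one of 16 sub-directories (0..15) using
# its CRC, so files spread evenly across a fixed set of sub-dirs
subdir_for() {
  printf '%s' "$1" | cksum | awk '{ printf "%d\n", $1 % 16 }'
}
subdir_for "Uexample"
```

The mapping is deterministic, so lookups and writes for the same
filename always land in the same sub-directory.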


> % ls -l /var/spool/wwwoffle/http | sort --key=5 -n | tail -6 
> 
>       drwxrwxr-x    2 apache   uucp        73728 Sep  7 12:09 news.google.com
>       drwxrwxr-x    2 apache   uucp        77824 Sep  7 12:11 news.bbc.co.uk
>       drwxrwxr-x    2 apache   uucp        77824 Sep  7 12:19 adsrv.iol.co.za
>       drwxrwxr-x    2 apache   uucp       139264 Sep  7 12:28 www.mg.co.za
>       drwxrwxr-x    2 apache   uucp       151552 Sep  7 12:25 allafrica.com
>       drwxrwxr-x    2 apache   uucp       552960 Sep  7 12:19 www.iol.co.za
> 
> This gives the (ext3) directory size, which I believe is a function of the
> largest number of files that have ever been there - so it is likely that
> there have been many more files in some directories.

It does look like there might have been some much bigger directories
at some point.  It is possible that these huge directory sizes are
causing part of the slowdown even if there are not a huge number of
files.  Unless the data in the directory itself is compacted when
entries are deleted, the filesystem will need to search through the
whole 552 kB to check whether a file exists.  At roughly 32 bytes per
entry, this directory probably only needs to be about 12 kB rather
than 552 kB.
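The estimate can be sanity-checked with a little shell arithmetic (32
bytes per entry is an approximation for short ext3 filenames):

```shell
bytes_per_entry=32
# approximate number of entries the 552 kB directory once held
echo $(( 552960 / bytes_per_entry ))   # prints 17280
# approximate size needed for the 374 files currently present
echo $(( 374 * bytes_per_entry ))      # prints 11968, i.e. ~12 kB
```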

You could try reducing the size of the directories by doing:

# run as the cache owner (apache here) so ownership is preserved,
# ideally with wwwoffled stopped so no files change in between
mv www.iol.co.za www.iol.co.za.bak
mkdir www.iol.co.za
mv www.iol.co.za.bak/* www.iol.co.za
rmdir www.iol.co.za.bak



> What we like to do, however, is to do a recursive wget of an entire site
> (like www.enature.com) over a week or so (in pieces, overnight) and then
> direct the class to that site.
> 
> I have long expiry times, and some complete sites.

This sort of usage of WWWOFFLE will tend to bring out its worst features.


> > What are you doing when you notice that there is the time delay (is it
> > creating the host index, or opening a URL from the host or something
> > else)?
> 
> Opening a URL from a browser. I hide all the index creation stuff from
> the casual user - and run the important ones from a cronjob in the
> morning and present it as a static page.

I think it could be the huge directory entries that are causing the
problem.  I don't know for sure, but I would imagine that it is the
size of the directory, rather than the number of files, that is the
key factor.


> > What filesystem are you using, reiserfs has all sorts of features to
> > speed up directory accesses, perhaps this would help?
> 
> I use ext3.
> 
> My /var partition is separate - for Maildir mail and wwwoffle, so it
> would be possible to use reiserfs for this.

I have another suggestion as well that I just found by searching my
WWWOFFLE cache.  The latest Linux kernels have better ext3 directory
indexing, as described at http://lwn.net/Articles/39901/

: EXT3.
: ~~~~~
: - The ext3 filesystem has gained indexed directory support, which offers
:   considerable performance gains when used on filesystems with directories
:   containing large numbers of files.
: - In order to use the htree feature, you need at least version 1.32 of
:   e2fsprogs.
: - Existing filesystems can be converted using the command
: 
:     tune2fs -O dir_index /dev/hdXXX
: 
: - The latest e2fsprogs can be found at
:   http://prdownloads.sourceforge.net/e2fsprogs
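Before converting, you can check whether the feature is already
enabled.  A small sketch (the helper function is illustrative; in
practice you would feed it the "Filesystem features" line printed by
`tune2fs -l` on your cache partition, which needs root):

```shell
# report whether a tune2fs feature list already contains dir_index
has_dir_index() {
  case " $1 " in
    *" dir_index "*) echo yes ;;
    *)               echo no  ;;
  esac
}
has_dir_index "has_journal dir_index filetype"   # prints yes
```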

-- 
Andrew.
----------------------------------------------------------------------
Andrew M. Bishop                             [EMAIL PROTECTED]
                                      http://www.gedanken.demon.co.uk/

WWWOFFLE users page:
        http://www.gedanken.demon.co.uk/wwwoffle/version-2.7/user.html
