On Sat, 06 Sep 2003, Andrew M. Bishop wrote:

> Andy Rabagliati <[EMAIL PROTECTED]> writes:
> 
> > At the moment a site's files, and URL unhash, are kept in a flat
> > directory /var/spool/wwwoffle/http/www.domain.com/*.
> > 
> > These directories can get really big, and can take a significant
> > time to open.
> > 
> > Can you hash these into subdirectories please, like squid and co.?
> 
> For example, I currently have 808 directories for different hosts that
> I have cached files from.  Of this total there are 195 directories
> that have only a single URL stored in them (24.4%).  Directories with
> 5 URLs or fewer make up 47.3% of them, 10 URLs or fewer is 59.8% and
> 20 URLs or fewer is 75.5%.  I don't know what you would consider as a
> large directory that would be slow, but I would guess 256 files is OK.
> This would be 128 URLs which is 97.4% of the directories.  This means
> that there are fewer than 3% of the directories that would benefit
> from this change.
> 
> What size directories do you have that cause problems?

Hmmm ..

Out of 2552 site directories,

% cd /var/spool/wwwoffle/http ; for f in * ; do echo -n $f ; ls $f | wc -l ; done |
      sort --key=2 | tail -18

        us.news1.yimg.com    198
        slashdot.org    220
        www.microsoft.com    220
        www.sfgate.com    222
        www.google.com    236
        www.alsangels.com    288
        images.slashdot.org    336
        images-aud.slashdot.org    366
        www.iol.co.za    374
        ad.za.doubleclick.net    388
        ar.atwola.com    498
        adsrv.iol.co.za    514
        ads.osdn.com    518
        news.google.com    682
        www.csmonitor.com   1200
        a.coza.com   1720
        news.bbc.co.uk   2390
        allafrica.com   4608

% ls -l /var/spool/wwwoffle/http | sort --key=5 -n | tail -6 

        drwxrwxr-x    2 apache   uucp        73728 Sep  7 12:09 news.google.com
        drwxrwxr-x    2 apache   uucp        77824 Sep  7 12:11 news.bbc.co.uk
        drwxrwxr-x    2 apache   uucp        77824 Sep  7 12:19 adsrv.iol.co.za
        drwxrwxr-x    2 apache   uucp       139264 Sep  7 12:28 www.mg.co.za
        drwxrwxr-x    2 apache   uucp       151552 Sep  7 12:25 allafrica.com
        drwxrwxr-x    2 apache   uucp       552960 Sep  7 12:19 www.iol.co.za

This gives the (ext3) directory size. An ext3 directory is never shrunk
when files are deleted, so its size reflects the largest number of
entries it has ever held - it is likely that some of these directories
once held many more files than they do now.

This is my development station, so the stats above are probably similar
to yours. I do not have immediate access to my school installations,
which are liable to be much bigger.

What we like to do, however, is a recursive wget of an entire site
(like www.enature.com) over a week or so (in pieces, overnight), and
then direct the class to that site.
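Roughly the sort of crontab entry involved (an illustrative fragment -
the depth, download quota and proxy port here are assumptions, not the
exact commands we run):

```
# Illustrative crontab fragment: mirror part of a site overnight
# through the local wwwoffle proxy so the pages land in its cache.
# Depth (-l), quota (-Q) and the 8080 proxy port are assumptions.
0 2 * * 1-5  http_proxy=http://localhost:8080/ wget -q -r -l 3 -np -Q 50m --wait=1 http://www.enature.com/
```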

I have long expiry times, and some complete sites.

Schools have a bad habit of being the worst case for everything
("Class - check your email ..") and I am always the one to blame ..

From one of the smaller schools that happens to be online :-

% for f in * ; do echo -n $f ; ls $f | wc -l ; done | sort --key=2 | tail -18

        www.delphinium.co.nz    270
        www.rapidsearch.com    272
        www.csmonitor.com    292
        cdn.mapquest.com    344
        www.uvm.edu    348
        www.perennials.com    366
        www.edmundsroses.com    416
        www.bhg.com    434
        www.bbg.org    454
        members.shaw.ca    498
        www.sierra.com    542
        www.amazon.com    634
        www.pal-metto.com    696
        www.egypt.com    774
        images.meredith.com    978
        g-images.amazon.com   1134
        images.amazon.com   2304
        www.stokestropicals.com  18614


# ls -l /var/spool/wwwoffle/http | sort --key=5 -n | tail -6

        drwxrwxr-x    2 apache   uucp        77824 Sep  7 05:54 images.amazon.com
        drwxrwxr-x    2 apache   uucp        98304 Sep  6 21:08 www.nytimes.com
        drwxrwxr-x    2 apache   uucp       110592 Sep  6 21:08 www.csmonitor.com
        drwxrwxr-x    2 apache   uucp       143360 May 19 15:22 www.bday.co.za
        drwxrwxr-x    2 apache   uucp       258048 Sep  6 21:08 www.nationalgeographic.com
        drwxrwxr-x    2 apache   uucp       598016 Sep  7 05:53 www.stokestropicals.com


> What are you doing when you notice that there is the time delay (is it
> creating the host index, or opening a URL from the host or something
> else)?

Opening a URL from a browser. I hide all the index creation stuff from
the casual user - the important indexes are built by a cronjob in the
morning and presented as static pages.

That one nailed me early.

> What filesystem are you using, reiserfs has all sorts of features to
> speed up directory accesses, perhaps this would help?

I use ext3.

My /var partition is separate - for Maildir mail and wwwoffle, so it
would be possible to use reiserfs for this.

Thanks for the suggestion.
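For what it's worth, the squid-style scheme I had in mind can be
sketched in shell: hash the URL and use the first two hex digits of the
hash as a subdirectory name, giving a 256-way fan-out per host. The
layout, the MD5 choice and the example URL below are all illustrative,
not wwwoffle's actual on-disk format:

```shell
#!/bin/sh
# Illustrative sketch of squid-style cache hashing: the first two hex
# digits of an MD5 of the URL pick one of 256 subdirectories, so no
# single directory grows without bound.
hashdir () {
    printf '%s' "$1" | md5sum | cut -c1-2
}

root=$(mktemp -d)        # stand-in for /var/spool/wwwoffle/http
url='http://allafrica.com/stories/200309070001.html'
sub=$(hashdir "$url")    # two hex characters, "00".."ff"
mkdir -p "$root/allafrica.com/$sub"
: > "$root/allafrica.com/$sub/$(printf '%s' "$url" | md5sum | cut -c1-32)"
find "$root" -type f     # show where the cached object landed
rm -rf "$root"
```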

Cheers,    Andy!
