On Sat, 06 Sep 2003, Andrew M. Bishop wrote:
> Andy Rabagliati <[EMAIL PROTECTED]> writes:
>
> > At the moment a site's files, and URL unhash, are kept in a flat
> > directory /var/spool/wwwoffle/http/www.domain.com/*.
> >
> > These directories can get really big, and can take a significant
> > time to open.
> >
> > Can you hash these into subdirectories please, like squid and co.?
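To make the request concrete, this is roughly the kind of fan-out I have in mind. It is only a sketch: the helper name bucket_path and the MD5-prefix scheme are my own illustration, not how squid or WWWOFFLE actually lay out their files, and it assumes md5sum from GNU coreutils is installed.

```shell
# bucket_path NAME: print a hypothetical two-level location for NAME.
# The buckets are the first four hex digits of the MD5 of the name,
# giving up to 256 x 256 subdirectories under each host directory.
bucket_path() {
    h=$(printf '%s' "$1" | md5sum | cut -c1-4)
    a=${h%??}    # first two hex digits
    b=${h#??}    # last two of the four
    printf '%s/%s/%s\n' "$a" "$b" "$1"
}
```

A cache file would then live at www.domain.com/$(bucket_path FILE) instead of directly in www.domain.com/, so no single directory has to hold every file for a busy host.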
>
> For example, I currently have 808 directories for different hosts that
> I have cached files from. Of this total there are 195 directories
> that have only a single URL stored in them (24.4%). Directories with
> 5 URLs or fewer make up 47.3% of them, 10 URLs or fewer is 59.8% and
> 20 URLs or fewer is 75.5%. I don't know what you would consider as a
> large directory that would be slow, but I would guess 256 files is OK.
> This would be 128 URLs which is 97.4% of the directories. This means
> that there are fewer than 3% of the directories that would benefit
> from this change.
> What size directories do you have that cause problems?
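For comparing like with like, percentages of that kind can be recomputed from a list of "host count" lines such as the ones I give below. A rough sketch, assuming two cache files per URL (which is what your 256 files = 128 URLs figure implies); pct_at_most is just a name I made up:

```shell
# pct_at_most N: read "host file_count" lines on stdin and print the
# percentage of hosts that hold N URLs or fewer.
# Assumption: each URL is stored as two files, so URLs = files / 2.
pct_at_most() {
    awk -v limit="$1" '
        NF >= 2 { total++; if ($2 / 2 <= limit) small++ }
        END     { if (total) printf "%.1f\n", 100 * small / total }
    '
}
```

Piping a per-host listing into pct_at_most 128 would give the share of directories under the 256-file guess.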
Hmmm ..
Out of 2552 site directories,
% cd /var/spool/wwwoffle/http ; for f in * ; do echo -n $f ; ls $f | wc -l ; done | sort --key=2 | tail -18
us.news1.yimg.com 198
slashdot.org 220
www.microsoft.com 220
www.sfgate.com 222
www.google.com 236
www.alsangels.com 288
images.slashdot.org 336
images-aud.slashdot.org 366
www.iol.co.za 374
ad.za.doubleclick.net 388
ar.atwola.com 498
adsrv.iol.co.za 514
ads.osdn.com 518
news.google.com 682
www.csmonitor.com 1200
a.coza.com 1720
news.bbc.co.uk 2390
allafrica.com 4608
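The one-liner above only sorts correctly because wc pads its output to a fixed width. A slightly more careful sketch of the same count, with an explicitly numeric sort (count_entries is a made-up name, and it assumes host directory names contain no spaces):

```shell
# count_entries DIR: print "host count" for every subdirectory of DIR,
# smallest first. The sort is numeric on the second field, so it does
# not depend on how wc pads its output.
count_entries() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue
        name=$(basename "$d")
        n=$(($(ls "$d" | wc -l)))    # arithmetic expansion strips padding
        printf '%s %s\n' "$name" "$n"
    done | sort -k2,2n
}
```

For example, count_entries /var/spool/wwwoffle/http | tail -18 should reproduce a listing like the one above.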
% ls -l /var/spool/wwwoffle/http | sort --key=5 -n | tail -6
drwxrwxr-x 2 apache uucp 73728 Sep 7 12:09 news.google.com
drwxrwxr-x 2 apache uucp 77824 Sep 7 12:11 news.bbc.co.uk
drwxrwxr-x 2 apache uucp 77824 Sep 7 12:19 adsrv.iol.co.za
drwxrwxr-x 2 apache uucp 139264 Sep 7 12:28 www.mg.co.za
drwxrwxr-x 2 apache uucp 151552 Sep 7 12:25 allafrica.com
drwxrwxr-x 2 apache uucp 552960 Sep 7 12:19 www.iol.co.za
This gives the (ext3) directory size, which I believe is a function of the
largest number of files that have ever been there - so it is likely that
there have been many more files in some directories.
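If that belief is right, putting the directory size next to the current entry count should show which directories were once much fuller. A small sketch (dir_size_vs_entries is a made-up name; it relies on field 5 of "ls -ld" being the size, as in the listing above):

```shell
# dir_size_vs_entries DIR: for each subdirectory of DIR print
# "dir_bytes entries host", largest directory file first.
dir_size_vs_entries() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue
        d=${d%/}
        # Field 5 of "ls -ld" is the size of the directory file itself.
        size=$(ls -ld "$d" | awk '{ print $5 }')
        n=$(($(ls "$d" | wc -l)))
        printf '%s %s %s\n' "$size" "$n" "${d##*/}"
    done | sort -k1,1nr
}
```

A host with a large first column but a small second one has probably held many more files in the past.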
This is my development station - I do not have immediate access to all my
school installations, which are liable to be much bigger. So, my stats
are probably similar to yours.
What we like to do, however, is to do a recursive wget of an entire site
(like www.enature.com) over a week or so (in pieces, overnight) and then
direct the class to that site.
I have long expiry times, and some complete sites.
Schools have a bad habit of being the worst case for everything ("Class,
check your email ..") and I am always the one who gets the blame ..
From one of the smaller schools that happens to be online:
% for f in * ; do echo -n $f ; ls $f | wc -l ; done | sort --key=2 | tail -18
www.delphinium.co.nz 270
www.rapidsearch.com 272
www.csmonitor.com 292
cdn.mapquest.com 344
www.uvm.edu 348
www.perennials.com 366
www.edmundsroses.com 416
www.bhg.com 434
www.bbg.org 454
members.shaw.ca 498
www.sierra.com 542
www.amazon.com 634
www.pal-metto.com 696
www.egypt.com 774
images.meredith.com 978
g-images.amazon.com 1134
images.amazon.com 2304
www.stokestropicals.com 18614
# ls -l /var/spool/wwwoffle/http | sort --key=5 -n | tail -6
drwxrwxr-x 2 apache uucp 77824 Sep 7 05:54 images.amazon.com
drwxrwxr-x 2 apache uucp 98304 Sep 6 21:08 www.nytimes.com
drwxrwxr-x 2 apache uucp 110592 Sep 6 21:08 www.csmonitor.com
drwxrwxr-x 2 apache uucp 143360 May 19 15:22 www.bday.co.za
drwxrwxr-x 2 apache uucp 258048 Sep 6 21:08 www.nationalgeographic.com
drwxrwxr-x 2 apache uucp 598016 Sep 7 05:53 www.stokestropicals.com
> What are you doing when you notice that there is the time delay (is it
> creating the host index, or opening a URL from the host or something
> else)?
Opening a URL from a browser. I hide all the index creation stuff from
the casual user - and run the important ones from a cronjob in the
morning and present it as a static page.
That one nailed me early.
> What filesystem are you using? reiserfs has all sorts of features to
> speed up directory accesses, perhaps this would help?
I use ext3.
My /var partition is separate - for Maildir mail and wwwoffle, so it
would be possible to use reiserfs for this.
Thanks for the suggestion.
Cheers, Andy!