Having many small files is quite inefficient. For example, on my system each U* file occupies 4K of disk space. With 78587 U* files that is 307MB of occupied disk space. However, the total size of these files is little more than 5MB.
My initial motivation for getting rid of the U* files was my irritation with the long time it took WWWOFFLE to generate an index page of a cache directory containing thousands of files. After a while I started to suspect that reading many small files is a very expensive operation. When Marc Boucher explained to me how his wwwoffle-ls2 utility uses a separate lookup table to make pattern matching of URLs significantly more efficient, I got an idea of how to implement this directly in WWWOFFLE.
In the past few weeks I have been experimenting with a design where the URLs of the cached webpages are stored in a single compact database file called "urlhashtable". This file is mmapped into an area of address space that is shared between all WWWOFFLE processes.
So far, I am very pleased with the results. The amount of disk space the database file occupies is less than 7MB. The improvement in the time it takes to generate index pages is dramatic. A WWWOFFLE purge operation with "use-url = yes" is also much faster now (by as much as a factor of 4 on my system). I also believe that storing the contents of webpages while online has become somewhat faster (because only one file per webpage needs to be written instead of two), but I haven't actually done any measurements to verify this.
Should you wish to try this out for yourself and/or study how my implementation works, I have made a patch file available on my WWWOFFLE webpage:
http://www.phys.uu.nl/~rombouts/wwwoffle.html or http://members.home.nl/p.a.rombouts/wwwoffle.html
-- Paul Rombouts
