[EMAIL PROTECTED] (Andrew M. Bishop):
"Paul A. Rombouts" <[EMAIL PROTECTED]> writes:
PAR>>> Having many small files is quite inefficient.
PAR>>> For example, on my system each U* file occupies 4K of disk space.
Micha> Would there be any problem to make a wwwoffle cache partition
Micha> 'blocksize 1024' ?
Micha> With ext3 or Reiser ?
As I have written in another posting, I appreciate that it is possible to remedy the wasted disk space (to a degree) by tuning your file system. But the main problem for me is that reading thousands of small files is very time-consuming, and that cannot be significantly improved by tweaking the file system. I run WWWOFFLE on very old hardware. The speedup resulting from using one URL database file is very noticeable and very useful to me. The extra disk space that has become available is a welcome added bonus.
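The cost difference can be sketched in a few lines (illustrative Python, not WWWOFFLE's actual C code; the file names, layout and index structure are invented for the example): with one file per URL, every lookup pays a separate open/read/close, while a single database file is written once and entries are located by offset.

```python
import os
import tempfile

def write_small_files(root, entries):
    # Old-style layout (illustrative): one tiny U* file per URL.
    for name, data in entries.items():
        with open(os.path.join(root, name), "wb") as f:
            f.write(data)

def read_small_file(root, name):
    # Every lookup costs a full open/read/close of a separate file.
    with open(os.path.join(root, name), "rb") as f:
        return f.read()

def write_single_db(path, entries):
    # Single-file layout (illustrative): all entries packed into one
    # file, located through an (offset, length) index.
    index = {}
    with open(path, "wb") as f:
        for name, data in entries.items():
            index[name] = (f.tell(), len(data))
            f.write(data)
    return index

def read_from_db(path, index, name):
    # One seek and one read inside the already-known file.
    offset, length = index[name]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Small demo: 100 entries in both layouts.
root = tempfile.mkdtemp()
entries = {"U%04d" % i: b"entry-%d" % i for i in range(100)}
write_small_files(root, entries)
index = write_single_db(os.path.join(root, "urls.db"), entries)
```

With thousands of entries the per-file variant multiplies the filesystem overhead by the number of entries, which is exactly the cost that dominates on old hardware.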
AMB>> There is no one special file that contains all of the magic to enable
AMB>> the program to work. Even better there is no single file that will
AMB>> cause the program to fail if it gets lost or corrupted.
Micha> How about a compromise: Have a hash file in every domain dir ?
Micha> Maybe for some weird reason one domain could get corrupted.
Micha> But it would not affect any other. And still you can operate
Micha> on single domains easily.
Marc Boucher has written a wwwoffle-ls2 utility in Perl that uses this method to great effect. In fact, it was the wwwoffle-ls2 script that taught me how much can be gained by getting rid of the U* files, and it inspired my solution. Marc and I have discussed whether it is better to have a hash file in every domain directory, as Marc prefers, or a single hash file, as I have implemented in WWWOFFLE. I appreciate that Marc's method has a lot of merit, but for purely technical reasons I found it easier to implement a single database file. If Marc finds the time to implement his method in WWWOFFLE, I will be very interested to test which method works better.
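The two designs can be contrasted with a trivial sketch (hypothetical file names; this is neither Marc's script nor WWWOFFLE code): the only real difference is where the hash file for a given URL lives, which determines the blast radius of a corrupted file.

```python
from urllib.parse import urlsplit

def hash_location_single(url):
    # Single-database variant: every URL's hash entry lives in one
    # file at the top of the cache. Simpler to implement, but a
    # corruption affects all domains at once.
    return "urlhashes.db"

def hash_location_per_domain(url):
    # Per-domain variant: each domain directory carries its own hash
    # file, so corruption stays confined to that one domain and
    # single domains can be operated on independently.
    host = urlsplit(url).hostname
    return "%s/urlhashes.db" % host
```

Both schemes map the same URL set; the trade-off is purely between implementation simplicity and fault isolation.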
AMB>> I also keep WWWOFFLE simple by not having the processes communicate
AMB>> between themselves. WWWOFFLE has many processes rather than
AMB>> multi-threading or any other inter-process communication. This means
AMB>> that any of the processes can die or start corrupting memory without
AMB>> affecting any other.
Micha> With pages full of weird javascript or embedded flashs 'out
Micha> there' (and who knows what next), this does sound comprehensible
Micha> to me.
Micha> I've got problems enought with the browser threadings.
As I have explained in another posting, the risk of failure propagation through memory corruption with the type of shared memory I have used is very limited. Only a limited region of memory, used for looking up URLs, is shared; the rest is protected as before. And "weird javascript or embedded flashs" are only a problem for the browser, not for WWWOFFLE, which doesn't interpret this kind of data but merely copies it or passes it on.
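The principle can be demonstrated with a minimal sketch (illustrative only, not WWWOFFLE's implementation): after a fork, only an explicitly created anonymous shared mapping is visible to both processes, while all ordinary memory is copy-on-write private, so a stray write in the child cannot reach the parent's normal data.

```python
import mmap
import os

def demo_shared_lookup_region(size=4096):
    # Only this fixed-size anonymous mapping is shared between the
    # forked processes (mmap with fd -1 defaults to
    # MAP_SHARED | MAP_ANONYMOUS on Unix). Everything else is
    # copy-on-write private after fork().
    region = mmap.mmap(-1, size)
    region[:9] = b"urlhash-A"
    pid = os.fork()
    if pid == 0:
        region[:9] = b"urlhash-B"  # write to the shared region: visible to the parent
        private = b"child-only"    # ordinary memory: never seen by the parent
        os._exit(0)
    os.waitpid(pid, 0)
    return bytes(region[:9])
```

Because the shared window is small and only holds lookup data, a crashing or corrupted child can at worst damage that one table, never the rest of another process's memory.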
-- 
Paul Rombouts
