Andrew M. Bishop wrote:

> What is it that you are trying to optimise, time (to create indexes)
> or space (of all the U* files on disk)?

It isn't "either/or": fortunately my method makes WWWOFFLE more efficient in both time and (disk) space.


> If it is the time to create the index pages then any speedup from
> your scheme would be lost to me.
> ...
> With the way that I use WWWOFFLE this means that I only requested 40
> index pages out of about 27000 pages in total (excluding htdig
> running).  This means that any increase in speed from the database of
> U* files would save me no time.

Well, then it is a matter of personal priorities. I generate index pages several times a day, and some of the sites I visit most frequently have several thousand entries in the cache. To me it matters a lot whether an index of that size takes a few seconds or a minute to generate. It also matters whether a "wwwoffle -purge" takes a few minutes or a quarter of an hour.


> If it is space on the disk that you are saving then there is no
> significant saving compared to combining the U* and D* files in one
> single file.  There will statistically be some saving with all of the
> U* data in a single file, but this is probably only 10% better than
> putting the U* data into the D* files.

This idea had been in the back of my mind for a long time, but I realized that it would do little to speed things up (and might even slow things down). It wasn't until I looked at Marc Boucher's wwwoffle-ls2 script that I was convinced that a URL hash table was the way forward.
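The appeal of the hash-table approach can be sketched roughly as follows (a hypothetical illustration in Python, not WWWOFFLE's actual code): if cache files are named after a hash of the URL, then recovering the URL for an index page normally means opening one small U* file per cached page; a single in-memory table mapping hashed name back to URL turns that into one lookup per page.

```python
import hashlib

def cache_name(url):
    # Illustrative only; WWWOFFLE's real hash function and naming differ.
    return "U" + hashlib.md5(url.encode()).hexdigest()[:16]

class UrlTable:
    """Hypothetical table of (hashed filename -> URL) entries.

    Loaded once from a single database file, it replaces opening
    thousands of tiny U* files when generating an index.
    """
    def __init__(self):
        self._table = {}

    def add(self, url):
        name = cache_name(url)
        self._table[name] = url
        return name

    def lookup(self, name):
        # O(1) lookup instead of one open()/read() per cached page.
        return self._table.get(name)

table = UrlTable()
name = table.add("http://example.com/index.html")
assert table.lookup(name) == "http://example.com/index.html"
```

The point is not the hashing itself but the access pattern: one sequential read of a single file beats tens of thousands of opens of scattered small files, both for index generation and for a purge pass.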


> I also did some optimisations recently (a few months ago actually)
> which you will all be able to enjoy in version 2.9.

It's good to hear that WWWOFFLE is still being improved. The changes from 2.7 to 2.8 were very welcome. I'll be looking forward to seeing the new optimisations.

--
Paul Rombouts
