[freenet-dev] Should new format indexes be sorted by term or by term hash?

Matthew Toseland Mon, 31 May 2010 20:21:10 +0100

We need to decide before releasing the new Spider plugin, because once people 
start running serious spiders to produce new format indexes, it will be very 
painful to change this.


Should we hash terms before inserting them into the btree?

CON:
The current scheme (sorting by term) has spatial locality. Terms in the same 
language (to the degree that languages use different charsets, or have words 
starting with weird prefixes), terms with the same stem, and junk numbers and 
number-symbol combinations (there are surprising numbers of these) tend to be 
close together and in the same bunch of nodes. This may improve the efficiency 
of caching slightly, and means that if a particular charset or a large group 
of numbers is unpopular, it will tend to fall out first. Also, we can search 
for alphabetically adjacent words, although I don't see how that would help at 
the moment.
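
To make the locality argument concrete, here is a minimal sketch. It is not the Spider or btree code: it packs a sorted term list into fixed-size "leaves" and counts how many leaves hold the terms sharing one stem. The names pack_into_leaves and leaves_touched, the sample vocabulary, the leaf size of 4, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib

def term_hash(term):
    # Assumption: SHA-256 over the UTF-8 term; the thread names no hash function.
    return hashlib.sha256(term.encode("utf-8")).hexdigest()

def pack_into_leaves(terms, key, leaf_size):
    """Sort terms by `key` and pack them into consecutive fixed-size leaves.

    Returns a dict mapping each term to the index of the leaf holding it.
    """
    ordered = sorted(terms, key=key)
    return {term: i // leaf_size for i, term in enumerate(ordered)}

def leaves_touched(assignment, prefix):
    """Count the distinct leaves that hold a term starting with `prefix`."""
    return len({leaf for term, leaf in assignment.items()
                if term.startswith(prefix)})

# Hypothetical vocabulary: junk numbers, one shared stem, a few other words.
TERMS = [
    "0001", "0002", "1234", "12ab",
    "apple", "banana",
    "free", "freed", "freedom", "freely", "freenet", "freeze",
    "index", "spider", "term", "zebra",
]

by_term = pack_into_leaves(TERMS, key=lambda t: t, leaf_size=4)
by_hash = pack_into_leaves(TERMS, key=term_hash, leaf_size=4)

print("leaves holding free*, sorted by term:", leaves_touched(by_term, "free"))
print("leaves holding free*, sorted by hash:", leaves_touched(by_hash, "free"))
```

Sorted by term, the six free* terms are contiguous and sit in two adjacent leaves. Sorted by hash, they can land in anywhere from two to all four leaves, depending on the hash values.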

PRO:
Hashing would spread terms out evenly across the tree. That would ensure that 
(more or less) all terms fall out at the same rate, and would reduce the 
amount of rebalancing we have to do, which should result in slightly fewer 
nodes being inserted on an update.