We need to decide this before releasing the new Spider plugin, because once people start running serious spiders to produce new-format indexes, it will be very painful to change.
Should we hash terms before inserting them into the btree?

CON: The current way has spatial locality. Terms in the same language (to the degree that languages use different charsets, or have words starting with weird prefixes), terms with the same stem, and junk numbers and number-symbol combinations (there are surprising numbers of these) tend to be close together and in the same bunch of nodes. This may improve the efficiency of caching slightly, and means that if a particular charset or a large group of numbers is unpopular, it will tend to fall out first. Also, we can search for alphabetically adjacent words, although I don't see how that would help at the moment.

PRO: Spreading stuff out would ensure that (more or less) all terms fall out at the same rate, and would reduce the amount of rebalancing we have to do, which should result in slightly fewer nodes being inserted on an update.
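To make the trade-off concrete, here is a rough sketch in Python (not the plugin's actual code) comparing the key order a btree would see with raw terms versus hashed terms. The names hashed_key and prefix_range, the choice of SHA-256, and the sample terms are all assumptions made up for illustration; a sorted list stands in for the btree's key order.

```python
import hashlib
from bisect import bisect_left


def hashed_key(term: str) -> str:
    """Hypothetical key function: hex SHA-256 digest of the UTF-8 term."""
    return hashlib.sha256(term.encode("utf-8")).hexdigest()


def prefix_range(sorted_keys, prefix):
    """Return all keys starting with `prefix` from a sorted list of raw terms.

    This only works when keys are stored unhashed, because terms sharing a
    prefix are then contiguous in sort order.
    """
    lo = bisect_left(sorted_keys, prefix)
    # U+FFFF sorts after any character we expect in a term (an assumption).
    hi = bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]


# Sample terms: one stem family plus some junk numbers.
terms = ["index", "indexed", "indexer", "indexes", "indexing",
         "12345", "12346", "12347"]

# Raw keys: related terms are neighbours, so they share btree nodes.
raw_order = sorted(terms)

# Hashed keys: the same terms ordered by digest, so neighbours are unrelated.
hashed_order = sorted(terms, key=hashed_key)

if __name__ == "__main__":
    print("raw order:   ", raw_order)
    print("hashed order:", hashed_order)
    print("prefix scan: ", prefix_range(raw_order, "index"))
```

With raw keys the stem family and the numbers each form a contiguous run, which is the locality described under CON, and the prefix scan is a cheap range lookup. With hashed keys the order depends only on the digests, so the terms are spread across the key space as described under PRO, but a prefix scan is no longer possible.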
