I have been working on integrating infinity0's new format search indexes with 
XMLSpider.

Old format indexes:
- XMLSpider spiders Freenet, starting with the bookmarks, and finds all 
text-based pages (HTML and plain text currently).
- XMLSpider indexes all words found on text-based pages, in its internal Perst 
database.
- When the user asks XMLSpider to generate an index, it reads its database, 
which involves a vast amount of disk seeking, and generates an old-format 
index. This can take a week or more.
- Basically an old format index is an XML file with a list of sub-files, split 
by the first few hex digits of the MD5 hash of each word (there is a short 
sketch of this split after the example keys below). Each sub-file, also XML, 
contains a list of files followed by a list of words. The sub-files and the 
main index file can both get huge, and the format is fundamentally not very 
scalable. They are all inserted in a single USK manifest. One possible 
alternative would be not to insert it as a site, but to publish only the 
number of hex digits rather than the actual split and rely on SSKs; this has 
problems too though - poor balancing, reliability, etc.
- The old-format index is published as a (gigantic) directory; the user must 
insert it manually.

You can see an old format index here:
USK@5hH~39FtjA7A9~VXWtBKI~prUDTuJZURudDG0xFn3KA,GDgRGt5f6xqbmo-WraQtU54x4H~871Sho9Hz6hC-0RA,AQACAAE/Search/24/index.xml
And an example sub-index within it:
USK@5hH~39FtjA7A9~VXWtBKI~prUDTuJZURudDG0xFn3KA,GDgRGt5f6xqbmo-WraQtU54x4H~871Sho9Hz6hC-0RA,AQACAAE/Search/24/index_00.xml
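
As a rough illustration of the split, here is how a word could be mapped to a 
sub-index file. The two-hex-digit prefix and the index_<prefix>.xml naming are 
only assumptions based on the example sub-index above, not the exact XMLSpider 
code:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class OldFormatLookup {
        // Sketch only: prefix length and file naming are assumed, not taken
        // from the real XMLSpider code.
        static String subIndexFor(String word, int prefixDigits) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(word.toLowerCase().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest)
                hex.append(String.format("%02x", b));
            return "index_" + hex.substring(0, prefixDigits) + ".xml";
        }

        public static void main(String[] args) throws Exception {
            // Prints index_<first two hex digits of md5("freenet")>.xml
            System.out.println(subIndexFor("freenet", 2));
        }
    }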

The new solution:
- Instead of (for now, in addition to) writing the index to its database, 
XMLSpider buffers indexed page data in RAM and sends it to Library once it 
passes a given threshold.
- On an orderly shutdown, the data buffered in RAM but not yet sent will be 
written to disk. On a disorderly shutdown it is possible that XMLSpider will 
lose data.
- Library reads the data, writes it to disk, and then attempts to update the 
on-Freenet search index directly. It only uploads the btree nodes that have 
actually been updated, or their parent nodes.
- This is all based on infinity0's work last summer, and since then, on 
scalable forkable/copy-on-write on-Freenet btree indexes. This is a data 
structure that will probably have wider applications (e.g. wikis), but it is 
immensely useful here.
- The on-Freenet btree indexes are scalable, unlike the old format. They will 
use a depth of more than 2 if necessary, although they have fairly large nodes 
for good fan-out. The top levels can be (but aren't yet) prefetched to improve 
performance.
- In future, the on-freenet btree indexes will include the redundant splitfile 
metadata of each layer in the layer above, making them immune to the loss of 
any single key apart from the USK at the top. This will cost us some fan-out 
but is acceptable.
- In future, the on-freenet btree indexes will likely be structured in such a 
way that we can fetch a single block (on the critical path, 2 more for 
propagation) for each layer, in the common case where the block is available. 
If the block isn't available we'll fetch the whole node. This combined with 
prefetch should make for *REALLY* fast searching.
- The indexes are "forkable", i.e. copy-on-write. This allows both for 
updating the tree on Freenet while only uploading the changed nodes, and for 
*updating somebody else's tree* (resulting in a different top CHK, but reusing 
all the untouched old nodes). There is a short sketch of this idea after this 
list.
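
To make the copy-on-write idea concrete, here is a toy path-copying sketch: 
updating one leaf copies only the nodes on the path from the root to that 
leaf, and everything else is shared (on Freenet, referenced by its existing 
CHK and never re-uploaded). The class and field names are made up for this 
sketch; they are not Library's real data structures:

    import java.util.ArrayList;
    import java.util.List;

    class CowNode {
        String chk;                 // stands in for the CHK this node was inserted under
        List<CowNode> children = new ArrayList<>();
        String leafData;            // only used by leaves in this toy model

        CowNode shallowCopy() {
            CowNode n = new CowNode();
            n.chk = null;           // copied node has changed, so it needs a fresh insert
            n.children = new ArrayList<>(children); // old children reused as-is
            n.leafData = leafData;
            return n;
        }

        // Returns a new root; only path.length + 1 nodes are new, the rest is shared.
        static CowNode updateLeaf(CowNode node, int[] path, int depth, String newData) {
            CowNode copy = node.shallowCopy();
            if (depth == path.length) {
                copy.leafData = newData;
            } else {
                int i = path[depth];
                copy.children.set(i,
                        updateLeaf(node.children.get(i), path, depth + 1, newData));
            }
            return copy;            // the old root is still valid: that is the "fork"
        }
    }

The same operation works against somebody else's root node: you get a new top 
CHK, but all the nodes you didn't touch are reused unchanged.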

What actually works now, and drawbacks:
- I have not yet released the new XMLSpider and Library. I have been testing 
and debugging and have run into many issues, but you can try them out. I will 
release them soon if they continue to work.
- The indexes are uploaded as a CHK; a USK is then inserted that redirects to it.
- We cache the uploaded data in library-spider-pushed-data-cache/. This 
directory is *NOT GARBAGE COLLECTED* at present. You can however just delete it 
if you are confident fred will be able to re-fetch all the index data from the 
network.
- Index data is stored in library.index.data.[number]; data from a clean 
shutdown of XMLSpider is stored in xmlspider.saved.data; the current edition 
is in library.index.edition; the CHK of the last pushed index is in 
library.index.lastpushed.chk; and the public and private keys are in 
library.index.pubkey and library.index.privkey. None of these files are 
encrypted, and neither is the XMLSpider database or the Library bookmarks 
list. These should be added to the relevant bug.
- It takes approximately 90 minutes to do a progressive insert for 16MB of data 
from the spider. This is ridiculously slow when the spider first starts 
working, fetching most stuff from the store. However, later on, when it is 
mostly failing and retrying, it may be acceptable. I dunno, I have not yet 
reached this point in my testing due to bugs! The last 3 successful inserts 
here took 86, 90 and 51 minutes respectively.
- This happens in parallel with the spider, but if the lag gets too big 
(currently 5 such chunks waiting), the spider ends up blocking until the 
pending chunks have been inserted (see the sketch at the end of this list).
- On an unclean shutdown, we re-fetch and re-parse the pages that we fetched 
since the last time we uploaded data to Library.
- XMLSpider still indexes all the data in its database, and is therefore still 
able to generate old-format indexes if required. New format index support must 
be enabled by setting the buffer size, from 0 (disabled) up to 128MB. Note 
that the actual memory usage is probably significantly larger than the buffer 
size estimate! I'm using a setting of 16MB for testing, with 4GB node max 
memory. You should set this config option before or when you set the number of 
fetches, on a new database. There is no support for migrating old databases: 
we will only write new format indexes from the progressive data mentioned 
above.
- Support for old-format indexes means XMLSpider does a good deal more work 
(CPU, disk, memory) than it needs to. This will be removed eventually, or 
perhaps made configurable or forked.
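
For anyone curious what the buffering and back-pressure looks like, here is a 
toy sketch of the behaviour described above: the spider accumulates indexed 
entries in RAM, hands a chunk over once it passes the configured threshold, 
and blocks if 5 chunks are already waiting to be inserted. All names here are 
hypothetical; this is not the real plugin API:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    class SpiderBuffer {
        private final long thresholdBytes;
        // At most 5 chunks may be pending insert before the spider blocks.
        private final BlockingQueue<List<String>> pending = new ArrayBlockingQueue<>(5);
        private List<String> current = new ArrayList<>();
        private long currentBytes = 0;

        SpiderBuffer(long thresholdBytes) {
            this.thresholdBytes = thresholdBytes;
        }

        // Called by the spider for each indexed term/page entry.
        void add(String entry) throws InterruptedException {
            current.add(entry);
            currentBytes += entry.length();
            if (currentBytes >= thresholdBytes) {
                pending.put(current);       // blocks the spider if the queue is full
                current = new ArrayList<>();
                currentBytes = 0;
            }
        }

        // Called by the uploader ("Library" side) to take the next chunk to insert.
        List<String> nextChunk() throws InterruptedException {
            return pending.take();
        }
    }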