I have been working on integrating infinity0's new format search indexes with XMLSpider.
Old format indexes:

- XMLSpider spiders Freenet, starting with the bookmarks, and finds all text-based pages (currently HTML and plain text).
- XMLSpider indexes all words found on those pages in its internal Perst database.
- When the user asks XMLSpider to generate an index, it reads its database, which involves a vast amount of seeking, and generates an old-format index. This can take a week or more.
- Basically an old-format index is an XML file with a list of sub-files, split by the first few characters of the hexadecimal MD5 of each word. Each sub-file, also XML, contains a list of files followed by a list of words. The sub-files and the main index file can both get huge, and the format is fundamentally not very scalable. They are all inserted in a single USK manifest. One possible alternative would be to not insert it as a site, just post the number of hex digits rather than the actual split, and rely on SSKs, but there are problems with this too - poor balancing, reliability, etc. (A rough Java sketch of the MD5-based split appears further down.)
- The old-format index is published as a (gigantic) directory; the user must insert it manually.

You can see an old-format index here:
USK@5hH~39FtjA7A9~VXWtBKI~prUDTuJZURudDG0xFn3KA,GDgRGt5f6xqbmo-WraQtU54x4H~871Sho9Hz6hC-0RA,AQACAAE/Search/24/index.xml
And an example sub-index within it:
USK@5hH~39FtjA7A9~VXWtBKI~prUDTuJZURudDG0xFn3KA,GDgRGt5f6xqbmo-WraQtU54x4H~871Sho9Hz6hC-0RA,AQACAAE/Search/24/index_00.xml

The new solution:

- Instead of (currently, in addition to) writing the index to its database, XMLSpider buffers indexed page data in RAM and sends it to Library when it passes a given threshold. (A sketch of this buffering appears after the drawbacks list below.)
- On an orderly shutdown, the data buffered in RAM but not yet sent will be written to disk. On a disorderly shutdown it is possible that XMLSpider will lose data.
- Library reads the data, writes it to disk, and then attempts to update the on-Freenet search index directly. It only uploads the btree nodes that have actually been updated, and their parent nodes.
- This is all based on infinity0's work last summer and, since then, on scalable, forkable/copy-on-write, on-Freenet btree indexes. This is a data structure that will probably have wider applications (e.g. wikis), but it is immensely useful here.
- The on-Freenet btree indexes are scalable, unlike the old format. They will use a depth of more than 2 if necessary, although they have fairly large nodes for good fan-out. The top levels can be (but aren't yet) prefetched to improve performance.
- In future, the on-Freenet btree indexes will include the redundant splitfile metadata of each layer in the layer above, making them immune to the loss of any single key apart from the USK at the top. This will cost us some fan-out but is acceptable.
- In future, the on-Freenet btree indexes will likely be structured in such a way that we can fetch a single block (on the critical path, 2 more for propagation) for each layer, in the common case where the block is available. If the block isn't available we'll fetch the whole node. This, combined with prefetch, should make for *REALLY* fast searching.
- The indexes are "forkable" or "copy on write". This allows both updating the tree on Freenet, uploading only the changed nodes, and *updating somebody else's tree* (resulting in a different top CHK, but reusing all the untouched old nodes). (A rough sketch of the copy-on-write update appears further down.)
What actually works now, and drawbacks:

- I have not yet released the new XMLSpider and Library. I have been testing and debugging and ran into many issues, but you can try them out. I will release them soon if it continues to work.
- The indexes are uploaded to a CHK; a USK is then inserted to redirect to it.
- We cache the uploaded data in library-spider-pushed-data-cache/. This directory is *NOT GARBAGE COLLECTED* at present. You can however just delete it if you are confident fred will be able to re-fetch all the index data from the network.
- Index data is stored in library.index.data.[number], data from a clean shutdown of XMLSpider is stored in xmlspider.saved.data, the current edition is in library.index.edition, the last pushed index as a CHK is in library.index.lastpushed.chk, and the public and private keys are in library.index.pubkey and library.index.privkey. None of these files are encrypted, and neither is the XMLSpider database or the Library bookmarks list. These should be added to the relevant bug.
- It takes approximately 90 minutes to do a progressive insert for 16MB of data from the spider. This is ridiculously slow when the spider first starts working, fetching most stuff from the store. However, later on, when it is mostly failing and retrying, it may be acceptable. I dunno, I have not yet reached this point in my testing due to bugs! The last 3 successful inserts here took 86, 90 and 51 minutes respectively.
- This happens in parallel with the spider, but if the lag gets too great (currently 5 such chunks), the spider ends up blocking, waiting for the chunks to be inserted.
- On an unclean shutdown, we re-fetch and re-parse the pages that we fetched since the last time we uploaded data to Library.
- XMLSpider still indexes all the data in its database, and is therefore able to generate old-format indexes if required. New-format index support must be enabled by setting the buffer size, which can be 0 for disabled, up to 128MB. Note that the actual memory usage is probably significantly larger than the buffer size estimate! I'm using a setting of 16MB for testing, with 4GB node max memory. You should set the config option before/when you set the number of fetches, on a new database. There is no support for migrating old databases: we will only write new-format indexes from the progressive data mentioned above.
- Support for old-format indexes means XMLSpider does a good deal more work (CPU, disk, memory) than it needs to. This will be removed eventually, or perhaps made configurable or forked.
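Finally, a minimal sketch of the spider-side buffering described above, assuming a hypothetical TermPageBuffer/LibrarySink split: page data accumulates in RAM, is handed to Library once an estimated size threshold (e.g. 16MB) is crossed, and is dumped to disk on a clean shutdown. None of these class or file names are the real plugin classes; they just illustrate the mechanism.

    import java.io.*;
    import java.util.*;

    public class TermPageBuffer {

        public interface LibrarySink {
            // Hand a batch of term -> pages data to Library for merging
            // into the on-Freenet index.
            void push(Map<String, List<String>> batch) throws IOException;
        }

        private final long thresholdBytes;          // e.g. 16 * 1024 * 1024, 0 = disabled
        private final LibrarySink sink;
        private Map<String, List<String>> buffer = new HashMap<>();
        private long estimatedBytes;

        public TermPageBuffer(long thresholdBytes, LibrarySink sink) {
            this.thresholdBytes = thresholdBytes;
            this.sink = sink;
        }

        public synchronized void add(String term, String pageURI) throws IOException {
            if (thresholdBytes == 0) return; // new-format indexing disabled
            buffer.computeIfAbsent(term, t -> new ArrayList<>()).add(pageURI);
            estimatedBytes += term.length() + pageURI.length(); // crude size estimate
            if (estimatedBytes >= thresholdBytes) {
                sink.push(buffer);           // may block if Library is backlogged
                buffer = new HashMap<>();
                estimatedBytes = 0;
            }
        }

        // On an orderly shutdown, anything not yet pushed is serialised to
        // disk (cf. xmlspider.saved.data) so it survives a restart; an
        // unclean shutdown loses it and the pages are re-fetched.
        public synchronized void saveOnShutdown(File file) throws IOException {
            try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(file))) {
                oos.writeObject(new HashMap<>(buffer));
            }
        }
    }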
