I have been working on integrating infinity0's new format search indexes with XMLSpider.
Old format indexes:

- XMLSpider spiders Freenet, starting with the bookmarks, and finds all text-based pages (currently HTML and plain text).
- XMLSpider indexes all words found on those pages in its internal Perst database.
- When the user asks XMLSpider to generate an index, it reads its database, which involves a vast amount of seeking, and generates an old-format index. This can take a week or more.
- Basically an old-format index is an XML file with a list of sub-files, split by the first few characters of the hexadecimal MD5 of each word. Each sub-file, also XML, contains a list of files followed by a list of words. The sub-files and the main index file can both get huge, and the format is fundamentally not very scalable. They are all inserted in a single USK manifest. One possible alternative would be to not insert it as a site, just post the number of hex digits rather than the actual split, and rely on SSKs, but there are problems with this too - poor balancing, reliability, etc. (A rough Java sketch of the MD5-based split appears further down.)
- The old-format index is published as a (gigantic) directory; the user must insert it manually.

You can see an old-format index here:
USK@5hH~39FtjA7A9~VXWtBKI~prUDTuJZURudDG0xFn3KA,GDgRGt5f6xqbmo-WraQtU54x4H~871Sho9Hz6hC-0RA,AQACAAE/Search/24/index.xml
And an example sub-index within it:
USK@5hH~39FtjA7A9~VXWtBKI~prUDTuJZURudDG0xFn3KA,GDgRGt5f6xqbmo-WraQtU54x4H~871Sho9Hz6hC-0RA,AQACAAE/Search/24/index_00.xml

The new solution:

- Instead of (currently, in addition to) writing the index to its database, XMLSpider buffers indexed page data in RAM and sends it to Library when it passes a given threshold. (A sketch of this buffering appears after the drawbacks list below.)
- On an orderly shutdown, the data buffered in RAM but not yet sent will be written to disk. On a disorderly shutdown it is possible that XMLSpider will lose data.
- Library reads the data, writes it to disk, and then attempts to update the on-Freenet search index directly. It only uploads the btree nodes that have actually been updated, and their parent nodes.
- This is all based on infinity0's work last summer and, since then, on scalable, forkable/copy-on-write, on-Freenet btree indexes. This is a data structure that will probably have wider applications (e.g. wikis), but it is immensely useful here.
- The on-Freenet btree indexes are scalable, unlike the old format. They will use a depth of more than 2 if necessary, although they have fairly large nodes for good fan-out. The top levels can be (but aren't yet) prefetched to improve performance.
- In future, the on-Freenet btree indexes will include the redundant splitfile metadata of each layer in the layer above, making them immune to the loss of any single key apart from the USK at the top. This will cost us some fan-out but is acceptable.
- In future, the on-Freenet btree indexes will likely be structured in such a way that we can fetch a single block (on the critical path, 2 more for propagation) for each layer, in the common case where the block is available. If the block isn't available we'll fetch the whole node. This, combined with prefetch, should make for *REALLY* fast searching.
- The indexes are "forkable" or "copy on write". This allows both updating the tree on Freenet, uploading only the changed nodes, and *updating somebody else's tree* (resulting in a different top CHK, but reusing all the untouched old nodes). (A rough sketch of the copy-on-write update appears further down.)
What actually works now, and drawbacks:

- I have not yet released the new XMLSpider and Library. I have been testing and debugging and ran into many issues, but you can try them out. I will release them soon if it continues to work.
- The indexes are uploaded to a CHK; a USK is then inserted to redirect to it.
- We cache the uploaded data in library-spider-pushed-data-cache/. This directory is *NOT GARBAGE COLLECTED* at present. You can however just delete it if you are confident fred will be able to re-fetch all the index data from the network.
- Index data is stored in library.index.data.[number], data from a clean shutdown of XMLSpider is stored in xmlspider.saved.data, the current edition is in library.index.edition, the last pushed index as a CHK is in library.index.lastpushed.chk, and the public and private keys are in library.index.pubkey and library.index.privkey. None of these files are encrypted, and neither is the XMLSpider database or the Library bookmarks list. These should be added to the relevant bug.
- It takes approximately 90 minutes to do a progressive insert for 16MB of data from the spider. This is ridiculously slow when the spider first starts working, fetching most stuff from the store. However, later on, when it is mostly failing and retrying, it may be acceptable. I dunno, I have not yet reached this point in my testing due to bugs! The last 3 successful inserts here took 86, 90 and 51 minutes respectively.
- This happens in parallel with the spider, but if the lag gets too great (currently 5 such chunks), the spider ends up blocking, waiting for the chunks to be inserted.
- On an unclean shutdown, we re-fetch and re-parse the pages that we fetched since the last time we uploaded data to Library.
- XMLSpider still indexes all the data in its database, and is therefore able to generate old-format indexes if required. New-format index support must be enabled by setting the buffer size, which can be 0 for disabled, up to 128MB. Note that the actual memory usage is probably significantly larger than the buffer size estimate! I'm using a setting of 16MB for testing, with 4GB node max memory. You should set the config option before/when you set the number of fetches, on a new database. There is no support for migrating old databases: we will only write new-format indexes from the progressive data mentioned above.
- Support for old-format indexes means XMLSpider does a good deal more work (CPU, disk, memory) than it needs to. This will be removed eventually, or perhaps made configurable or forked.
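Finally, a minimal sketch of the spider-side buffering described above, assuming a hypothetical TermPageBuffer/LibrarySink split: page data accumulates in RAM, is handed to Library once an estimated size threshold (e.g. 16MB) is crossed, and is dumped to disk on a clean shutdown. None of these class or file names are the real plugin classes; they just illustrate the mechanism.

    import java.io.*;
    import java.util.*;

    public class TermPageBuffer {

        public interface LibrarySink {
            // Hand a batch of term -> pages data to Library for merging
            // into the on-Freenet index.
            void push(Map<String, List<String>> batch) throws IOException;
        }

        private final long thresholdBytes;          // e.g. 16 * 1024 * 1024, 0 = disabled
        private final LibrarySink sink;
        private Map<String, List<String>> buffer = new HashMap<>();
        private long estimatedBytes;

        public TermPageBuffer(long thresholdBytes, LibrarySink sink) {
            this.thresholdBytes = thresholdBytes;
            this.sink = sink;
        }

        public synchronized void add(String term, String pageURI) throws IOException {
            if (thresholdBytes == 0) return; // new-format indexing disabled
            buffer.computeIfAbsent(term, t -> new ArrayList<>()).add(pageURI);
            estimatedBytes += term.length() + pageURI.length(); // crude size estimate
            if (estimatedBytes >= thresholdBytes) {
                sink.push(buffer);           // may block if Library is backlogged
                buffer = new HashMap<>();
                estimatedBytes = 0;
            }
        }

        // On an orderly shutdown, anything not yet pushed is serialised to
        // disk (cf. xmlspider.saved.data) so it survives a restart; an
        // unclean shutdown loses it and the pages are re-fetched.
        public synchronized void saveOnShutdown(File file) throws IOException {
            try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(file))) {
                oos.writeObject(new HashMap<>(buffer));
            }
        }
    }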
