The current index works, but it is seriously broken in terms of first-time usability. Specifically, because of a bug in XMLSpider that has since been fixed (but unfortunately wAnnA's index hasn't been regenerated), it has a completely insane structure:

- 15 indexes, 1-f, each tens of megabytes in size. If the word is in one of these (which it is 93% of the time), the librarian must download 200+ blocks before it can display results. The upside is that these files are very popular, but even on a new node with lots of connections they take considerable time to fetch.
- Approximately 256 indexes, 0[0-9a-f][0-9a-f], which are tiny. Most of these also manage to persist, but not all of them.
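For concreteness, here is a rough Java sketch of where a word lands under the current layout. The class and method names are mine and purely illustrative of the structure above, not the actual XMLSpider code:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class CurrentLayout {
        // Illustrative only: the subindex a word lives in under the
        // current (buggy) structure described above.
        static String subIndexFor(String word) throws Exception {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(word.getBytes(StandardCharsets.UTF_8));
            // First four hex digits of the word's md5.
            String hex = String.format("%02x%02x", d[0], d[1]);
            if (hex.charAt(0) != '0')
                return hex.substring(0, 1); // one of the 15 huge indexes "1".."f"
            return hex.substring(0, 3);     // one of the ~256 tiny indexes "000".."0ff"
        }
    }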
Because the spider takes weeks to fetch everything from scratch, it is unlikely that anyone will be able to insert a new index before we need to release. The obvious option is to take the existing data and reorganise it into smaller chunks, but if I were to do that I would be inserting an index myself, and I don't think that's a good idea, because 1) it exposes us to additional legal risk, doesn't it? and 2) I definitely don't want to run the spider in the long term unless it's vital, since that would be legally risky, and publishing the index even once would mean we had to move XMLLibrarian to my SSK...

Solutions? If any anonymous person happens to have a big librarian index, or is able to do the reorganisation I mentioned (you need to split the words by md5, and put in the site data for only the words that are in each subindex; see the sketch below), inserting a new librarian index and announcing it anonymously would be *really* helpful right now.
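And a minimal sketch of the reorganisation itself, assuming the existing data can be loaded as a word -> site-data map (the names and the two-hex-digit prefix, giving 256 uniform subindexes, are my choices, not a spec):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class Reorganise {
        // Bucket every word by a fixed-length md5 prefix so all subindexes
        // come out small and uniform. Each bucket carries only the site
        // data for its own words and would be serialised as one subindex.
        static Map<String, Map<String, List<String>>> split(
                Map<String, List<String>> wordToSites, int prefixLen)
                throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            Map<String, Map<String, List<String>>> buckets = new TreeMap<>();
            for (Map.Entry<String, List<String>> e : wordToSites.entrySet()) {
                byte[] d = md5.digest(e.getKey().getBytes(StandardCharsets.UTF_8));
                StringBuilder hex = new StringBuilder();
                for (byte b : d) hex.append(String.format("%02x", b));
                String prefix = hex.substring(0, prefixLen); // e.g. "3a"
                buckets.computeIfAbsent(prefix, k -> new TreeMap<>())
                       .put(e.getKey(), e.getValue());
            }
            return buckets;
        }
    }

With prefixLen = 2 that gives 256 subindexes of roughly equal size, so for any given word the librarian only has to fetch one small subindex instead of a multi-megabyte file.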
