The current index works, but its first-time usability is seriously broken. 
Because of a bug in XMLSpider that has since been fixed (unfortunately wAnnA 
hasn't been updated to match), the index has a completely insane structure:
- 15 indexes, 1-f, each tens of megabytes in size. If the word is in one of 
these (which it is 93% of the time), the librarian must download 200+ blocks 
before it can display any results. The upside is that these are very popular, 
but even on a new node with lots of connections they take considerable time to 
fetch.
- Approximately 256 indexes, 0[0-9a-f][0-9a-f], which are tiny. Most of these 
are also managing to persist, but not all of them.

Because the spider takes weeks to fetch everything from scratch, it is unlikely 
that anyone will be able to insert a new index before we need to release. The 
obvious option is to take the existing data and reorganise it into smaller 
chunks, but if I were to do that I would be inserting an index myself, and I 
don't think that's a good idea: 1) it exposes us to additional legal risk, 
doesn't it? And 2) I definitely don't want to run the spider in the long term 
unless it's vital, and that would definitely be legally risky; even publishing 
the index once would mean we had to move XMLLibrarian to my SSK...

Solutions? If any anonymous person happens to have a big librarian index, or is 
able to do the reorganisation I mentioned (you need to split the words by MD5, 
and include the site data only for the words that belong in each subindex), 
inserting a new librarian index and announcing it anonymously would be *really* 
helpful right now.
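The reorganisation could be sketched roughly like this (a minimal Python 
sketch, not wAnnA's actual code: the `split_index` helper, the two-hex-digit 
prefix length, and the flat `{word: site entries}` input shape are all my 
assumptions, not the real index format):

```python
import hashlib
from collections import defaultdict

def md5_prefix(word, length=2):
    # Bucket key: the first `length` hex digits of the word's MD5.
    # (Assumption: the word is hashed as-is, UTF-8 encoded.)
    return hashlib.md5(word.encode("utf-8")).hexdigest()[:length]

def split_index(index, prefix_len=2):
    """Reorganise a flat {word: site_entries} mapping into subindexes
    keyed by MD5 prefix, each carrying only the site data for the words
    that actually fall into that subindex."""
    subindexes = defaultdict(dict)
    for word, sites in index.items():
        subindexes[md5_prefix(word, prefix_len)][word] = sites
    return dict(subindexes)

# Each resulting subindex would then be serialised and inserted as its
# own small file, so the librarian only fetches the bucket it needs.
```

With a two-hex-digit prefix this yields up to 256 subindexes of roughly equal 
size, so a lookup downloads one small file instead of a multi-megabyte chunk.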