On Monday 14 November 2005 09:02, Matthew Toseland wrote:
> 0.7 will not have streams. However, it could have searching support. It
> would be straightforward - a few days' work - to write client code which
> downloads published indexes from a series of known spiders and allows
> the user to search those indexes. Now, this would not scale if they all
> have to be prefetched. But for 0.7.0, it doesn't need to. If we abandon
> the requirement to download all the indexes in advance and instead
> download them on demand (if the index is large, which it won't be to
> start with), things slow down significantly, but searching remains
> possible. Hopefully it won't be too bad in terms of speed, and even if
> it is really slow, if we take this into account in interface design it
> doesn't have to be too painful; searches can run in the background,
> just like downloads.
>
> Advantages:
> - Every user sooner or later asks "why isn't freenet searchable?".
> - Most users who know why would still like it to be searchable.
> - It scales better than index sites. Spiders with something resembling
>   Google PageRank can run entirely automated.
> - It is easier to use, and more familiar.
> - We don't have to provide initial links. The user expects that if they
>   type in "dolphin porn", they'll get what they asked for, just like
>   they do on Google or Kazaa. We might want to provide some anyway for
>   0.7.0, but it'd be a step in the right direction.
>
> One obvious disadvantage is that users will search for something, won't
> find it, and will assume freenet is crap. :)
>
> Proposed format specification:
>
> Version 0 (monolithic):
>
> Some sort of metadata doctype telling us that this is a search index of
> type 0.
> Number of keywords (int)
> Number of URIs (int)
> URIs (see below)
> Keywords (see below)
>
> 1 URI:
> <URI> (string, prefixed with length)
> <optional metadata> (2 bytes: number of fields; for each field: 4 bytes
> field type, 2 bytes field length, then the field content)
>
> 1 keyword:
> <keyword> - string (prefixed with length, has a maximum length)
> Number of entries (int)
> Entries (in priority order)
>
> 1 entry:
> <URI ID> - int, refers to a URI
> <number of references>
> <reference word numbers> - integers: the positions of the word in the
> text (all HTML is stripped first); these are used for adjacent-words
> searches
>
> Version 1 (non-monolithic):
>
> Search manifest file:
> Metadata indicating that this is a search manifest.
> Base URI.
> Number of letters of the search term to use.
> Optional DBR details (so we can combine the search manifest with the
> DBR manifest).
>
> So:
> We want to search for "freenet censorship".
> We fetch the manifest file (well, we already have it... and we have 4 of
> them).
> The manifest file says:
> - DBR: offset 0, period 1 day
> - Use 2 base letters
> - Base URI is SSK@xlxlxlxlx,abc/joesearch-
>
> We therefore construct the URI to fetch:
> SSK@xlxlxlxlx,abc/joesearch-<today>-<first two letters>
> = SSK@xlxlxlxlx,abc/joesearch-200612120000-fr
>
> We fetch that index file successfully. Between "freedom" and "french",
> we find 5000 entries referring to "freenet". We do the same thing for
> "censorship". Then we simply cross-reference the entries for the two:
> first we find all the URIs which occur in both sets, then we check each
> to see whether the words are adjacent. Those in which the words are
> adjacent we return, in order of their average priority in the two
> keyword blocks.
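The cross-referencing step just described is straightforward to express in code. Below is a minimal Java sketch, assuming the entries have already been parsed out of two version-0 keyword blocks; the class and method names are made up for illustration and are not existing Freenet API.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the version-0 cross-referencing step.
final class SearchSketch {

    // One entry in a keyword block: a URI ID plus the sorted word
    // positions at which the keyword occurs in the stripped text.
    static final class Entry {
        final int uriId;
        final int[] wordNumbers;
        Entry(int uriId, int[] wordNumbers) {
            this.uriId = uriId;
            this.wordNumbers = wordNumbers;
        }
    }

    // Intersects the entry lists for two keywords, keeps only URIs in
    // which the second word directly follows the first, and orders the
    // results by average priority (an entry's position in its block).
    static List<Integer> adjacentMatches(List<Entry> first, List<Entry> second) {
        Map<Integer, Entry> firstByUri = new HashMap<>();
        Map<Integer, Integer> firstPriority = new HashMap<>();
        for (int i = 0; i < first.size(); i++) {
            Entry e = first.get(i);
            firstByUri.put(e.uriId, e);
            firstPriority.put(e.uriId, i);
        }

        // URIs present in both sets with adjacent words, tagged with the
        // sum of the two priorities for sorting.
        List<int[]> hits = new ArrayList<>();
        for (int j = 0; j < second.size(); j++) {
            Entry b = second.get(j);
            Entry a = firstByUri.get(b.uriId);
            if (a != null && adjacent(a.wordNumbers, b.wordNumbers)) {
                hits.add(new int[] { b.uriId, firstPriority.get(b.uriId) + j });
            }
        }
        hits.sort((x, y) -> Integer.compare(x[1], y[1]));

        List<Integer> result = new ArrayList<>();
        for (int[] hit : hits) result.add(hit[0]);
        return result;
    }

    // True if some occurrence of the second word immediately follows an
    // occurrence of the first. Both arrays are assumed sorted.
    static boolean adjacent(int[] firstPositions, int[] secondPositions) {
        int i = 0, j = 0;
        while (i < firstPositions.length && j < secondPositions.length) {
            int want = firstPositions[i] + 1;
            if (secondPositions[j] == want) return true;
            if (secondPositions[j] < want) j++; else i++;
        }
        return false;
    }
}

Sorting by the sum of the two priorities gives the same ranking as sorting by their average, so the sketch skips the division.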
> (Priority is simply encoded by the sequence of the entries in the
> keyword block; naive implementations can just put them in random order,
> while more sophisticated spiders can do clever things like PageRank.)
>
> Now, how long will this take to implement? I estimate, reasonably
> conservatively:
> - 2 days to implement and debug the client interface (assuming we have
>   fproxy etc.).
> - 2 days to code an FCP API to insert search metadata.
> - 2 days to write a spider with support for creating such metadata.
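The version-1 key construction in the walkthrough is just a matter of rounding the current time down to the start of the DBR (date-based redirect) slot and appending the first letters of the search term. Here is a minimal Java sketch, assuming the manifest fields quoted above; the class, its field names, and the exact timestamp format are illustrative guesses rather than a fixed spec.

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical sketch of version-1 index lookup: turn a search manifest
// plus a search term into the SSK URI to fetch.
final class SearchManifest {
    final String baseUri;   // e.g. "SSK@xlxlxlxlx,abc/joesearch-"
    final int baseLetters;  // how many leading letters select an index file
    final long dbrOffsetMs; // DBR offset from midnight
    final long dbrPeriodMs; // DBR period, e.g. one day

    SearchManifest(String baseUri, int baseLetters, long dbrOffsetMs, long dbrPeriodMs) {
        this.baseUri = baseUri;
        this.baseLetters = baseLetters;
        this.dbrOffsetMs = dbrOffsetMs;
        this.dbrPeriodMs = dbrPeriodMs;
    }

    // Builds e.g. "SSK@xlxlxlxlx,abc/joesearch-200612120000-fr".
    String uriFor(String term, long nowMs) {
        // Round the current time down to the start of the current DBR slot.
        long slotStart = ((nowMs - dbrOffsetMs) / dbrPeriodMs) * dbrPeriodMs + dbrOffsetMs;
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmm", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        String lower = term.toLowerCase(Locale.ROOT);
        String prefix = lower.substring(0, Math.min(baseLetters, lower.length()));
        return baseUri + fmt.format(new Date(slotStart)) + "-" + prefix;
    }
}

With the example manifest (offset 0, period 1 day, 2 base letters), uriFor("freenet", now) and uriFor("censorship", now) name the "fr" and "ce" index files for today, which are then parsed and cross-referenced as sketched earlier.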
Having used freenet for years, I think this is a very good feature to add.

Ed Tomlinson
