On Monday 14 November 2005 09:02, Matthew Toseland wrote:
> 0.7 will not have streams. However, it could have searching support. It
> would be straightforward - a few days' work - to write client code which
> downloads published indexes from a series of known spiders, and allows
> the user to search the indexes. Now, this would not scale if they all
> have to be prefetched. But for 0.7.0, it doesn't need to. If we abandon
> the requirement to download all the indexes in advance, and download
> them on demand (if the index is large, which it won't be to start with),
> although things slow down significantly, it will still be possible.
> Hopefully it won't be too bad in terms of speed, and even if it is
> really slow, if we take this into account in interface design it doesn't
> have to be too painful; searches can run in the background, just like
> downloads.
> 
> Advantages:
> - Every user sooner or later asks "why isn't freenet searchable?".
> - Most users who know why would still like it to be searchable.
> - Scales better than index sites. Spiders with something resembling
>   Google PageRank can run entirely automated.
> - Easier to use, and more familiar.
> - We don't have to provide initial links. The user expects that if they
>   type in "dolphin porn", they'll get what they asked for, just like
>   they do on Google or Kazaa. We might want to provide some anyway, for
>   0.7.0, but it'd be a step in the right direction.
> 
> One obvious disadvantage is that users will search for something, won't
> find it, and will assume freenet is crap. :)
> 
> Proposed format specification:
> 
> Version 0 (monolithic):
> 
> Some sort of metadata doctype telling us that this is a search index of
>    type 0.
> Number of keywords (int)
> Number of URIs (int)
> URIs (see below)
> Keywords (see below)
> 
> 1 URI:
> <URI> (string, prefixed with length)
> <optional metadata> (2 bytes number of fields, for each field: 4 bytes
>            field type, 2 bytes field length, field content)
> 
> 1 keyword:
> <keyword> - string (prefixed with length, has a maximum length)
> Number of entries (int)
> Entries (in priority order)
> 
> 1 Entry:
> <URI ID> - int, refers to a URI
> <number of references>
> <reference word numbers> - the word number in the text (all HTML is
>      stripped first); this is used for adjacent-words searches.
>      (integer)
> 
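
To make the Version 0 layout concrete, here is a rough Python sketch of a writer and reader for it. The proposal does not fix byte order, integer widths, string encoding, or the width of the string length prefix, so everything below on those points is my assumption: big-endian, 4-byte ints, UTF-8 strings with a 2-byte length prefix. The doctype header is left out, and the function names are made up for illustration.

```python
import struct

def _pack_str(s):
    # Assumed: UTF-8 string with a 2-byte big-endian length prefix.
    data = s.encode("utf-8")
    return struct.pack(">H", len(data)) + data

def _unpack_str(buf, pos):
    (n,) = struct.unpack_from(">H", buf, pos)
    pos += 2
    return buf[pos:pos + n].decode("utf-8"), pos + n

def encode_index(uris, keywords):
    """uris: list of URI strings (this sketch writes no optional metadata).
    keywords: dict of keyword -> [(uri_id, [word numbers])] in priority order."""
    out = [struct.pack(">II", len(keywords), len(uris))]
    for uri in uris:
        out.append(_pack_str(uri))
        out.append(struct.pack(">H", 0))  # zero optional metadata fields
    for word, entries in keywords.items():
        out.append(_pack_str(word))
        out.append(struct.pack(">I", len(entries)))
        for uri_id, positions in entries:
            out.append(struct.pack(">II", uri_id, len(positions)))
            out.append(struct.pack(f">{len(positions)}I", *positions))
    return b"".join(out)

def decode_index(buf):
    n_keywords, n_uris = struct.unpack_from(">II", buf, 0)
    pos = 8
    uris = []
    for _ in range(n_uris):
        uri, pos = _unpack_str(buf, pos)
        (n_fields,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        for _ in range(n_fields):
            # Skip each field: 4-byte type, 2-byte length, then the content.
            _ftype, flen = struct.unpack_from(">IH", buf, pos)
            pos += 6 + flen
        uris.append(uri)
    keywords = {}
    for _ in range(n_keywords):
        word, pos = _unpack_str(buf, pos)
        (n_entries,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        entries = []
        for _ in range(n_entries):
            uri_id, n_refs = struct.unpack_from(">II", buf, pos)
            pos += 8
            positions = list(struct.unpack_from(f">{n_refs}I", buf, pos))
            pos += 4 * n_refs
            entries.append((uri_id, positions))
        keywords[word] = entries
    return uris, keywords
```

A round trip such as decode_index(encode_index(uris, keywords)) should give back the same URIs and keyword entries. A real client would also want to enforce the maximum keyword length mentioned above.
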
> Version 1 (non-monolithic):
> 
> Search manifest file:
> Metadata indicating this is a search manifest.
> Base URI.
> Number of letters of search term to use.
> Optional DBR details. (so we can combine the search manifest with the
>     DBR manifest).
> 
> 
> So:
> We want to search for "freenet censorship".
> We fetch the manifest file (well, we already have it... and we have 4 of
> them).
> The manifest file says:
> - DBR: offset 0, period 1 day
> - Use 2 base letters
> - Base URI is SSK@xlxlxlxlx,abc/joesearch-
> 
> We therefore construct the URI to fetch:
> SSK@xlxlxlxlx,abc/joesearch-<today>-<first two letters>
> = SSK@xlxlxlxlx,abc/joesearch-200612120000-fr
> 
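
The URI construction above could be sketched like this in Python. The date stamp format (YYYYMMDDhhmm, in UTC) is only inferred from the example, and the way the DBR offset and period round the time down is my guess at the intent; build_index_uri and its parameters are made-up names, not an existing API.

```python
from datetime import datetime, timezone

def build_index_uri(base_uri, term, n_letters, when, offset_s=0, period_s=86400):
    """Build the key to fetch for one search term.

    Assumptions: the DBR slot is the start of the current period, counted
    in seconds from the Unix epoch plus offset_s, and it is written as
    YYYYMMDDhhmm in UTC.
    """
    epoch = int(when.timestamp())
    # Round down to the start of the current DBR period.
    slot = (epoch - offset_s) // period_s * period_s + offset_s
    stamp = datetime.fromtimestamp(slot, tz=timezone.utc).strftime("%Y%m%d%H%M")
    # "Use 2 base letters": take the first n_letters of the search term.
    prefix = term.lower()[:n_letters]
    return f"{base_uri}{stamp}-{prefix}"
```

With the manifest values from the example (offset 0, period 1 day, 2 base letters), any time on 12 December 2006 UTC and the term "freenet" give the key ending in joesearch-200612120000-fr.
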
> We fetch that index file successfully. Between "freedom" and "french",
> we find 5000 entries referring to freenet. We do the same thing for
> "censorship". Then we simply cross reference the entries for the two.
> First we find all the URIs which occur in both sets. Then we check each
> to find if the words are adjacent. Any in which the words are adjacent,
> we return, in order of their average priority in the two keyword blocks.
> (Priority is simply encoded by the sequence of the entries in the keyword
> block; naive implementations can just put them in randomly, more
> sophisticated spiders can do clever things like PageRank).
> 
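
The cross-referencing step might look roughly like the following Python sketch. It assumes each keyword block has already been parsed into a list of (URI ID, word numbers) pairs in priority order, and it reads "adjacent" as "the second word comes immediately after the first"; both are my assumptions, and search_adjacent is a made-up name.

```python
def search_adjacent(first, second):
    """first, second: keyword blocks as [(uri_id, [word numbers])],
    listed in priority order (index 0 is the best entry)."""
    # Index the second block by URI so we can intersect in one pass.
    second_by_uri = {
        uri_id: (rank, set(positions))
        for rank, (uri_id, positions) in enumerate(second)
    }
    hits = []
    for rank1, (uri_id, positions) in enumerate(first):
        if uri_id not in second_by_uri:
            continue  # URI does not occur in both sets
        rank2, positions2 = second_by_uri[uri_id]
        # Adjacent: second word sits at the word number right after the first.
        if any(p + 1 in positions2 for p in positions):
            hits.append(((rank1 + rank2) / 2, uri_id))
    hits.sort()  # lowest average position in the two blocks comes first
    return [uri_id for _, uri_id in hits]
```

For a two-word query this does one pass over each block. Longer phrases could be handled by chaining the adjacency check word by word, but that is beyond what the proposal describes.
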
> 
> Now, how long will this take to implement?
> 
> I estimate, reasonably conservatively:
> - 2 days to implement and debug the client interface (assuming we have
>   fproxy etc).
> - 2 days to code an FCP API to insert search metadata.
> - 2 days to write a spider with support for creating such metadata.

Having used freenet for years, I think this is a very good feature to add.

Ed Tomlinson
