0.7 will not have streams. However, it could have search support. It
would be straightforward (a few days' work) to write client code which
downloads published indexes from a series of known spiders and lets
the user search them. This would not scale if every index had to be
prefetched, but for 0.7.0 it doesn't need to. If we drop the requirement
to download all the indexes in advance and instead fetch them on demand
(only needed once an index is large, which it won't be to start with),
searches get significantly slower but remain possible.
Hopefully it won't be too bad in terms of speed, and even if it is
really slow, it doesn't have to be too painful if we take this into
account in the interface design: searches can run in the background,
just like downloads.
Advantages:
- Every user sooner or later asks "why isn't freenet searchable?".
- Most users who know why would still like it to be searchable.
- Scales better than index sites. Spiders with something resembling
Google PageRank can run entirely automated.
- Easier to use, and more familiar.
- We don't have to provide initial links. The user expects that if they
type in "dolphin porn", they'll get what they asked for, just like
they do on Google or Kazaa. We might want to provide some anyway, for
0.7.0, but it'd be a step in the right direction.
One obvious disadvantage is that users will search for something, won't
find it, and will assume freenet is crap. :)
Proposed format specification:
Version 0 (monolithic):
Some sort of metadata doctype telling us that this is a search index of
type 0.
Number of keywords (int)
Number of URIs (int)
URIs (see below)
Keywords (see below)
1 URI:
<URI> (string, prefixed with length)
<optional metadata> (2 bytes number of fields, for each field: 4 bytes
field type, 2 bytes field length, field content)
1 keyword:
<keyword> - string (prefixed with length, has a maximum length)
Number of entries (int)
Entries (in priority order)
1 Entry:
<URI ID> - int, refers to a URI
<number of references>
<reference word numbers> - integers; the word number in the text (all
HTML is stripped first). This is used for adjacent-words searches.
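To make the Version 0 layout concrete, here is a rough sketch of an encoder and decoder for it. The spec above leaves several widths open, so these are assumptions rather than part of the proposal: every "int" is a 4-byte big-endian unsigned value, string length prefixes are 2 bytes, strings are UTF-8, the doctype header is left out, and the names UriRecord, Entry, encode_index and decode_index are made up for illustration.

```python
import struct
from dataclasses import dataclass, field


@dataclass
class UriRecord:
    uri: str
    # Optional metadata as (field_type, content_bytes) pairs.
    fields: list = field(default_factory=list)


@dataclass
class Entry:
    uri_id: int      # index into the URI table
    positions: list  # word numbers of each occurrence in the stripped text


def _pack_str(s):
    # Assumption: 2-byte length prefix, UTF-8 payload.
    raw = s.encode("utf-8")
    return struct.pack(">H", len(raw)) + raw


def encode_index(uris, keywords):
    """uris: list of UriRecord; keywords: dict word -> list of Entry,
    already in priority order."""
    out = [struct.pack(">II", len(keywords), len(uris))]
    for rec in uris:
        out.append(_pack_str(rec.uri))
        out.append(struct.pack(">H", len(rec.fields)))
        for ftype, content in rec.fields:
            # 4 bytes field type, 2 bytes field length, then the content.
            out.append(struct.pack(">IH", ftype, len(content)) + content)
    for word, entries in keywords.items():
        out.append(_pack_str(word))
        out.append(struct.pack(">I", len(entries)))
        for e in entries:
            out.append(struct.pack(">II", e.uri_id, len(e.positions)))
            out.append(struct.pack(">%dI" % len(e.positions), *e.positions))
    return b"".join(out)


class _Reader:
    def __init__(self, data):
        self.data, self.pos = data, 0

    def unpack(self, fmt):
        vals = struct.unpack_from(fmt, self.data, self.pos)
        self.pos += struct.calcsize(fmt)
        return vals

    def take(self, n):
        chunk = self.data[self.pos:self.pos + n]
        self.pos += n
        return chunk

    def string(self):
        (n,) = self.unpack(">H")
        return self.take(n).decode("utf-8")


def decode_index(data):
    r = _Reader(data)
    n_keywords, n_uris = r.unpack(">II")
    uris = []
    for _ in range(n_uris):
        uri = r.string()
        (n_fields,) = r.unpack(">H")
        fields = []
        for _ in range(n_fields):
            ftype, flen = r.unpack(">IH")
            fields.append((ftype, r.take(flen)))
        uris.append(UriRecord(uri, fields))
    keywords = {}
    for _ in range(n_keywords):
        word = r.string()
        (n_entries,) = r.unpack(">I")
        entries = []
        for _ in range(n_entries):
            uri_id, n_refs = r.unpack(">II")
            entries.append(Entry(uri_id, list(r.unpack(">%dI" % n_refs))))
        keywords[word] = entries
    return uris, keywords
```

A spider would call encode_index once per published index; the client only needs decode_index. A real format would also want the keywords sorted, so that a client can find a term by binary search.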
Version 1 (non-monolithic):
Search manifest file:
Metadata indicating this is a search manifest.
Base URI.
Number of letters of search term to use.
Optional DBR details. (so we can combine the search manifest with the
DBR manifest).
So:
We want to search for "freenet censorship".
We fetch the manifest file (well, we already have it... and we have 4 of
them).
The manifest file says:
- DBR: offset 0, period 1 day
- Use 2 base letters
- Base URI is SSK@xlxlxlxlx,abc/joesearch-
We therefore construct the URI to fetch:
SSK@xlxlxlxlx,abc/joesearch-<today>-<first two letters>
= SSK@xlxlxlxlx,abc/joesearch-200612120000-fr
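The URI construction could look roughly like the following. This is only a sketch: the YYYYMMDDHHMM date stamp is inferred from the example above, times are assumed to be UTC, the DBR offset and period are assumed to be in seconds, and dbr_slot_start and index_uri are hypothetical names.

```python
from datetime import datetime, timezone


def dbr_slot_start(now, offset_s, period_s):
    """Start (UTC) of the date-based-redirect period that contains `now`."""
    t = int(now.timestamp())
    start = (t - offset_s) // period_s * period_s + offset_s
    return datetime.fromtimestamp(start, tz=timezone.utc)


def index_uri(base_uri, term, letters, now, offset_s=0, period_s=86400):
    """Build <base><period start>-<first `letters` letters of the term>."""
    prefix = term.lower()[:letters]
    stamp = dbr_slot_start(now, offset_s, period_s).strftime("%Y%m%d%H%M")
    return f"{base_uri}{stamp}-{prefix}"


if __name__ == "__main__":
    now = datetime(2006, 12, 12, 15, 30, tzinfo=timezone.utc)
    for word in ("freenet", "censorship"):
        print(index_uri("SSK@xlxlxlxlx,abc/joesearch-", word, 2, now))
```

For a two-word search the client calls index_uri once per word, so "freenet censorship" means two fetches per spider (the "fr" and "ce" files).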
We fetch the index at that URI successfully. Between "freedom" and "french",
we find 5000 entries referring to freenet. We do the same thing for
"censorship". Then we simply cross reference the entries for the two.
First we find all the URIs which occur in both sets. Then we check each
to see whether the words are adjacent. We return those in which they
are, in order of their average priority in the two keyword blocks.
(Priority is simply encoded by the sequence of the entries in the keyword
block; naive implementations can just put them in randomly, more
sophisticated spiders can do clever things like PageRank).
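The cross-referencing step can be sketched as below. Assumptions: each keyword block has already been decoded into a list of (uri, positions) pairs in priority order, URI IDs have been resolved to URI strings (the two blocks may come from different files), and "adjacent" means the second word directly follows the first. The name adjacent_matches is hypothetical.

```python
def adjacent_matches(first_block, second_block):
    """Return URIs where a word from second_block directly follows a word
    from first_block, ordered by average rank in the two blocks.

    Each block is a list of (uri, positions) pairs in priority order.
    """
    # Remember each URI's rank and word positions in the first block.
    first = {uri: (rank, set(pos)) for rank, (uri, pos) in enumerate(first_block)}
    hits = []
    for rank_b, (uri, pos_b) in enumerate(second_block):
        if uri not in first:
            continue  # the URI must occur in both sets
        rank_a, pos_a = first[uri]
        # Adjacent if some occurrence of word 2 sits right after word 1.
        if any(p - 1 in pos_a for p in pos_b):
            hits.append(((rank_a + rank_b) / 2, uri))
    hits.sort()
    return [uri for _, uri in hits]


if __name__ == "__main__":
    freenet = [("uri-a", [3, 40]), ("uri-b", [7]), ("uri-c", [1])]
    censorship = [("uri-c", [2]), ("uri-a", [4]), ("uri-b", [20]), ("uri-d", [5])]
    print(adjacent_matches(freenet, censorship))
```

Here uri-b is dropped because the two words are not next to each other, and uri-d because it only appears in one block. Results from several spiders would need a further merge step, which this sketch does not attempt.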
Now, how long will this take to implement?
I estimate, reasonably conservatively:
- 2 days to implement and debug the client interface (assuming we have
fproxy etc).
- 2 days to code an FCP API to insert search metadata.
- 2 days to write a spider with support for creating such metadata.
--
Matthew J Toseland - toad at amphibian.dyndns.org
Freenet Project Official Codemonkey - http://freenetproject.org/
ICTHUS - Nothing is impossible. Our Boss says so.