A collection of questions that are going through my mind about searching
Freenet. (Yes. Again. Sorry. ;-)
UPDATED:
(which turned into a proposal whilst I was thinking. heh)
#1 Is there a good reason not to break search strings up into their component
keywords?
Ex: "aphex twin pulsewidth" would be broken into three keywords. Any result
returned by a node should be the AND (intersection) of the results for all
of these keywords.
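A minimal sketch of what I mean (Python; the names are mine, purely
illustrative):

    def split_search(phrase):
        # Break a search phrase into its component keywords.
        return phrase.lower().split()

    def and_results(per_keyword_results):
        # A node should only return entries that matched *every* keyword.
        sets = [set(r) for r in per_keyword_results]
        return set.intersection(*sets) if sets else set()

    # split_search("aphex twin pulsewidth") -> ["aphex", "twin", "pulsewidth"]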
#2 Is there really a difference between using the SHA1 hash of "aphex" and
Ian's regexp-based closeness comparison? As long as we apply *some*
standard for determining closeness, does the standard really matter?
Granted, the message wouldn't be routed by _logical_ closeness, but by the
seemingly random closeness of cryptographic keys. That is how Freenet
routes data anyway, isn't it? By the arbitrary metric of the closeness of
SHA1 hashes of the data item's Freenet key?
It seems to me that it would actually be desirable *not* to group keys by
their logical/lexical closeness: grouping them that way would wipe out an
entire class of keys if a node went down. Hashing also has the added
benefit of hiding the content of a node's keys from its owner.
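Here is the sort of arbitrary-but-consistent metric I mean (plain numeric
difference of the hashes is just my assumption; any fixed metric would do):

    import hashlib

    def sha1_int(s):
        # Interpret the SHA1 hash of a string as one big integer.
        return int(hashlib.sha1(s.encode()).hexdigest(), 16)

    def closeness(a, b):
        # Arbitrary metric: numeric distance between the two hashes.
        # Which metric we pick hardly matters, only that everyone uses it.
        return abs(sha1_int(a) - sha1_int(b))

Routing "aphex" toward whichever node key minimizes this distance scatters
lexically similar keywords across unrelated nodes, which is exactly the
property described above.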
#3 Metadata. Icky. I'm prolly going to be burned at the stake for bringing
this one up :-) Oh well.
First I'll state the obvious: we need it to cull searches to a reasonable
point. Transmitting a huge number of search results over Freenet is not a
terribly wonderful idea, IMHO. Searching Freenet with keywords like "txt"
or "mp3" is even dumber; the sheer number of results that could return is
frightening. The answer that occurs to me is to allow metadata within the
stored search entries. (Yes, I know we don't have stored search entries.
But I really want them. *G* More on that in a bit.)
For the record, I see the need for both metadata and MIME-like headers.
But as far as searching goes, metadata seems to be the way to go. We need
metadata with arbitrary headers, but with a good, agreed-upon standard.
Like some wise fellow said earlier, if the web had started with good
metadata and everyone *used* it, searching would be a lot easier.
Let us continue my example.
Some fellow (or lady, of course; don't want to exclude anyone) has
inserted an mp3 into Freenet. Shame on them. They entered it with the key:
mp3/aphex twin/pulsewidth
And being a good little Freenetizen, they abided by Ian's recommendation
for naming their key, and also added some search entries along with it, as
follows. (Note that I really don't care what format these are in; XML is
not a holy cause for me. To be honest, I kinda like email-style headers.)
Keywords: aphex twin pulsewidth
Content-Type: audio/mpeg
Artist: Aphex Twin
Title: Pulsewidth
Bitrate: 128k
Inserted-By: Joe Luser <joe at luser.net>
Public-Key: 1234 5678 9012 3456
...
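One nice thing about email-style headers is that parsing them is trivial; a
sketch, assuming one "Header: value" pair per line as above:

    def parse_entry(text):
        # Turn email-style "Header: value" lines into a metadata dict.
        meta = {}
        for line in text.splitlines():
            if ":" in line:
                name, _, value = line.partition(":")
                meta[name.strip()] = value.strip()
        return meta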
Now culling searches becomes a lot nicer. Let us search for the keywords
"aphex" and "twin" and "pulsewidth" and add a note in the search message
to only return results with Content-Type == audio/mpeg (or even audio/*).
Also, we only need to generate three different searches, one per keyword.
As some wise soul suggested earlier, we really need to include the other
keywords being searched for within each of the broken-up searches, but
route each search by its single keyword only.
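In other words, something like this (the field names are my own invention,
not a proposed wire format):

    def build_searches(phrase, constraints, htl=20):
        # One search message per keyword; each carries the others along.
        keywords = phrase.lower().split()
        return [
            {
                "primary": kw,                # routed by this keyword only
                "secondary": [k for k in keywords if k != kw],
                "constraints": constraints,   # e.g. {"Content-Type": "audio/*"}
                "htl": htl,
            }
            for kw in keywords
        ]

    # build_searches("aphex twin pulsewidth", {"Content-Type": "audio/mpeg"})
    # yields three messages, each routed by a single keyword.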
Let us assume that each node contains a hashtable, indexed by the SHA1
hash of the keyword, whose entries hold the metadata and the associated
key. My only problem with this is that the search index should also be
encrypted. My call is to index the table by the second SHA1 hash of the
keyword (the SHA1 hash of the SHA1 hash of the keyword), and to encrypt
the stored data with the first hash of the keyword. The keyword should
accompany the search message in plaintext so that the node can decrypt and
read the metadata, then forget the key.
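To make that concrete, a sketch. The "cipher" here is a stand-in I made up
(a SHA1-counter keystream) just to show the shape of the scheme; a real
implementation would want a proper cipher keyed by the first hash:

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha1(data).digest()

    def keystream_xor(key: bytes, data: bytes) -> bytes:
        # Toy stream cipher: XOR with SHA1(key || counter) blocks.
        # Stand-in only; use a real cipher in practice.
        out = bytearray()
        counter = 0
        while len(out) < len(data):
            out += h(key + counter.to_bytes(4, "big"))
            counter += 1
        return bytes(a ^ b for a, b in zip(data, out))

    def store(index, keyword: str, metadata: bytes):
        first = h(keyword.encode())    # encryption key
        second = h(first)              # table index
        index[second] = keystream_xor(first, metadata)

    def lookup(index, keyword: str) -> bytes:
        # The keyword arrives in plaintext with the search message, so
        # the node can derive the key, decrypt, and then forget it.
        first = h(keyword.encode())
        return keystream_xor(first, index[h(first)])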
This single broken-off keyword per search (the primary) is what is used to
look up information in a node's hashtable. (Yes, I have a sick fixation on
hashtables.) Then the other keywords sent along with the search (the
secondaries) are compared against the only mandatory entry in the
metadata, which is "Keywords". If all of the secondary keywords can be
found within this field, the entry is added to the list of results;
otherwise it is ignored.
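Spelled out, with the index simplified to map a keyword straight to its
(key, metadata) pairs (decryption already done, as above):

    def match_secondaries(meta, secondary):
        # The only mandatory field is "Keywords"; every secondary keyword
        # must appear in it, or the entry is ignored.
        kws = meta.get("Keywords", "").lower().split()
        return all(kw in kws for kw in secondary)

    def search_node(search_index, primary, secondary):
        # Look up by the primary keyword, then filter by the secondaries.
        return [(key, meta)
                for key, meta in search_index.get(primary, [])
                if match_secondaries(meta, secondary)]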
Back to my example.
Routing considerations will be ignored for the moment. They *will* be
talked about in a bit, however, so don't give up yet :-) All we are
worried about in this example is the behaviour of a single node as it
looks for search results.
Let us assume that a given node received "aphex" as its primary keyword
and "twin" and "pulsewidth" as secondary keywords. The node takes a peek
into its hashtable of search data (looking up by the double hash from #3):

    preculling_result_list = search_index[sha1(sha1("aphex"))];
We zip through the list, culling useless entries based on two criteria:
#1 - We check the "Keywords" field.
#2 - We check user-specified fields (possibly with more than just
equality tests).
Based on the first criterion, we strip out Aphex's various other songs,
and based on the second, we are able to strip out other, unwanted media
types.
We are left with a handful of results, most of which will be duplicate
entries with different keys. Certainly the culling will not be this
effective on other document types, or for more vague searches
("manifesto", for example), but it will help cut down the number of
responses.
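For completeness, the second criterion might look like this (the wildcard
semantics are my own suggestion):

    def match_constraints(meta, constraints):
        # Criterion #2: user-specified field checks,
        # e.g. Content-Type == audio/mpeg.
        for field, wanted in constraints.items():
            value = meta.get(field, "")
            if wanted.endswith("/*"):
                # "audio/*" matches on the major type alone.
                if not value.startswith(wanted[:-1]):
                    return False
            elif value != wanted:
                return False
        return True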
#4 I propose that we disallow common keywords (e.g. asf, mp3, txt, text,
etc.) by stripping them at the node level. This will prevent absurd
searches from being carried out. Content-type culling and searches should
happen at the metadata level instead.
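Something as dumb as this, run on every incoming search (the exact list is
whatever we agree on):

    COMMON_KEYWORDS = {"asf", "mp3", "txt", "text"}

    def strip_common(keywords):
        # Content-type-ish keywords get dropped at the node level;
        # that kind of culling belongs in the metadata constraints.
        return [kw for kw in keywords if kw not in COMMON_KEYWORDS]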
#5 A quick note: search results should be cached like any other Freenet
data.
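i.e., roughly this on each node along the return path (a sketch; the size
bound and eviction policy are my own arbitrary choices):

    def cache_results(search_cache, primary, results, max_entries=1000):
        # Cache returned (key, metadata) pairs under the keyword, just
        # as data is cached along a normal request path.
        cached = search_cache.setdefault(primary, [])
        for entry in results:
            if entry not in cached:
                cached.append(entry)
        del cached[:-max_entries]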
#6 Routing. Dreaded routing. (Why my last proposal was shot down :-)
After reading Brandon's proposal, "Why Broadcast Routing Will Work", I
started to think again. (Dangerous, I know.) Why not? His proposal seems
quite reasonable. By, in effect, forcing searches to play "nice" (to use a
UNIX-ish term), we avoid a lot of the stress on the network that broadcast
searches would cause. Combined with short and independent queues that just
drop floods, and carefully chosen bandwidth limitations, I believe it
could work just fine.
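The "short and independent queues" bit is easy to sketch; drop-on-full is
the whole trick (the size is my guess at what "nice" means here):

    from collections import deque

    class SearchQueue:
        # A short, per-link queue that silently drops floods.
        def __init__(self, maxlen=8):
            self.q = deque()
            self.maxlen = maxlen

        def offer(self, msg):
            if len(self.q) >= self.maxlen:
                return False          # full: drop the search on the floor
            self.q.append(msg)
            return True

        def poll(self):
            return self.q.popleft() if self.q else None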
Also: why must we give up smart routing? Although Brandon argues that we
cannot route searches, why not? Take a look at my points #1 and #2. It
seems to me that under those conditions, individual keywords could be
routed without a problem. They would be *exactly* like normal Freenet data
in their propagation. So if I can find a given Freenet key within a
reasonable HTL, I should logically (given that keywords are simply a
special case of a Freenet key) be able to find the metadata for a given
keyword in the same way.
So: we break a search into multiple searches (one per keyword) and route
them as we would normal Freenet keys. They are smart-routed until their
HTL expires, and are not executed in parallel. The user stops sending out
permutations of their search phrase once they feel they have enough
results. Each node returns a list of keys along with whatever metadata is
associated, and these entries are cached along the return path just like
normal Freenet data.
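Putting the pieces together, the per-keyword routing loop looks just like
a normal data request. closeness() is the metric from #2; forward() and
local_results() are hypothetical stand-ins for the network send and the
node-local lookup:

    def route_search(msg, neighbor_keys, local_results, forward):
        # msg is one of the per-keyword messages from build_searches().
        results = local_results(msg["primary"], msg["secondary"])
        msg["htl"] -= 1
        if msg["htl"] > 0 and neighbor_keys:
            # Smart routing: pick the neighbor whose key hashes closest
            # to the keyword, exactly as we would for a data request.
            best = min(neighbor_keys,
                       key=lambda n: closeness(msg["primary"], n))
            results = results + forward(best, msg)
        return results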
Well, there you have it. Long as hell. Boring to read. Fun to write though
:-) I think it could work, and I don't see anything glaringly wrong with
it given the debate I've read so far. But poke lots of holes in it and
I'll try to scramble to fix it. Have fun.
Regards,
Adam Lydick
--
Freenet -- Re-Wiring the Internet
http://freenet.sourceforge.net
My Node: tcp/rivendell.yi.org:19114