Hi Adam,
thanks for writing this proposal. It was about time someone caught all the
search ideas floating around and put them into one nice long mail.
I have to add that I'm confident that keyword searches (with exact matching!)
are the way to go, because only this allows for smart routing of searches.
But I like your idea of additionally selecting matches using wildcard
metadata matching.
So now for my (hopefully constructive) comments:
> Also, we only need to generate three different searches. One for each
> keyword. As some wise soul suggested earlier, we really need to include
> the other keywords that are being searched for within each of the
> broken-up searches. But only route the searches with a single keyword.
I absolutely agree. Also see my little idea below.
>
> Let us assume that each node contains a hashtable, indexed by the
> SHA1 hash of the keyword. The hashtable contains the metadata and the
> associated key. My only problem with this is that our search engine
> should also be encrypted. My call is to use the second SHA1 hash of the
> keyword (the SHA1 hash of the SHA1 hash of the keyword) as an index in
> the table, and encrypt the data with the first hash of the keyword. The
> keyword should accompany the message in plaintext so that the node can
> read the encrypted Metadata before forgetting the key.
Actually, it would be enough to send the first-level hash with the message
(called KeyHash below); the node could still decrypt the metadata. You would
also route the message by the KeyHash. I think what you mean is something
like the following:
Format of the Hashtable:

  Index          Content
  SHA1(KeyHash)  crypt(MetaData, SHA1(KeyHash + nodeconst))

  KeyHash   := SHA1(plaintext keyword)
  nodeconst := a node-specific constant (I just slipped this idea in there)
Upon receiving the search message, the node would check whether it can find
SHA1(KeyHash) in its Hashtable; if so, it would calculate
SHA1(KeyHash + nodeconst), decrypt the metadata, check that everything
matches, and send the reply (probably including the metadata) back along
the chain.
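Just so we're sure we mean the same thing, here's a rough Java sketch of
that table (all names are hypothetical, and AES with a truncated key only
stands in for whatever cipher we end up using - SHA1 gives 20 bytes, AES
wants 16):

    import java.security.MessageDigest;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;
    import javax.crypto.Cipher;
    import javax.crypto.spec.SecretKeySpec;

    public class KeywordTable {
        private final byte[] nodeConst;  // the node-specific constant
        // Indexed by hex(SHA1(KeyHash)); a String key because byte[]
        // compares by identity, not by content.
        private final Map<String, byte[]> table = new HashMap<>();

        public KeywordTable(byte[] nodeConst) { this.nodeConst = nodeConst; }

        // Store: index by SHA1(KeyHash), encrypt the metadata with
        // SHA1(KeyHash + nodeconst).
        public void insert(byte[] keyHash, byte[] metaData) throws Exception {
            byte[] entryKey = sha1(keyHash, nodeConst);
            table.put(hex(sha1(keyHash)),
                      crypt(Cipher.ENCRYPT_MODE, entryKey, metaData));
        }

        // Lookup: the search message carries KeyHash in the clear, so we
        // can rederive the entry key, decrypt, and forget it again.
        public byte[] lookup(byte[] keyHash) throws Exception {
            byte[] entry = table.get(hex(sha1(keyHash)));
            if (entry == null) return null;            // not stored here
            return crypt(Cipher.DECRYPT_MODE, sha1(keyHash, nodeConst), entry);
        }

        static byte[] sha1(byte[]... parts) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            for (byte[] p : parts) md.update(p);
            return md.digest();
        }

        static byte[] crypt(int mode, byte[] key, byte[] data) throws Exception {
            SecretKeySpec k = new SecretKeySpec(Arrays.copyOf(key, 16), "AES");
            Cipher c = Cipher.getInstance("AES");
            c.init(mode, k);
            return c.doFinal(data);
        }

        static String hex(byte[] b) {
            StringBuilder sb = new StringBuilder();
            for (byte x : b) sb.append(String.format("%02x", x));
            return sb.toString();
        }
    }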
Now I see a problem with this:
What if there are multiple keywords? You can't encrypt by all of them,
because a search won't necessarily include all of them. Of course you can
argue that there's the "primary" keyword by which the search is routed
towards our node, but you can't really be sure you won't get a search
routed by a different "primary" keyword. So you need to be able to decrypt
the metadata given any single keyword. This can be done (in a way similar
to how GPG handles multiple recipients; see the sketch below) but it will
be more complex to implement.
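Roughly like this, in the same sketchy Java as above (names invented again,
and I've left nodeconst out of the key derivation to keep things short):
encrypt the metadata once under a random session key, then store that
session key encrypted under each keyword's hash, so any single keyword
unlocks the whole entry:

    import java.security.MessageDigest;
    import java.security.SecureRandom;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;
    import javax.crypto.Cipher;
    import javax.crypto.spec.SecretKeySpec;

    public class MultiKeywordEntry {
        final byte[] encryptedMetaData; // encrypted once, with a session key
        // One encrypted copy of the session key per keyword, like GPG
        // encrypting one session key to several recipients.
        final Map<String, byte[]> lockedSessionKeys = new HashMap<>();

        public MultiKeywordEntry(byte[] metaData, byte[][] keyHashes)
                throws Exception {
            byte[] sessionKey = new byte[16];
            new SecureRandom().nextBytes(sessionKey);
            encryptedMetaData = crypt(Cipher.ENCRYPT_MODE, sessionKey, metaData);
            for (byte[] kh : keyHashes)
                lockedSessionKeys.put(hex(sha1(kh)),
                        crypt(Cipher.ENCRYPT_MODE, kh, sessionKey));
        }

        // Any one of the keywords is enough to decrypt the metadata.
        public byte[] decrypt(byte[] keyHash) throws Exception {
            byte[] locked = lockedSessionKeys.get(hex(sha1(keyHash)));
            if (locked == null) return null;  // keyword not in this entry
            byte[] sessionKey = crypt(Cipher.DECRYPT_MODE, keyHash, locked);
            return crypt(Cipher.DECRYPT_MODE, sessionKey, encryptedMetaData);
        }

        static byte[] sha1(byte[] b) throws Exception {
            return MessageDigest.getInstance("SHA-1").digest(b);
        }

        static byte[] crypt(int mode, byte[] key, byte[] data) throws Exception {
            SecretKeySpec k = new SecretKeySpec(Arrays.copyOf(key, 16), "AES");
            Cipher c = Cipher.getInstance("AES");
            c.init(mode, k);
            return c.doFinal(data);
        }

        static String hex(byte[] b) {
            StringBuilder sb = new StringBuilder();
            for (byte x : b) sb.append(String.format("%02x", x));
            return sb.toString();
        }
    }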
Anyway, this is node-specific behaviour, so I guess we can always implement
the encryption part at a later stage.
> This single broken-off keyword (per search) (primary) should be what is
> used to look up information in a node's hashtable. (Yes, I have a sick
> fixation on hashtables.) Then, the other keywords (secondary) that were
> sent along with the search should be compared to the only mandatory entry
> in the metadata, which is "Keywords". If all of the keywords can be found
> within this field, the entry is added to the list of results. Otherwise
> it is ignored.
As I said, you can't really be sure which keyword is the "primary" by which
searches get routed towards you. Plus, there should be a more efficient way
than decrypting the whole metadata and only then figuring out that not all
keywords match. But since this is also just an implementation issue, do as
you like, just don't ask me to do the same :).
> We zip through the list culling useless entries based on the two
> criteria:
>
> #1 - We check the "Keyword" field
> #2 - We check user-specified fields (Possibly add more than just
>      equality)
>
> Based on the first criterion, we strip out Aphex's various other songs,
> and based on the second, we are able to strip out other, unwanted, media
> types.
Agreed. #2 could be some kind of regex that has to match the metadata.
But that means it will have to travel in plaintext in the message.
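Something along these lines, if java.util.regex is acceptable (the field
contents are invented):

    import java.util.regex.Pattern;

    class MetaDataFilter {
        // Criterion #2: a user-supplied pattern that the decrypted
        // metadata has to match. Since the node applies it, the pattern
        // travels in plaintext.
        static boolean matches(String metaData, String userPattern) {
            return Pattern.compile(userPattern).matcher(metaData).find();
        }
        // e.g. matches("Type: mp3\nArtist: Aphex Twin", "Type:\\s*mp3")
        // returns true.
    }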
> #4 I propose that we disallow common keywords (e.g. asf, mp3, txt, text,
> etc.) by stripping them at the node level. This will prevent absurd
> searches from being carried out. Content type culling / searches should
> be carried out at a metadata level.
Plus, have a MaxMatches field. Decrement it _before_ sending the message
on, by the number of matches you can provide yourself.
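In sketch form (the message shape and every name here are invented), the
stop-word stripping and the MaxMatches bookkeeping together:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Just enough message shape to show the bookkeeping.
    class SearchMessage {
        List<String> keywords;
        int maxMatches;
        int htl;
    }

    class SearchHandler {
        // Adam's point #4: strip common keywords at the node level.
        static final Set<String> STOP_WORDS =
                new HashSet<>(Arrays.asList("asf", "mp3", "txt", "text"));

        void handle(SearchMessage msg, int localMatches) {
            msg.keywords.removeIf(STOP_WORDS::contains);
            msg.maxMatches -= localMatches; // decrement _before_ forwarding
            if (msg.maxMatches > 0 && !msg.keywords.isEmpty() && msg.htl-- > 0)
                forward(msg);
        }

        void forward(SearchMessage msg) { /* route by the primary KeyHash */ }
    }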
> #5 A quick note. Search results should be cached like any other Freenet
> data.
Yep!
> Also. Why must we give up smart routing? Although Brandon argues that
> we cannot route searches, why not? Take a look at my points #1 and #2.
> It seems to me that under those conditions, individual keywords could be
> routed without a problem. They would be *exactly* like normal freenet
> data in their propagation. So if I can find a given freenet key in a
> reasonable HTL, I should logically (given that keywords are simply a
> special case of a Freenet Key) be able to find the metadata for a given
> keyword in the same way.
Absolutely. This is the way to go. You just can't have partial matches. But
with keywords, you don't need them. You just search for some of the keywords.
>
> So. We break a search into multiple searches (one per keyword) and
> route them as we would normal Freenet Keys. They are smart-routed until
> their HTL expires, and are not executed in parallel. The user stops
> sending out permutations of their search phrase once they feel like they
> have enough results. Each node returns a list of keys along with
> whatever metadata is associated. These entries are cached along the
> return path just like normal Freenet data.
>
Again, I agree.
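The splitting itself could look about like this (hypothetical Java, not a
definitive design):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.ArrayList;
    import java.util.List;

    // One search message per keyword: each copy carries the hashes of
    // *all* keywords (for filtering at the target node) but is routed
    // by just one of them.
    class KeywordSearch {
        final byte[] routingHash;     // the "primary" keyword of this copy
        final List<byte[]> allHashes; // every keyword, for filtering
        final int htl;

        KeywordSearch(byte[] routingHash, List<byte[]> allHashes, int htl) {
            this.routingHash = routingHash;
            this.allHashes = allHashes;
            this.htl = htl;
        }

        static List<KeywordSearch> split(List<String> keywords, int htl)
                throws Exception {
            List<byte[]> hashes = new ArrayList<>();
            for (String kw : keywords)
                hashes.add(MessageDigest.getInstance("SHA-1")
                        .digest(kw.getBytes(StandardCharsets.UTF_8)));
            List<KeywordSearch> messages = new ArrayList<>();
            for (byte[] h : hashes)
                messages.add(new KeywordSearch(h, hashes, htl));
            return messages;
        }
    }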
However, I think it might be worth considering to allow a "broad" search -
that is, splitting up the search at every node, not just the first one. So
if we search for three keywords, every node will divide HTL and MaxMatches
by three and send the message on to three of its neighbours (the best guess
for each of the keywords); see the sketch below.
Just an idea, and IMHO not really necessary - since we use exact matching
on keywords, three "deep" search chains should also work.
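Still, a sketch of what I mean (every name invented, the routing-table
lookup left as a stub):

    import java.util.List;

    class BroadSearch {
        // At every hop, divide HTL and MaxMatches by the number of
        // keywords and send one copy towards the best guess for each.
        static void fanOut(List<byte[]> keyHashes, int htl, int maxMatches) {
            int n = keyHashes.size();
            if (n == 0 || htl / n == 0) return;  // budget exhausted
            for (byte[] kh : keyHashes)
                sendTo(bestNeighbourFor(kh), kh, htl / n, maxMatches / n);
        }

        static Object bestNeighbourFor(byte[] keyHash) {
            return null;  // stub: consult the routing table
        }

        static void sendTo(Object neighbour, byte[] keyHash,
                           int htl, int maxMatches) {
            // stub: hand the divided budget on to the neighbour
        }
    }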
> I think that it could work, and I don't see anything glaringly wrong
> with it given the debate that I've read so far. But poke lots of holes
> in it and I'll try and scramble to fix it. Have fun.
Nothing wrong at all. The idea is great. It's mainly on the implementation
side that my opinion differs a little.
Bye,
Philipp
_______________________________________________
Freenet-dev mailing list
Freenet-dev at lists.sourceforge.net
http://lists.sourceforge.net/mailman/listinfo/freenet-dev