A collection of questions that are going through my mind about searching
Freenet. (Yes. Again. Sorry. ;-)

UPDATED:
(which turned into a proposal whilst I was thinking. heh)

#1 Is there a good reason not to break up search strings into their component
   keywords?

        Ex: "aphex twin pulsewidth" would be broken into three keywords. Any
        data that is found on a node should be the AND of the results of all
        these keywords.
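
        A minimal sketch of that behaviour, assuming each node exposes some
        search_index mapping (keyword -> set of keys); the mapping and the
        key strings here are hypothetical:

        def search(query, search_index):
            keywords = query.lower().split()
            sets = [search_index.get(kw, set()) for kw in keywords]
            # AND: keep only keys matched by every keyword.
            return set.intersection(*sets) if sets else set()

        index = {
            "aphex":      {"mp3/aphex twin/pulsewidth",
                           "mp3/aphex twin/xtal"},
            "twin":       {"mp3/aphex twin/pulsewidth",
                           "mp3/aphex twin/xtal"},
            "pulsewidth": {"mp3/aphex twin/pulsewidth"},
        }
        print(search("aphex twin pulsewidth", index))
        # -> {'mp3/aphex twin/pulsewidth'}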

#2 Is there really a difference between using the SHA1 hash of "aphex" and
        Ian's regexp-style closeness comparison? As long as we apply *some*
        standard for determining closeness, does the particular standard
        really matter? Granted, the message wouldn't be routed by its
        _logical_ closeness, but by the seemingly random closeness of
        cryptographic hashes. This is how Freenet routes data anyway, isn't
        it? By the arbitrary metric of the closeness of SHA1 hashes of the
        data item's Freenet key?

        It seems to me that it would actually be desirable *not* to group
        keys by their logical/lexical closeness, since that would wipe out an
        entire class of keys if a node went down. It would also provide the
        added benefit of hiding the content of a node's keys from its owner.
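
        To make the point concrete, here is a toy illustration; reading a
        SHA1 digest as one big number is my own stand-in, not necessarily
        Freenet's exact metric:

        import hashlib

        def keyword_hash(keyword):
            # SHA1 digest of the keyword, read as one big integer.
            return int.from_bytes(
                hashlib.sha1(keyword.encode()).digest(), "big")

        # Lexically similar keywords land nowhere near each other:
        print(hex(keyword_hash("aphex")))
        print(hex(keyword_hash("aphex twin")))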

#3 Metadata. Icky. I'm probably going to be burned at the stake for bringing
        this one up :-) Oh well.

        First I'll state the obvious: we need it to cull searches to a
        reasonable point. Transmitting a huge number of search results over
        Freenet is not a terribly wonderful idea, IMHO. Searching Freenet
        with keywords like "txt" or "mp3" is even worse; the sheer number of
        results that could return is frightening. The answer that occurs to
        me is to allow metadata within the stored search entries. (Yes, I
        know we don't have stored search entries. But I really want them. *g*
        More on that in a bit.)

        For the record, I see the need for both metadata and MIME-like
        headers, but as far as searching goes, metadata seems to be the way
        to go. We need metadata with arbitrary headers, but with a good,
        agreed-upon standard. As some wise fellow said earlier, if the web
        had started with good metadata and everyone had *used* it, searching
        would be a lot easier.

        Let us continue my example.

        Some fellow (or lady, of course; don't want to exclude anyone) has
        inserted an mp3 into Freenet. Shame on them. They entered it with the
        key:

                mp3/aphex twin/pulsewidth

        Being a good little Freenetizen, they abided by Ian's recommendation
        for naming their key, and also added some search entries along with
        it, as follows. (Note that I really don't care what format these are
        in; XML is not a holy cause for me. To be honest, I kind of like
        email-style headers.)

        Keywords: aphex twin pulsewidth
        Content-Type: audio/mpeg
        Artist: Aphex Twin
        Title: Pulsewidth
        Bitrate: 128k
        Inserted-By: Joe Luser <joe at luser.net>
        Public-Key: 1234 5678 9012 3456
           ...

        Now culling searches becomes a lot nicer. Let us search for the
        keywords "aphex" and "twin" and "pulsewidth", and add a note in the
        search message to only return results with Content-Type == audio/mpeg
        (or even audio/*).

        Also, we only need to generate three different searches, one for each
        keyword. As some wise soul suggested earlier, we really need to
        include the other keywords being searched for within each of the
        broken-up searches, but only route each search by its single keyword.
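
        A sketch of what generating those searches might look like; the
        message fields are hypothetical:

        def build_searches(query, constraints):
            keywords = query.lower().split()
            return [{"primary": kw,  # the keyword this search is routed by
                     "secondary": [k for k in keywords if k != kw],
                     "constraints": constraints}
                    for kw in keywords]

        for msg in build_searches("aphex twin pulsewidth",
                                  {"Content-Type": "audio/*"}):
            print(msg)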

        Let us assume that each node contains a hashtable, indexed by the
        SHA1 hash of the keyword. The hashtable contains the metadata and the
        associated key. My only problem with this is that our search index
        should also be encrypted. My call is to use the second SHA1 hash of
        the keyword (the SHA1 hash of the SHA1 hash of the keyword) as the
        index into the table, and to encrypt the data with the first hash of
        the keyword. The keyword should accompany the message in plaintext so
        that the node can read the encrypted metadata before forgetting the
        key.
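
        A sketch of the double-hash scheme follows. The XOR keystream is a
        toy stand-in for a real cipher; the indexing structure is the point:

        import hashlib

        def sha1(data):
            return hashlib.sha1(data).digest()

        def xor_cipher(key, data):
            # Toy SHA1-counter keystream; XOR is symmetric, so the same
            # call both encrypts and decrypts. Illustration only.
            stream = b""
            counter = 0
            while len(stream) < len(data):
                stream += sha1(key + counter.to_bytes(4, "big"))
                counter += 1
            return bytes(b ^ k for b, k in zip(data, stream))

        def store(table, keyword, metadata):
            first = sha1(keyword.encode())      # encryption key
            table[sha1(first)] = xor_cipher(first, metadata)

        def lookup(table, keyword):
            # The plaintext keyword arrives with the search message, so
            # the node can decrypt the entry before forgetting the key.
            first = sha1(keyword.encode())
            return xor_cipher(first, table[sha1(first)])

        table = {}
        store(table, "aphex", b"Keywords: aphex twin pulsewidth")
        print(lookup(table, "aphex").decode())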

        This single broken-off (primary) keyword, one per search, should be
        what is used to look up information in a node's hashtable. (Yes, I
        have a sick fixation on hashtables.) Then the other (secondary)
        keywords that were sent along with the search should be compared
        against the only mandatory entry in the metadata, which is
        "Keywords". If all of the keywords can be found within this field,
        the entry is added to the list of results; otherwise it is ignored.
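
        In other words, something like this, where node_index stands in for
        the decrypted hashtable entries (all names illustrative):

        def match_keywords(entry, secondary):
            # Every secondary keyword must appear in the mandatory
            # "Keywords" metadata field.
            kws = entry["Keywords"].lower().split()
            return all(kw in kws for kw in secondary)

        def handle_search(node_index, primary, secondary):
            candidates = node_index.get(primary, [])
            return [e for e in candidates if match_keywords(e, secondary)]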

        Back to my example.

        Routing considerations will be ignored for the moment. They *will* be
        talked about in a bit, however, so don't give up yet :-) All we are
        worried about in this example is the behaviour of a single node as it
        looks for search results.

        Let us assume that a given node received "aphex" as its primary
        keyword and "twin" and "pulsewidth" as secondary keywords. The node
        takes a peek into its hashtable of search data:

                preculling_result_list = search_index["aphex"];

        We zip through the list, culling useless entries based on two
        criteria:

        #1 - We check the "Keywords" field
        #2 - We check user-specified fields (possibly with more than just
             equality tests)

        Based on the first criterion, we strip out Aphex's various other
        songs, and based on the second, we are able to strip out other,
        unwanted media types. We are left with a handful of results, most of
        which will be duplicate entries with different keys.

        Certainly the culling will not be this effective on other document
        types, or for vaguer searches ("manifesto", for example), but it will
        help cut down the number of responses.
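
        A quick sketch of the second criterion with wildcard matching,
        applied to entries that already passed the Keywords check above (the
        entry layout mirrors the email-style headers):

        from fnmatch import fnmatch

        def match_constraints(entry, constraints):
            # A pattern like "audio/*" accepts audio/mpeg, audio/ogg, etc.
            return all(fnmatch(entry.get(field, ""), pattern)
                       for field, pattern in constraints.items())

        survivors = [
            {"Key": "mp3/aphex twin/pulsewidth",
             "Content-Type": "audio/mpeg"},
            {"Key": "txt/aphex twin/pulsewidth",
             "Content-Type": "text/plain"},
        ]
        print([e["Key"] for e in survivors
               if match_constraints(e, {"Content-Type": "audio/*"})])
        # -> ['mp3/aphex twin/pulsewidth']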

#4 I propose that we disallow common keywords (e.g. asf, mp3, txt, text) by
        stripping them at the node level. This will prevent absurd searches
        from being carried out. Content-type culling and searches should be
        handled at the metadata level instead.
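
        Something like this at the node level; the stopword list is just the
        examples above, and a real list would need to be agreed upon:

        STOPWORDS = {"asf", "mp3", "txt", "text"}

        def strip_common(keywords):
            return [kw for kw in keywords
                    if kw.lower() not in STOPWORDS]

        print(strip_common(["mp3", "aphex", "twin", "pulsewidth"]))
        # -> ['aphex', 'twin', 'pulsewidth']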

#5      A quick note: search results should be cached like any other Freenet
        data.

#6 Routing. Dreaded routing. (Why my last proposal was shot down :-)

        After reading Brandon's proposal for "Why Broadcast Routing Will
        Work", I started to think again. (Dangerous, I know.) Why not? His
        proposal seems quite reasonable. By, in effect, forcing searches to
        play "nice" (to use a UNIX-ish term), we avoid a lot of the stress on
        the network that broadcast searches would cause. Combined with short,
        independent queues to simply drop floods, and carefully chosen
        bandwidth limitations, I believe it could work just fine.

        Also: why must we give up smart routing? Although Brandon argues that
        we cannot route searches, why not? Take a look at my points #1 and
        #2. It seems to me that under those conditions, individual keywords
        could be routed without a problem. They would be *exactly* like
        normal Freenet data in their propagation. So if I can find a given
        Freenet key in a reasonable HTL, I should logically (given that
        keywords are simply a special case of a Freenet key) be able to find
        the metadata for a given keyword in the same way.

        So: we break a search into multiple searches (one per keyword) and
        route them as we would normal Freenet keys. They are smart-routed
        until their HTL expires, and are not executed in parallel. The user
        stops sending out permutations of their search phrase once they feel
        they have enough results. Each node returns a list of keys along with
        whatever metadata is associated, and these entries are cached along
        the return path just like normal Freenet data.
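
        The driving loop might look like this; route_search is a
        hypothetical stand-in for the node's normal smart-routing request
        path:

        def run_search(keywords, route_search, htl=10, enough=20):
            results = []
            for kw in keywords:             # sequential, never in parallel
                secondary = [k for k in keywords if k != kw]
                results.extend(route_search(kw, secondary, htl))
                if len(results) >= enough:  # user has enough; stop early
                    break
            return results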

Well, there you have it. Long as hell. Boring to read. Fun to write, though :-)
I think it could work, and I don't see anything glaringly wrong with it given
the debate I've read so far. But poke lots of holes in it and I'll try to
scramble and fix them. Have fun.

Regards,

Adam Lydick

-- 
Freenet -- Re-Wiring the Internet
http://freenet.sourceforge.net
My Node: tcp/rivendell.yi.org:19114

