> -----Original Message-----
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Scott Young
> Sent: den 17 augusti 2003 04:17
> To: [EMAIL PROTECTED]
> Subject: Re: [freenet-dev] Search Indexing Round 2
> 
> 
> On Sat, 2003-08-16 at 21:23, Jeremy Caleb Heffner wrote:
> > I wasn't really referring to inserting the index as a typical key.
> > The reason why not is because I think this idea is based upon
> > combining indexes so that they slowly grow and 'centralize' in a
> > way, which I don't think is a good idea in a distributed system
> > like this.
> > 
> > What I was referring to was instituting a new Search message type
> > in the Freenet protocol (think somewhat like how Gnutella does
> > this, but our own version of course).
> > 
> > The way this message would work is very similar to the way a chain
> > is formed to pass key data back to the requester, which preserves
> > the anonymity of the searcher.  Caching the search results along
> > the chain also protects the privacy of the local index because it
> > would contain both locally indexed data and indexes it relayed.
> > Each time a keyword was searched for, its results would be
> > distributed to more nodes, expanding the ability to search for
> > that keyword and lessening the bandwidth consumed for X number of
> > results.
> > 
> > Would this work? (I am not claiming to be an expert of any kind,
> > just throwing out an idea for a searching system that doesn't rely
> > on conglomerating indexes.)
> > 
> > Jeremy
> 
> You should look up FASD.  FASD is a search mechanism that someone
> designed a few years ago so that searching could be done in freenet,
> but the design has lain dormant for a while.  It used a cosine
> correlation function to determine "closeness" to certain metadata.
> 
> The problem I see with a metadata-key system is that it suffers from
> the same problem as the META tags that search engines used to use to
> index sites.  How are you going to prove that the indexes are honest
> and correct?  FASD wanted to decentralize the metadata used for
> querying from an insertion standpoint.  That is, publishers were
> responsible for inserting metadata into freenet.  This means that you
> have to trust the metadata keys that were inserted.  FASD does have a
> culling mechanism so that metadata could be validated and deleted,
> but this system seems like it would be expensive to execute on a
> large-scale network.
> 
> 
> The idea I have for a search function is to have different 
> search engine sites in freenet.  The search engine 
> maintainers would gather data from freenet by 
> spidering/hand-picking/doing whatever they feel like to 
> generate this index.  When a user goes to this site, a submit 
> form sends a command to a client program (probably integrated 
> with FProxy) to execute a search across the indexes.
> 
> Indexes are stored in the following format:
> 
> [EMAIL PROTECTED]
> 
> where Keyword would be a listing of certain sites that would 
> correlate to that keyword, along with "weights" for each 
> site.  A large search index would have many of these keyword 
> pages. So a search for "movies" would look for
> 
> [EMAIL PROTECTED]
> 
> which might contain the following:
> 
> 17 [EMAIL PROTECTED]
> 15 [EMAIL PROTECTED]
> 7 [EMAIL PROTECTED]
> 5 [EMAIL PROTECTED]
> 3 [EMAIL PROTECTED]

I think it would be better if the client application calculated the
scoring. I assume that the 'weight' you mention here is somehow
calculated from where in the page the word was found, how 'valuable'
the word is within the given page, and so on.

I would recommend that the index file contain this information instead:

<Information Domain> <Relative position> <KEY>

****<Relative position>*****
<Relative position> is an integer giving the offset within the
specified information domain at which the word was found. This enables
'phrase searching'/'word sequence searching': a query like "internet
images" would evaluate to "the word 'internet' at a position where the
next position contains the word 'images'".
****</Relative position>*****

****<KEY>*****
<KEY> is the key of the resource the word originates from.
****</KEY>*****

****<Information Domain>*****
<Information Domain> is something like 'Title', 'Body', 'Keywords',
'Name' and so on. The different information domains present in the
index would be chosen by the index creator/generator and could be
published at '[EMAIL PROTECTED] domains', where the
format is something like:

<Domain name> <Count>

Where <Domain name> is the name of the domain and <Count> is the total
number of words found in that domain.

Having a list of 'Information Domains' would allow the search client
to "look in 'Name' only" or "look in 'Title' or 'Body' only". The
'Count' of words in a domain could be used as a modifier in the score
calculation for a hit: 'Title' hits are rarer than 'Body' hits and
should therefore be considered more valuable.
****</Information Domain>*****
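
As a sketch, the '<Information Domain> <Relative position> <KEY>' lines
and the phrase/domain-restricted searching they enable might look like
this (the helper names and the in-memory layout are my own
illustration, not part of the proposal):

```python
def parse_index_lines(lines):
    """Parse '<Information Domain> <Relative position> <KEY>' lines
    into a postings list of (domain, position, key) tuples."""
    postings = []
    for line in lines:
        domain, position, key = line.split(None, 2)
        postings.append((domain, int(position), key))
    return postings

def phrase_hits(word_postings, domains=None):
    """word_postings: one postings list per query word, in phrase order.

    A (domain, key) pair matches when word i appears at position p and
    word i+1 appears at position p+1 in the same domain and key --
    exactly the 'phrase searching' that <Relative position> enables.
    The optional domains argument gives "look in 'Title' only"-style
    restriction.  Returns the set of matching (domain, key) pairs."""
    current = {}  # (domain, key) -> positions where word 0 occurs
    for d, p, k in word_postings[0]:
        if domains is None or d in domains:
            current.setdefault((d, k), set()).add(p)
    for postings in word_postings[1:]:
        nxt = {}
        for d, p, k in postings:
            # Keep only occurrences directly following the previous word.
            if (d, k) in current and p - 1 in current[(d, k)]:
                nxt.setdefault((d, k), set()).add(p)
        current = nxt
    return set(current)
```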

Additionally, there should be a file at
'[EMAIL PROTECTED]' which should contain a
'weight' for this particular word, where 'weight' is defined as the
number of occurrences of 'movies' divided by the total number of words
in the index (word rareness). This would let the search engine know
that, for instance, the words 'the' or 'in' are far less valuable than
'movies' or 'internet', and adjust the final hit scores accordingly.
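
The rareness weight, and one possible way of folding it into the final
score, could be sketched as follows (the division-based adjustment is
my assumption; the text only says the scores should be adjusted
accordingly):

```python
def word_weight(occurrences, total_words):
    """'Weight' as defined above: occurrences of the word divided by
    the total number of words in the index (word rareness)."""
    return occurrences / total_words

def adjusted_score(raw_score, weight):
    """One possible adjustment: dividing by the word's frequency makes
    a hit on a rare word like 'movies' worth more than a hit on a
    common word like 'the'."""
    return raw_score / weight if weight > 0 else 0.0
```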

> Multiple-keyword searches could be done by requesting the index page
> for each of those keywords, and taking the intersection of all of
> those indexes.  The weights for a specific site across all of those
> pages could be combined by some simple formula (such as addition).

Yes, I agree. And the modifications above enable phrase searches and
useful things like "search in 'filename' only" too. I think that what
you outlined above is the best suggestion to date for how things
should work.
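
The intersect-and-add scheme quoted above could be sketched like this
(index pages modelled as plain key-to-weight mappings; the names are
illustrative):

```python
def multi_keyword_search(index_pages):
    """index_pages: one {key: weight} dict per keyword's index page.
    Keeps only keys present on ALL pages (the intersection) and
    combines the per-page weights by addition, best match first."""
    common = set(index_pages[0])
    for page in index_pages[1:]:
        common &= set(page)
    combined = {key: sum(page[key] for page in index_pages)
                for key in common}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```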

I think that the scheme I outlined above also makes searching across
multiple indexes possible. Having only an index-local 'weight' would
make it very hard to establish a common baseline for hits from
multiple indexes. Having the search client calculate the score based
on the actual 'rareness' of each word would make this much easier.

If anyone is interested, there also exists a slight (and quite easy)
extension to the scheme above that allows for tolerance of spelling
mistakes and word forms (like search, searches, searching and
searched) too.
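
The extension itself is not spelled out here; one plausible shape for
it (entirely my guess) is to index a normalised stem of each word, so
that word forms collapse to one entry, and to match misspellings by
edit distance against the index vocabulary:

```python
def crude_stem(word):
    """Very rough suffix stripping, for illustration only: collapses
    'search', 'searches', 'searching', 'searched' to one entry."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def edit_distance(a, b):
    """Classic Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def tolerant_lookup(query, vocabulary, max_dist=1):
    """Words in the index vocabulary whose stem lies within max_dist
    edits of the query's stem -- spelling-mistake tolerance."""
    q = crude_stem(query)
    return {w for w in vocabulary
            if edit_distance(q, crude_stem(w)) <= max_dist}
```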

/N

_______________________________________________
devl mailing list
[EMAIL PROTECTED]
http://hawk.freenetproject.org:8080/cgi-bin/mailman/listinfo/devl
