Let me correct some of the things I wrote; it has been months since I
worked with this now :) Forget about the previous mail.

/N

> On Sat, 2003-08-16 at 21:23, Jeremy Caleb Heffner wrote:
> > I wasn't really referring to inserting the index as a typical key.
> > The reason why not is that I think this idea is based upon combining
> > indexes so that they slowly grow and 'centralize' in a way, which I
> > don't think is a good idea in a distributed system like this.
> >
> > What I was referring to was instituting a new Search message type in
> > the Freenet protocol (think somewhat like how Gnutella does this,
> > but our own version of course).
> >
> > The way this message would work is very similar to the way a chain
> > is formed to pass key data back to the requester, which preserves
> > the anonymity of the searcher.  Caching the search results along the
> > chain also protects the privacy of the local index because it would
> > contain both locally indexed data and indexes it relayed.  Each time
> > a keyword was searched for, its results would be distributed to more
> > nodes, expanding the ability to search for that keyword and
> > lessening the bandwidth consumed for X number of results.
> >
> > Would this work? (I am not claiming to be an expert of any kind,
> > just throwing out an idea for a searching system that doesn't rely
> > on conglomerating indexes.)
> >
> > Jeremy
>
> You should look up FASD.  FASD is a search mechanism that someone
> designed a few years ago so that searching could be done in Freenet,
> but the design has lain dormant for a while.  It used a cosine
> correlation function to determine "closeness" to certain metadata.
>
> The problem I see with a metadata-key system is that it suffers the
> same problem as the META tags search engines used to index sites.
> How are you going to prove that the indexes are honest and correct?
> FASD wanted to make the metadata used for querying decentralized from
> an insertion standpoint.  That is, publishers were responsible for
> inserting metadata into Freenet.  This means that you have to trust
> the metadata keys that were inserted.  FASD does have a culling
> mechanism so that metadata could be validated and deleted, but this
> system seems like it would be expensive to execute on a large-scale
> network.
>
> The idea I have for a search function is to have different search
> engine sites in Freenet.  The search engine maintainers would gather
> data from Freenet by spidering/hand-picking/doing whatever they feel
> like to generate this index.  When a user goes to this site, a submit
> form sends a command to a client program (probably integrated with
> FProxy) to execute a search across the indexes.
>
> Indexes are stored in the following format:
>
> [EMAIL PROTECTED]
>
> where Keyword would be a listing of certain sites that would
> correlate to that keyword, along with "weights" for each
> site.  A large search index would have many of these keyword
> pages.  So a search for "movies" would look for
>
> [EMAIL PROTECTED]
>
> which might contain the following:
>
> 17 [EMAIL PROTECTED]
> 15 [EMAIL PROTECTED]
> 7 [EMAIL PROTECTED]
> 5 [EMAIL PROTECTED]
> 3 [EMAIL PROTECTED]

I think it would be better if the client application calculated the
scoring. I assume that the 'weight' you mention is somehow calculated
from where in the resource the word was found, how 'valuable' the word
is in the given resource, how many times the word is present in the
resource, and so on.

I would recommend that the index file contain this information instead:
<Information Domain> <Relative position> <KEY>

****<Relative position>*****
Where relative position is an integer telling at which offset within the
specified information domain the word was found. This enables 'phrase
searching'/'word sequence searching', like "mailing list", which would
evaluate to 'the word 'mailing' at a position where the next position
contains the word 'list''.
****</Relative position>*****
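As a sketch of how a search client could use these relative positions for
phrase searching, here is a minimal Python illustration (the postings and
CHK@ keys are invented placeholders, not real index data):

```python
from collections import defaultdict

# Invented postings: word -> list of (domain, position, key) entries.
# The CHK@ keys are placeholders, not real Freenet keys.
index = {
    "mailing": [("Body", 4, "CHK@aaaa"), ("Body", 17, "CHK@bbbb")],
    "list":    [("Body", 5, "CHK@aaaa"), ("Body", 30, "CHK@bbbb")],
}

def positions(index, word):
    """Group a word's postings as {(key, domain): set of positions}."""
    by_doc = defaultdict(set)
    for domain, pos, key in index.get(word, []):
        by_doc[(key, domain)].add(pos)
    return by_doc

def phrase_hits(index, words):
    """(key, domain) pairs where the words occur at consecutive positions."""
    hits = positions(index, words[0])
    for offset, word in enumerate(words[1:], start=1):
        nxt = positions(index, word)
        hits = {doc: pos_set for doc, pos_set in hits.items()
                if any(p + offset in nxt.get(doc, set()) for p in pos_set)}
    return set(hits)

print(phrase_hits(index, ["mailing", "list"]))  # only CHK@aaaa matches
```

The same loop handles phrases of any length, since word i of the phrase
must sit at first-word position + i.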

****<KEY>*****
<KEY> is the resource the word originates from, typically an SSK or
a CHK or similar.
****</KEY>*****

****<Information Domain>*****
Example <Information Domain>s are 'Title', 'Body', 'Keywords',
'Name' and so on. The different information domains present in the index
would be chosen by the index creator/generator and could be published
at '[EMAIL PROTECTED] domains', where the
format is something like:
<Domain name> <Count>

Where <Domain name> is the name of the domain and <Count> is the total
number of words found in that domain.

Having a list of 'Information Domains' would allow the search client to
"look in 'Name' only" or "look in 'Title' or 'Body' only". Having the
'Count' of words in the domain could be used as a modifier in the score
calculation for a hit ('Title' hits are rarer than 'Body' hits and
should therefore be considered more valuable than 'Body' hits). Having
the information about which domains are present in the index readily
available could also be very useful for helping the user construct his
query.
****</Information Domain>*****
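A minimal sketch of how a client could use the domain list, with invented
counts and postings; taking 1/<Count> as the rarity modifier is just one
possible choice on my part, not something fixed by the format:

```python
# Invented data: domain word counts (from the 'domains' file) and one
# keyword's postings in (domain, position, key) form.
domain_counts = {"Title": 1_000, "Body": 250_000}
postings = [("Title", 3, "SSK@site1"), ("Body", 42, "SSK@site1"),
            ("Body", 7, "SSK@site2")]

def score(postings, domain_counts, restrict_to=None):
    """Sum 1/<Count> per hit, so hits in rare domains (Title) count more;
    restrict_to limits the search to the given domains."""
    scores = {}
    for domain, _pos, key in postings:
        if restrict_to is not None and domain not in restrict_to:
            continue
        scores[key] = scores.get(key, 0.0) + 1.0 / domain_counts[domain]
    return scores

print(score(postings, domain_counts))          # site1's Title hit dominates
print(score(postings, domain_counts, restrict_to={"Body"}))
```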


Additionally, there should be a file at
'[EMAIL PROTECTED] information' which should contain
the total number of words in the index (and probably additional index
meta-data). Using the information from this file one could calculate
the 'weight' for each particular word, where 'weight' is defined as
"occurrences of 'wordX'"/"total number of words in the index" (word
rareness). This would allow the search engine to know that, for
instance, the word 'the' or the word 'in' is far less valuable than
'movies' or 'internet', and adjust the final hit scores accordingly.
The number of occurrences is trivial to sum up from the
'[EMAIL PROTECTED]' file. Earlier generations of index engines used
something called a 'stopword' file to accomplish the down-weighting (or
even complete removal from the index) of words like 'in' and 'the', but
this is really quite a clumsy solution and not very friendly to
language independence.
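A small sketch of that weight calculation, with invented numbers; the
-log modifier is my own assumption about how a search engine might turn
rareness into a score contribution:

```python
import math

# Invented numbers: the 'index information' file gives the total word
# count; a word's occurrence count is summed from its posting lines.
total_words = 1_000_000
occurrences = {"the": 60_000, "in": 40_000, "movies": 150, "internet": 900}

def rareness(word):
    """Occurrences of the word divided by total words; smaller = rarer."""
    return occurrences[word] / total_words

def score_modifier(word):
    """One possible choice (an assumption, not from the mail): an
    IDF-style -log of the rareness, so 'movies' outweighs 'the'."""
    return -math.log(rareness(word))

for w in ("the", "movies"):
    print(w, round(score_modifier(w), 2))
```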

It could also be useful to have a file that contains information about
the 'resources'. For instance,
'[EMAIL PROTECTED] resources', which could contain
information like:
<KEY> <mimetype> <size>

This enables an additional filtering step to be performed by the search
engine. When a number of resources have been located by the 'searching'
step, the search engine could consult this file to apply conditions
like 'media files only', 'zip files of at least 1 megabyte in size' or
'only HTML and text' to the result.
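A minimal sketch of that filtering step, with an invented resources
table (the keys, types and sizes are placeholders):

```python
# Invented resource table, as read from the 'resources' file:
# key -> (mimetype, size in bytes).
resources = {
    "CHK@a": ("video/mpeg", 50_000_000),
    "CHK@b": ("text/html", 12_000),
    "CHK@c": ("application/zip", 2_500_000),
}

def filter_hits(hits, resources, mime_prefixes=None, min_size=0):
    """Keep only hits whose mimetype matches one of the given prefixes
    and whose size is at least min_size."""
    kept = []
    for key in hits:
        mime, size = resources[key]
        if mime_prefixes and not any(mime.startswith(p) for p in mime_prefixes):
            continue
        if size < min_size:
            continue
        kept.append(key)
    return kept

hits = ["CHK@a", "CHK@b", "CHK@c"]
# 'zip files of at least 1 megabyte in size':
print(filter_hits(hits, resources, mime_prefixes=["application/zip"],
                  min_size=1_000_000))  # ['CHK@c']
```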


> Multiple-keyword searches could be done by requesting the index page
> for each of those keywords and taking the intersection of all of
> those indexes.  The weights for a specific site across all of those
> pages could be combined by some simple formula (such as addition).

Yes, I agree. And the modifications above enable phrase searches and
useful things like "search for 'filename' only" too. I think that what
you outlined above is the best suggestion to date for how things should
work.
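A minimal sketch of that intersect-and-add approach, with invented
keyword pages in the same spirit as the 'movies' example:

```python
# Invented keyword pages: keyword -> {site key: weight}.
keyword_pages = {
    "free":   {"SSK@x": 17, "SSK@y": 5, "SSK@z": 3},
    "movies": {"SSK@x": 9, "SSK@z": 12},
}

def multi_keyword(pages, keywords):
    """Intersect the sites listed on every keyword page and combine
    their weights by addition, best match first."""
    common = set.intersection(*(set(pages[k]) for k in keywords))
    return sorted(((sum(pages[k][site] for k in keywords), site)
                   for site in common), reverse=True)

print(multi_keyword(keyword_pages, ["free", "movies"]))
# [(26, 'SSK@x'), (15, 'SSK@z')]
```

SSK@y drops out because it only appears on one of the two pages.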

I think that the things I outlined above also make multi-index searches
easier to perform. Having only an index-local 'weight' would make it
very hard to establish a common score baseline for hits from multiple
indexes. Having the search engine (not the index engine) calculate the
score based on the actual 'rareness' of the word would make this much
easier.

If someone is interested, there also exists a slight extension to the
scheme outlined above that allows for tolerance of spelling mistakes
and of word forms (like search, searches, searching and searched) too.

 /N

_______________________________________________
devl mailing list
[EMAIL PROTECTED]
http://hawk.freenetproject.org:8080/cgi-bin/mailman/listinfo/devl
