[htdig3-dev] Re: New APIs (was Shared libs)

loic Mon, 19 Jul 1999 23:52:35 -0700

 > Unless I misunderstand you, Thesaurus = "Synonym Fuzzy" as it presently
 > exists.

 That's correct. 

 > Filtering, by necessity, will have to go *after* document retrieval is
 > performed, which is a great complication. It's impossible to restrict by
 > date w/o retrieving the date stamps from the document db. But if you
 > filter afterwards, the "query solver" doesn't know how many to return
 > such that the query will fulfill the user's request for, e.g. the top
 > 10.

 Completely generic filter has to be applied to the whole set of possible
documents for a given query, as well as generic (user provide for instance)
sort. It must be possible, however, to prepare the index so that it can
provide answers to frequent queries without having to read the whole document
list. Here is my idea:

      For each word there is a list of document identifiers. The list
of document identifiers is sorted (ascending or descending order). This
allows to quickly perform a word1 and word2 :
       . chose the less frequent of word1 and word2 (let's say it's word1)
       . walk word1 document identifiers list
       . in parallel walk word2 document identifiers list
       . stop when the desired number of document identifiers (first 10 for
         instance) have been retrieved.
      This works assuming that 1) there is no filtering, 2) no sorting.
      We can see how this generalizes for word1 and ( word2 or word3 ) for
instance, even if terms or expressions are weighted (like in : 
(weight 100 word1) and (word2 weight 20 or word3 weight 70) ).

      This already covers a lot of common cases, but this is not enough.
In particular it does not cover the fielded search like in :
(weight 100 title: word1) or (weight 50 anywhere: word1). For this very
common requirement, we have to walk all the list to properly apply the
weight.
       This where my idea come into play : let's say the document identifier
is not a number but a list of numbers. 

       word id1/id2/id3 id1/id2/id3 id1/id2/id3 ...

       This allows us to sort the list of document identifiers associated
to a word in multiple ways. id1 ascending, id2 descending, id3 ascending
for instance. And why is it an advantage ? Because we could say the following

       id1 identifies the tag in which the word appear (title, anywhere)
       id2 identifies the server name
       id3 identifies the document itself 

       And we have a document list very well suited for queries like

(weight 100 title: word1) or (weight 50 anywhere: word1)

        I hope you caught the idea, despite the fact that my explanations are
not really clear. The main objections to this model are the following (maybe
you'll find more :-).

       1) "This does not solve the filtering and user provided sorting performance
          issues at all." Yes but it's not meant to. I see no way to solve filtering
          and user provided sorting performance issues : all documents matching
          a query must be fed to the filter/sort procedure. No way to escape that.
       2) "This is much too specific. It lacks generality." This is not true. You
          can put it this way : the constraint when building an inverted index is
          the following: for space efficiency we are only allowed to have one inverted
          index associating word -> document list. We cannot escape that
          and start building many indexes to improve efficiency. The only thing we
          can do to improve query efficiency is to sort the list of document in 
multiple
          ways so that it matches the most common usage. This can be done by 
introducing
          abritrary numbers in the document identifier. By default this feature is off
          and you have a traditional index. If the user specifies in a configuration
          file something like the following:

                   document identifier = field title   [ asc ]
                                         hostname(url) [ desc ]
                                         url           [ asc ]

          then each document identifier is built according to this specification. 

       3) "Having more than one number for a given document identifier will eat a 
          *lot* of space.". This is not true. First, numbers that have a small range
          can efficiently be unary coded. Then each list of number being sorted, it
          can be stored as a delta from the previous one. It eats more space but
          this is reasonable regarding the advantages we get.
       4) "The IR litterature clearly says that hierachical document organization
           must *not* be stored in the index". Yes but our goal is not to store
           the hierarchy structure of the document. Our goal is to craft an index
           whose structure is pre-ordered to match the most common usage. It looks
           like a document hierarchy but it does not represent the document hierarchy
           (see that the number identifying the title tag is on top of the number
            list for instance)

        To put it very simply : the only way to improve query resolution performances
while retaining a compact index and generality is to have a multi-number document 
identifier.
I'm ready for flaming now :-)

 > This is something I already have outlined in my head. The problem is
 > that the "sorting" depends on what we're sorting. :-) However, given a
 > list of *possible* documents, the "sorting" occurs by a min-heap of size
 > N. Basically, you check the next document in the list, if it's bigger
 > than the smallest so far, you put it in the heap and drop off the
 > smallest.

 I'm not sure I understand. Why bigger or smaller document has something to
do with sorting ?

 > >   These restrictions are essential for a large scale index that is
 > > used under high stress. The preriquisite is quite clear: the index
 > > must be organized to allow this (document lists for each word must be
 > > sorted to allow ranking/sorting without reading the whole document list). If it
 > > is not, all the list must be read.
 > 
 > Strangely enough, this isn't quite true. Going into the details would
 > require a rather lengthy explanation, but anyone keen on the details
 > should pick up a copy of the 2nd edition of _Managing Gigabytes_ by
 > Moffat, Bell, and Whitten. I'm finding it well worth the investment.

 The best book on the subject, I agree. I also like 'Information Retrieval' 
(Baeza-Yates). But I don't understand your objection. Could you point to a
specific chapter that explain your point of view ? I have an e-mail contact
with one of Alastair Moffat student who will certainly be interested to
join the discussion.

 > I think this is a bit separate. We probably want to *cache* frequent
 > queries. For example, something like 90% of the queries on AltaVista are
 > from a list of about 1000 queries. So they just spit off a pre-computed
 > answer from the cache. IMHO, that doesn't mean we should necessarily be
 > storing wordlists in some ranked order when that ranking may change.
 > AFAIK, AltaVista doesn't rank by date, but we offer it (admittedly at a
 > performance penalty).

 I do not agree with that, but I'm open to discussion. We run a search
engine here in france. It is much less popular than AltaVista but answers
around 60 000 requests per day. We maintain a cache of query answers
and indeed, many requests are repeated. But the ratio is not even close
to 1000 queries that make 90% of the volume. I would say that 1000 words
are very frequent and appear in 50-60% of the requests. But they are
often mixed with other words, which changes the set of results. Anyway,
if we keep the query answers cache during 24 hours, we have around 
40 000 entries. The saving is great but does not really make a big
difference for the query process. We could keep the query cache for
7 days instead of 24 hours and save more. But we would face inconsistencies
as the index is updated. Therefore we must stick to 24 hours.

 I really enjoy this discussion :-) I'm not sure my ideas do not
violate IR theory, therefore I need to discuss them.

-- 
                Loic Dachary

                ECILA
                100 av. du Gal Leclerc
                93500 Pantin - France
                Tel: 33 1 56 96 09 80, Fax: 33 1 56 96 09 61
                e-mail: [EMAIL PROTECTED] URL: http://www.senga.org/


------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED] containing the single word "unsubscribe" in
the SUBJECT of the message.
[htdig3-dev] Re: New APIs (was Shared libs)

Reply via email to