Geoff Hutchison writes:
> OK, this week I'm going to give a brief overview of what goes on for a
> search. This side is obviously very important because fast response is
> critical--yet the whole process in htsearch is fairly complex. I'll
> briefly touch on the new "collection" support too.
First, thanks a lot for these tutorials. They are very helpful.
> Some speedup could come from more sophisticated retrieval and scoring
> mechanisms. Certainly performing all boolean operations at once
> (rather than pairwise) could speed up multiple term queries. Limiting large
> searches could also help, along the lines of what is described in
> _Managing Gigabytes_, either using a frequency-sorted or uniform
> distribution of words. Result or score caches would obviously also help.
Some comments on this and the word database (inverted index) structure.
The structure of the inverted index makes it possible to open a
search cursor for every word in the query. Searching for the first
occurrences of each search term in parallel is therefore
supported. The frequency of terms may also be maintained by the
inverted index. It is not maintained by default, but setting
'wordlist_extend: true' activates this.
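
To make the "one cursor per word" idea concrete, here is a minimal
Python sketch. It is an illustration only, not the htdig code: the
in-memory INDEX dictionary and the intersect_all() function are
hypothetical stand-ins for the on-disk word database and its cursors.
It ANDs all query terms in a single pass instead of pairwise.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for the word database: each word maps
# to a list of document ids sorted in ascending order.
INDEX: Dict[str, List[int]] = {
    "search": [1, 2, 4, 7, 9],
    "engine": [2, 3, 4, 9, 12],
    "index": [2, 4, 5, 9],
}


def intersect_all(index: Dict[str, List[int]], terms: List[str]) -> List[int]:
    """AND all terms at once, keeping one cursor (position) per term."""
    lists = [index.get(term, []) for term in terms]
    if not lists or any(not postings for postings in lists):
        return []  # no terms, or a term that matches nothing

    cursors = [0] * len(lists)
    result: List[int] = []
    while True:
        current = []
        for postings, pos in zip(lists, cursors):
            if pos >= len(postings):
                return result  # one cursor is exhausted: nothing more can match
            current.append(postings[pos])

        target = max(current)
        if all(doc == target for doc in current):
            # Every cursor points at the same document: it is a match.
            result.append(target)
            cursors = [pos + 1 for pos in cursors]
        else:
            # Move every cursor that is behind up to the largest candidate.
            for i, postings in enumerate(lists):
                while cursors[i] < len(postings) and postings[cursors[i]] < target:
                    cursors[i] += 1


print(intersect_all(INDEX, ["search", "engine", "index"]))
```

With term frequencies available, the cursors could also be ordered so
that the rarest word drives the scan, but the sketch does not do this.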
The inverted index is also able to store word occurrences according to
relevance ranking (provided the relevance ranking of each word can be
calculated at indexing time). This way the first 10 occurrences of a word
are always the most relevant.
Obviously there are some relevance ranking algorithms that need to
work on all the occurrences of the words or the documents found, and in
this case you have to retrieve all of them (word occurrences or
documents). But for simple queries with relevance ranking encoded in
the inverted index, the number of word occurrences that need to be
retrieved for each search can be close to optimal.
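
A rough Python sketch of that last point, under stated assumptions:
the RANKED dictionary, the (score, document id) layout and the
counting_cursor() and top_k() helpers are all hypothetical, and each
word's occurrences are assumed to be stored in descending score order
as computed at indexing time. The cursors are merged lazily, so only a
few entries are read to produce the k best occurrences. Scores for the
same document are not combined here.

```python
import heapq
from itertools import islice
from typing import Dict, Iterator, List, Tuple

Posting = Tuple[float, int]  # (relevance score, document id)

# Hypothetical postings, already sorted by descending score per word.
RANKED: Dict[str, List[Posting]] = {
    "htdig": [(0.9, 4), (0.7, 1), (0.4, 8), (0.2, 3)],
    "search": [(0.8, 2), (0.5, 4), (0.3, 6)],
}


def counting_cursor(entries: List[Posting], counter: List[int]) -> Iterator[Posting]:
    """Yield entries one by one, counting how many were actually read."""
    for entry in entries:
        counter[0] += 1
        yield entry


def top_k(postings: Dict[str, List[Posting]], words: List[str], k: int,
          counter: List[int]) -> List[Posting]:
    """Return the k highest-scoring occurrences across all query words."""
    cursors = [counting_cursor(postings.get(w, []), counter) for w in words]
    # heapq.merge is lazy: with reverse=True it merges inputs that are
    # each sorted from largest to smallest, pulling entries on demand.
    merged = heapq.merge(*cursors, key=lambda p: p[0], reverse=True)
    return list(islice(merged, k))


reads = [0]
best = top_k(RANKED, ["htdig", "search"], 3, reads)
total = sum(len(entries) for entries in RANKED.values())
print(best)
print(f"read {reads[0]} of {total} stored occurrences")
```

A ranking scheme that needs every occurrence would instead have to
drain each cursor completely, which is the case described above.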
I studied the search mechanism of htdig and figured out that changing it
to take advantage of the index structure is not a trivial task. I chose
to focus on the index structure first and have a reliable piece of code
before diving into this. The last fix committed shows that this part is
quite tricky ;-)
Cheers,
--
Loic Dachary
24 av Secretan
75019 Paris
Tel: 33 1 42 45 09 16
e-mail: [EMAIL PROTECTED]
URL: http://www.senga.org/