Geoff Hutchison writes:
>
> A better way to do it is to have some sort of second-level index that
> makes it easier to weed out which databases you don't need to hit or
help you keep the number of results manageable. As I tried to mention in
> the previous thread, it isn't obvious what the best organization for
> this secondary index is.
>
Yes. Verity (Search97) uses an n-gram index to find out whether a given
database contains the word or not. It prevents unnecessary database
open/search/fail/close cycles.
Fulcrum does not do that and it works pretty well.
The difference between Verity and Fulcrum is essentially that the upper
limit for a Verity database is 65000 documents (at least it was last year).
They therefore had to cope with the fact that there are a lot of databases
on the same machine. Fulcrum, on the contrary, is limited by the size of
the inverted file (around 2Gb, whatever the machine is), so they have
fewer databases and do not need the second-level n-gram file.
> I'm beginning to believe that the second-level index should be some sort
> of merged word database. It could simply be an inverted index of words
> pointing to databases (escaping that issue of querying databases that
> don't contain those words) or better yet, a total merge of all the word
> databases.
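The first variant (words pointing to databases) could be sketched roughly
as follows. This is Python for brevity, not htdig code; the function names
and the in-memory dict are made up for illustration, and a real version
would have to live on disk.

```python
from collections import defaultdict

def build_word_to_databases(databases):
    """Build the second-level index.

    databases: dict mapping a database name to the words it contains
    (hypothetical input format, for illustration only).
    """
    second_level = defaultdict(set)
    for db_name, words in databases.items():
        for word in words:
            second_level[word].add(db_name)
    return second_level

def databases_to_query(second_level, query_words):
    """Return the databases that contain at least one of the query words."""
    hits = set()
    for word in query_words:
        # .get() avoids creating empty entries for unknown words
        hits |= second_level.get(word, set())
    return hits
```

Databases that are not in the returned set never need to be opened.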
I tend to advocate no second-level index. With the new htdig
index, provided that you run an OS that has 64-bit offsets, you will
always be able to build a single index that reaches the RAM/CPU limit
of the machine you're running on. Let's say that your 2-processor
machine with 1Gb of RAM is able to answer requests at a rate of 10
req/sec for a 50Gb index. The problem you then have is not opening
another 50Gb database on the same machine, since it has no RAM/CPU
left, but opening up to 20 databases on 20 different machines (I
think 100% of htdig users will be happy if they can run a 1Tb index
:-). Now suppose that, instead of distributing the indexes according to
the document name (MD5 key of URL % number of indexes), as suggested in
a previous mail, we distribute index entries according to the word (MD5
key of word % number of indexes). We will then know exactly which index
to query for a given word. Whatever we do, it will require a function
to move documents/words from one index to another.
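A minimal sketch of the word-based distribution, in Python for brevity
rather than in htdig's own code. The function names and the count of 20
indexes are assumptions taken from the example above, not an existing API.

```python
import hashlib

NUM_INDEXES = 20  # assumption: one index per machine, as in the example

def index_for_word(word, num_indexes=NUM_INDEXES):
    """Return the number of the index holding every entry for `word`."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    # Interpret the 128-bit MD5 key as an integer, then take the modulo.
    return int.from_bytes(digest, "big") % num_indexes

def indexes_for_query(words, num_indexes=NUM_INDEXES):
    """Return the sorted list of indexes a multi-word query must contact."""
    return sorted({index_for_word(w, num_indexes) for w in words})
```

A one-word query hits exactly one index. A query with several words
contacts at most one index per distinct word, and the per-word results
still have to be merged afterwards.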
> Fortunately, (or unfortunately depending on perspective), this
> second-level index pretty much has to be constructed after the other
> databases are generated. So it could be compressed significantly.
> Essentially, the code could test for the existence of the database
> and use it to speed queries if it existed.
>
> Sound reasonable?
I agree with you that this is a possible solution, and a reasonable one.
I'm just trying to figure out whether we could find a simpler one. It would
be really nice to know how Inktomi does it. Does anyone know of a paper on
the subject?
--
Loic Dachary
ECILA
100 av. du Gal Leclerc
93500 Pantin - France
Tel: 33 1 56 96 09 80, Fax: 33 1 56 96 09 61
e-mail: [EMAIL PROTECTED] URL: http://www.senga.org/