A caveat first:
There will be pain in the 3.2 upgrade. I've thought about this a great
deal, and unless someone can figure out how you reconstruct the 'lost'
words from the 3.1.x databases, there's no way to upgrade them from
previous versions. You could just convert them, but then any new 3.2
documents would end up dominating the results because they'd have more
words. There's also the whole issue of weight != flags. We can't figure
out where the words came from in 3.1.x, so we can't easily assign flags
to them.
I understand that the above will probably get a bunch of people annoyed
with me. If someone can figure out a way to convert them, all the
better. If not, let's make sure this is the last time we have to make
backwards-incompatible changes.
Back to the original question:
> I would say that the new index structure is exactly trying to
> solve your problem. The only thing that has not yet been discussed is
> how to implement something that would merge answers from many htdig
> databases. Any idea, Geoff?
I haven't seen any research to indicate distributed searching is a
solved problem. It's obviously easy enough to loop through the sets of
databases and perform the lookups required on each of them to get a set
of results, then sort/rank and display once they're all collected. But
that's slow--you're opening tons of files and retrieving results that
you don't need. (For example, for the mailing list example, some mailing
lists may not use some words. Why query that database?)
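To make the brute-force approach concrete, here is a rough sketch in
Python. It is not ht://Dig code: the function name search_all and the
dict-of-dicts layout (database name -> word -> list of (doc_id, score))
are assumptions made up for illustration, standing in for the real
on-disk word databases.

```python
import heapq


def search_all(databases, word, limit=10):
    """Query every database for `word`, then rank the merged hits.

    `databases` is a hypothetical in-memory stand-in:
    {db_name: {word: [(doc_id, score), ...]}}.
    """
    hits = []
    for db_name, word_index in databases.items():
        # Every database is consulted, even ones that never saw the word.
        for doc_id, score in word_index.get(word, []):
            hits.append((score, db_name, doc_id))
    # Rank only after everything has been collected.
    return heapq.nlargest(limit, hits)


if __name__ == "__main__":
    dbs = {
        "list-a": {"index": [("a1", 3.0)]},
        "list-b": {"index": [("b1", 5.0)], "merge": [("b2", 1.0)]},
    }
    print(search_all(dbs, "index"))
```

Note that the loop touches every database regardless of whether the
word occurs in it, which is exactly the wasted work described above.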
A better way to do it is to have some sort of second-level index that
makes it easier to weed out which databases you don't need to hit or
helps you keep the number of results manageable. As I tried to mention in
the previous thread, it isn't obvious what the best organization for
this secondary index is.
I'm beginning to believe that the second-level index should be some sort
of merged word database. It could simply be an inverted index of words
pointing to databases (escaping that issue of querying databases that
don't contain those words) or better yet, a total merge of all the word
databases.
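As a sketch of the simpler of the two options (words pointing to
databases), something like the following Python would do. The names
build_second_level and databases_for_query and the in-memory dicts are
hypothetical, chosen only for illustration; a real version would read
the existing word databases rather than Python dicts.

```python
from collections import defaultdict


def build_second_level(databases):
    """Build an inverted index of word -> set of database names.

    `databases` is a hypothetical stand-in: {db_name: {word: postings}}.
    """
    routing = defaultdict(set)
    for db_name, word_index in databases.items():
        for word in word_index:
            routing[word].add(db_name)
    return routing


def databases_for_query(routing, words):
    """For an AND query, return only databases that contain every word."""
    sets = [routing.get(word, set()) for word in words]
    if not sets:
        return set()
    return set.intersection(*sets)


if __name__ == "__main__":
    dbs = {
        "list-a": {"index": [], "search": []},
        "list-b": {"index": [], "merge": []},
    }
    routing = build_second_level(dbs)
    print(databases_for_query(routing, ["index", "merge"]))
```

A full merge of the word databases would carry the document postings as
well, so the query could be answered from the merged index alone; the
version above only tells you which databases are worth opening.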
Fortunately (or unfortunately, depending on your perspective), this
second-level index pretty much has to be constructed after the other
databases are generated. So it could be compressed significantly.
Essentially, the code could test for the existence of this second-level
database and use it to speed up queries if it exists.
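The "use it if it's there" logic could look roughly like this in
Python. Everything here is an assumption for the sake of the sketch: the
function name candidate_databases, the index_path argument, and the
choice of a JSON file mapping each word to a list of database names.
None of it reflects an actual ht://Dig file format.

```python
import json
import os


def candidate_databases(word, all_databases, index_path):
    """Return the databases worth querying for `word`.

    If the optional second-level index (hypothetically a JSON file of
    {word: [db_name, ...]}) is missing, fall back to querying everything.
    """
    if not os.path.exists(index_path):
        # No second-level index was built: behave as before.
        return list(all_databases)
    with open(index_path) as fh:
        routing = json.load(fh)
    wanted = set(routing.get(word, []))
    # Preserve the caller's ordering while dropping irrelevant databases.
    return [db for db in all_databases if db in wanted]
```

The point of the fallback is that searches keep working, just more
slowly, on installations that never generate the second-level index.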
Sound reasonable?
-Geoff