[EMAIL PROTECTED] wrote:
> This is where my idea comes into play: let's say the document identifier
> is not a number but a list of numbers.
>
> word id1/id2/id3 id1/id2/id3 id1/id2/id3 ...
..
> id1 identifies the tag in which the word appears (title, anywhere)
Yes, we already have this in the word db changes. I called it 'flag'.
> id2 identifies the server name
> id3 identifies the document itself
Actually id2/id3 -> DocID, which is unique for the document across all
servers.
> And we have a document list very well suited for queries like
Basically what you outlined was the format of the word database in
3.1.x, with 'weight' in place of id1 and 'DocID' in place of id2/id3.
You also suggested that the word list be sorted.
This would be cool in the world we were just in. However, in the world
where we're supposed to keep track of location for phrase searching,
it's a bit more complex. (I wish it weren't and I'd love to be proven
wrong). Somewhere we have to keep track of *each* word in a document and
its location, in case someone wants to do a phrase search (or a "near"
search or a weighted search by proximity).
We can still have such a list (and/or other index formats). But I wanted
to point out it will require duplicating some of the data in the other
word db.
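
To make the location requirement concrete, here is a minimal sketch of a positional word list and a phrase lookup over it. The names build_index and phrase_search, and the in-memory dict layout, are assumptions for illustration only and are not the ht://Dig word db format.

```python
from collections import defaultdict


def build_index(docs):
    """Map word -> doc_id -> list of word locations (hypothetical layout)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for location, word in enumerate(text.lower().split()):
            index[word][doc_id].append(location)
    return index


def phrase_search(index, phrase):
    """Return sorted doc_ids in which the words of `phrase` appear consecutively."""
    words = phrase.lower().split()
    if not words:
        return []
    hits = []
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # Every later word must sit exactly i positions after the first word.
            if all(
                start + i in index.get(word, {}).get(doc_id, ())
                for i, word in enumerate(words[1:], 1)
            ):
                hits.append(doc_id)
                break
    return sorted(hits)
```

A list holding only (weight, DocID) per word cannot answer this query, because the locations are gone. That is why keeping such a list alongside the positional data means storing some of it twice.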
>> This is something I already have outlined in my head. The problem is
>> that the "sorting" depends on what we're sorting. :-) However, given a
>> list of *possible* documents, the "sorting" occurs by a min-heap of size
>> N. Basically, you check the next document in the list, if it's bigger
>> than the smallest so far, you put it in the heap and drop off the
>> smallest.
>
> I'm not sure I understand. Why does a bigger or smaller document have
> anything to do with sorting?
I should have been a little clearer. I meant documents with a bigger or
smaller 'weight', based on whatever the sort is using.
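
As a rough sketch of the min-heap idea, assuming the candidates arrive as (DocID, weight) pairs and that we want the N highest weights (top_n is a hypothetical name, not an existing ht://Dig function):

```python
import heapq


def top_n(candidates, n):
    """Return the n highest-weight (doc_id, weight) pairs, best first."""
    if n <= 0:
        return []
    heap = []  # min-heap of (weight, doc_id); heap[0] is the smallest kept so far
    for doc_id, weight in candidates:
        if len(heap) < n:
            heapq.heappush(heap, (weight, doc_id))
        elif weight > heap[0][0]:
            # Bigger than the smallest so far: put it in, drop the smallest.
            heapq.heapreplace(heap, (weight, doc_id))
    return [(doc_id, weight) for weight, doc_id in sorted(heap, reverse=True)]
```

The heap never holds more than N entries, so the full list of possible documents is scanned once and never sorted as a whole.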
> The best book on the subject, I agree. I also like 'Information Retrieval'
> (Baeza-Yates). But I don't understand your objection. Could you point to a
> specific chapter that explains your point of view? I have an e-mail contact
> with one of Alastair Moffat's students, who will certainly be interested in
> joining the discussion.
I don't have my copy handy. I'll e-mail you the section reference later.
But I was referring to the continue/quit strategies for rankings. As I
read it, you don't actually read the entire list of words.
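
Pending the exact section reference, here is a simplified sketch of how I understand the idea: process the rarest query terms first and cap the number of score accumulators. Under "quit", processing stops once the cap is reached; under "continue", later terms only update documents that already have an accumulator. The function name rank, the postings layout, and the summed weights are all assumptions for illustration, not the book's exact formulation.

```python
def rank(postings, query_terms, max_accumulators, strategy="continue"):
    """postings: word -> list of (doc_id, weight). Returns doc_id -> score."""
    # Rarest terms first: shorter lists are assumed to be more selective.
    terms = sorted(
        (t for t in query_terms if t in postings),
        key=lambda t: len(postings[t]),
    )
    accumulators = {}
    for term in terms:
        if strategy == "quit" and len(accumulators) >= max_accumulators:
            break  # remaining (more common) word lists are never read
        for doc_id, weight in postings[term]:
            if doc_id in accumulators:
                accumulators[doc_id] += weight
            elif len(accumulators) < max_accumulators:
                accumulators[doc_id] = weight
            # otherwise the cap is reached and no new document is admitted
    return accumulators
```

With "quit", the long lists for common words may be skipped entirely, which is the sense in which the entire list of words is not read.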
> I really enjoy this discussion :-) I'm not sure my ideas do not
> violate IR theory, therefore I need to discuss them.
I'm not sure if this is the *best* place for IR theory. I wouldn't mind
seeing ht://Dig as an avenue for IR research, much like gcc is used for
some compiler research. After all, the list of ht://Dig users is quite
heterogeneous and could provide for interesting real-life sampling. But
ultimately the project must provide more than theory. ;-)
-Geoff