Erik Hatcher wrote:

On Friday, September 5, 2003, at 07:45 PM, Leo Galambos wrote:

And for the second time today.... QueryFilter. It allows narrowing the documents queried to only the documents from a previous Query.



I guess, it would not be an ideal solution - the first query does two things a) it selects a subset from the corpus; b) it assigns a relevance to each document of this subset. Your solution omits the second point. It implies, the solution will not return "good hit lists", because you will not consider the information value of the first query which was given to you by a user.


Yes, you're right. Getting the scores of a second query based on the scores of the first query is probably not trivial, but probably possible with Lucene. And that combined with a QueryFilter would do the trick I suspect. Somehow the scores of the first query could be remembered and used as a boost (or other type of factor) the scores of the second query.


Well, I do not want to be a pessimist, but the boost vector is not a good solution due to CWI statistics. On the other hand, it is much better than the simple QueryFilter which, in fact, works as 0/1 boost.

Example: I use this notation: inverted_list_term:{list of W values, "-" denotes W=0, for 12 documents in a collection}
A:{23[16]------27}
B:{--[38]--------}
C:{18[2-]45239812}
If your first query is B, the subset of documents (denoted by brackets - namely, the 3rd and 4th doc) is selected, and if your second query is "A C", then you cannot use global IDFs, because in the subset, the IDF factors are different. Globally, A is better distriminator, but in the subset, C is better. This fact is then reflected by the hit list you generate, and I guess, the quality will be also affected by this.


The example shows, that you would rather export the subset to an auxiliary index (RAMDirectory?) and then use this structure instead of the original index. Obviously, it will solve the issue of speed you mentioned.

Unfortunately, I am not sure, if you can export the inverted lists when you read them. In egothor, I would use a listener in Rider class, in Lucene, I would have to rewrite some classes and it could be a real problem. Maybe, there is a solution I do not see...

Your turn ;-)
Cheers,
Leo


Am I off base here?


Thus I think, Chris would implement something more complex than QueryFilter. If not, the results will be poorer than with the commercial packages he may get. He could use a different model where "AND" is not an associative operator (i.e. some modification of the extended Boolean model). It implies, he would implement it in Similarity.java (if I remember that class name correctly).


Right... but you'd still need the filtering capability as well, I would think - at least for performance reasons.

Erik


--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]






---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to