Chuck Williams wrote:
I think the differences are pretty clear as the system stands.  Notice
a substantial difference in the idf's in the respective explanations.  I
continue to think the current mechanism weights these too high,
primarily due to its squaring.

The other big difference occurs when all query terms are not required,
as the current mechanism then does not consider term diversity (e.g., t1
in title and in content gets as good a score as t1 in title and t2 in
content), while the new approach does.

Right. I'd like to be able to separately discuss such issues and how to fix them. Confounding them makes changes to Lucene an all-or-nothing proposition. What will be easiest procedurally is to make a series of uncontroversial, clear improvements to the code, not wholesale replacements. In the end we may get to the same place, but we'll still have more people on board. I don't think a revolution is required, just some evolution.


If we want to change the way idf is used, is there a reason we cannot evaluate that change on its own and then, once that's settled, move on to the next issue? We may find that some things cannot be changed in isolation, but my guess is that idf and "term diversity" can and should be discussed separately.

It would translate a query "t1 t2" given fields f1 and f2 into
something like:

+(f1:t1^b1 f2:t1^b2)
+(f1:t2^b1 f2:t2^b2)
f1:"t1 t2"~s1^b3
f2:"t1 t2"~s2^b4


This does not seem scalable.  How do you expand a general query with n
terms?

Perhaps my example was unclear. Here's a three term query:

+(f1:t1^b1 f2:t1^b2)
+(f1:t2^b1 f2:t2^b2)
+(f1:t3^b1 f2:t3^b2)
f1:"t1 t2 t3"~s1^b3
f2:"t1 t2 t3"~s2^b4

Is that any clearer?
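
To make the pattern concrete for n terms, here is a minimal sketch in plain Java that builds the clause strings shown above. It does not use any Lucene API; the class name MultiFieldExpansion, the FieldSpec holder, and the expand method are all hypothetical names invented for illustration, and the boosts and slops are kept as the symbolic placeholders (b1, s1, ...) from the example rather than real numbers.

```java
import java.util.ArrayList;
import java.util.List;

public class MultiFieldExpansion {

    /** Per-field settings. Hypothetical holder, not a Lucene class. */
    static final class FieldSpec {
        final String name;        // field name, e.g. "f1"
        final String termBoost;   // boost applied to single-term clauses
        final String slop;        // slop for the phrase clause
        final String phraseBoost; // boost applied to the phrase clause

        FieldSpec(String name, String termBoost, String slop, String phraseBoost) {
            this.name = name;
            this.termBoost = termBoost;
            this.slop = slop;
            this.phraseBoost = phraseBoost;
        }
    }

    /**
     * Expands the terms into one required clause per term (matching the term
     * in any field) plus one optional sloppy-phrase clause per field.
     */
    static List<String> expand(List<String> terms, List<FieldSpec> fields) {
        List<String> clauses = new ArrayList<>();

        // One required clause per query term: +(f1:t^b1 f2:t^b2 ...)
        for (String term : terms) {
            StringBuilder sb = new StringBuilder("+(");
            for (int i = 0; i < fields.size(); i++) {
                FieldSpec f = fields.get(i);
                if (i > 0) {
                    sb.append(' ');
                }
                sb.append(f.name).append(':').append(term)
                  .append('^').append(f.termBoost);
            }
            clauses.add(sb.append(')').toString());
        }

        // One optional phrase clause per field: f:"t1 t2 ..."~slop^boost
        String phrase = String.join(" ", terms);
        for (FieldSpec f : fields) {
            clauses.add(f.name + ":\"" + phrase + "\"~" + f.slop
                    + "^" + f.phraseBoost);
        }
        return clauses;
    }

    public static void main(String[] args) {
        List<FieldSpec> fields = List.of(
                new FieldSpec("f1", "b1", "s1", "b3"),
                new FieldSpec("f2", "b2", "s2", "b4"));
        for (String clause : expand(List.of("t1", "t2", "t3"), fields)) {
            System.out.println(clause);
        }
    }
}
```

With n terms and m fields this produces n + m top-level clauses (n required clauses of m alternatives each, plus m phrase clauses), so under these assumptions the expansion grows linearly in the number of terms.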

I sent a note earlier today suggesting that a new Query class is needed
that simultaneously handles multiple fields, term diversity and term
proximity.

Is that distinct from my goal to develop an improved MultiFieldQueryParser for Lucene 2.0?


Do folks agree that this is a good general formulation?

Not unless it is scalable and the desire is to require all query terms.

I'm not sure what you mean by scalable.

I would rather not require all query terms, which introduces a more
complex diversity requirement (ensure that as many distinct query terms
as possible are matched somewhere).

Requiring all query terms is acceptable and even expected by most searchers today. All of the major web search engines implement this, and that's where folks learn to search today.


Doug
