Hi Chris,
Here is an approach which works based on the quantity
of matching terms in an adapted BooleanQuery:

http://issues.apache.org/bugzilla/show_bug.cgi?id=35284

Paul makes an interesting observation at the end which
shows how this functionality can be added to the
existing BooleanQuery without too much effort. I'd
personally like to see this added to BooleanQuery. As
an example application, I currently use this
functionality in my custom
CoordConstrainedBooleanQuery to prevent "More Like
This" queries returning long lists of dissimilar
documents by insisting on 30% of generated query terms
matching.
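
To make the 30% rule concrete, here is a minimal sketch in plain Python. It is not Lucene code and not the actual CoordConstrainedBooleanQuery; the function name, the "set of terms per document" model and the choice to round the required count up are all assumptions made for illustration.

```python
def min_match_filter(query_terms, docs, min_percent=30):
    """Return ids of docs matching at least min_percent of the distinct query terms.

    docs maps a document id to the set of terms it contains
    (a hypothetical stand-in for an index).
    """
    terms = set(query_terms)
    # Assumption: round the required count up (integer ceiling),
    # and always insist on at least one matching term.
    required = max(1, (min_percent * len(terms) + 99) // 100)
    return [doc_id for doc_id, doc_terms in docs.items()
            if len(terms & doc_terms) >= required]
```

With ten generated query terms and the default of 30%, a document needs at least three of them to be returned, however highly its one or two matching terms might score.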

This approach of course is based purely on the
quantity of matching terms, not the quality-based
measures in your example. As you suggest, quality is a
combination of user-derived measures (boosts) and
data-derived measures (tf, idf, docBoost). It sounds
like a more informed approach in principle but I'm
not currently sure how it would be implemented
efficiently in practice. Here's one possible approach
I can think of:
I have previously optimized large BooleanQueries
generated from nGrams by taking only the top
idf-ranked terms, purely to reduce query times. A
similar approach could be used to automatically
rewrite a BooleanQuery consisting of entirely optional
terms into the equivalent of:
+( my high idf terms) (low idf terms)
Basically this produces a query that MUST match the
decent terms and scores extra points for the "optional
extras". Query term boosts could be factored into the
decision for selecting the "Must have" terms and "nice
to haves".
This would help maintain a minimum level of relevance
when relevance isn't the primary sort field.
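
A rough sketch of that rewrite, again in plain Python rather than against the Lucene API. The idf formula, the num_required cut-off and the helper names are assumptions for illustration only; the output is just a query string in the "+(...) ..." form shown above.

```python
import math

def idf(doc_freq, num_docs):
    # Assumption: a smoothed idf, log(N / (df + 1)) + 1.
    return math.log(num_docs / (doc_freq + 1)) + 1.0

def rewrite_optional_terms(doc_freqs, num_docs, boosts=None, num_required=2):
    """Split all-optional terms into a required group and optional extras.

    doc_freqs maps term -> document frequency; boosts maps term -> query boost.
    """
    boosts = boosts or {}

    def weight(term):
        # Query boosts are factored into the "must have" decision.
        return idf(doc_freqs[term], num_docs) * boosts.get(term, 1.0)

    # Highest weight first; ties broken alphabetically for a stable result.
    ranked = sorted(doc_freqs, key=lambda t: (-weight(t), t))
    must, nice = ranked[:num_required], ranked[num_required:]
    query = "+(" + " ".join(must) + ")"
    if nice:
        query += " " + " ".join(nice)
    return query
```

Note that the terms inside the required group are still optional relative to each other, so a document must match at least one of the high idf terms, and the remaining terms only add to its score.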

Cheers,
Mark


        
        
                