Re: product based term combination for BooleanQuery?

Tim Sturge Tue, 03 Jul 2007 12:09:22 -0700

That's true, but it's not clear that I want phrase matches. Consider forexample:

"Lucene Download" as a query. I want something that strongly references"Lucene" (in the title) and strongly references "Download" but "DownloadLucene" or "Lucene Project Download" are better than some page thathappens to contain the exact phrase.

Other examples are "camera review" or "Gonzales scandal"; there's awhole class of "subject <modifier>" queries that are not really phrasebased, and my corpus isn't large enough to necessarily contain thephrase anyway.

I agree that many two or three word queries are really best matched byphrases, but not all. Is it common to use a phrase query with high slopto overcome the unequal weighting problem?

Also, my interface does support "\"John Bush\"" (ie the user can quotethe phrase if they like) and I would prefer not to infer automaticallythat they meant to do so.


Tim

Jason Pump wrote:

You're not using any type of phrase search. Try ->
( (title:"John Bush"^4.0) OR (body:"John Bush") ) AND ((title:John^4.0 body:John) AND (title:Bush^4.0 body:Bush) )
or maybe
( (title:"John Bush"~4^4.0) OR (body:"John Bush"~4) ) AND ((title:John^4.0 body:John) AND (title:Bush^4.0 body:Bush) )
Tim Sturge wrote:
I'm following myself up here to ask if anyone has experience or codewith a BooleanQuery that weights the terms it encounters on a productbasis rather than a sum basis.
This would effectively compute the geometric mean of the term score(rather than the arithmetic mean) and would give me more "middlebias". It also has the great advantage that it automaticallyimplements AND (as something without the term has a score of 0.0which causes the query to go to 0.0 as well.)
I'm curious though why this doesn't already exist. Is it a bad ideain general (that I will discover once I implement it and look at theresults?) or does it make searching a lot slower?
Thanks,

Tim

Tim Sturge wrote:
I have an index with two different sources of information, one smallbut of high quality (call it "title"), and one large, but of lowerquality (call it "body"). I give boosts to certain documentsrelated to their popularity (this is very similar to what one woulddo indexing the web).
The problem I have is a query like "John Bush". I translate thatinto " (title:John^4.0 body:John) AND (title:Bush^4.0 body:Bush) ".But the results I get are:
1. George Bush
...
4. John Kerry
...
10. John Bush

The reason is (looking at explain) that George Bush is scored:
169 = sum(
1 =  <match in body with tiny norm for "John">
)
168 = sum(
    160 = <title match for "Bush">
    8 = <body match for "Bush">
)
)
and John Kerry is similar but reversed. Poor old "John Bush" onlyscores:
72 = sum(
 40 = (<title match for "John">+<body match>)
 32 = (<title match for "Bush">+ <body match>)
)

because his initial boost was only 1/4 of George's.
The question I have is, how can tell the searcher to care about"balance"? I really want the score over 2 terms to be more like(sqrt(X)+sqrt(Y))^2 or maybe even exp(log(X)+log(Y)) rather thanjust X+Y. Is that supported in some obvious way, or is there someother way to phrase my query to say "I want both terms but theyshould both be important if possible?"
Thanks,

Tim







---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: product based term combination for BooleanQuery?

Reply via email to