Re: product based term combination for BooleanQuery?

Erick Erickson Wed, 04 Jul 2007 12:09:54 -0700

It looks like you may already be aware of this, but if not, this is
something Chris H. posted quite a while ago that I found useful....


<<<it depends on your goal.  index time field boosts are a way to express
things like "this document's title is worth twice as much as the title of
most documents".  query time boosts are a way to express "i care about
matches on this clause of my query twice as much as i do about matches to
other clauses of my query".>>>

Erick

On 7/4/07, Tim Sturge <[EMAIL PROTECTED]> wrote:



:-) The use of wikipedia data here is no secret; it's all over
www.freebase.com. I just hoped to avoid being sucked into a "what is the
best way to index wikipedia with Lucene?" discussion, which I believe
several other groups are already tackling.

At index time, I used a per document boost (over all fields) and a per
field boost (over all documents). I can certainly factor out the first into a
query boost, but I was under the impression that if I ever wanted to combine
fields (e.g. to index all "name", "alias" and "title" data in a single "head"
field) then I had to pre-boost the data prior to combining it. I tend to
believe that these (short) fields contain more relevant information than
(long) wikipedia articles or other documents.

Should idf and tf take care of that short/long quality distinction? It
sounds like you feel they should.
I'll build an index without the per field boost and see if that produces
improved results.

Thanks,

Tim

----- Original Message -----
From: "Chris Hostetter" <[EMAIL PROTECTED]>
To: "Lucene Users" <[email protected]>
Sent: Tuesday, July 3, 2007 10:26:57 PM (GMT-0800) America/Los_Angeles
Subject: Re: product based term combination for BooleanQuery?


(side note: if you are going to try and obfuscate your field names when
sending explain output so we don't know you are using wikipedia data (not
that we care), please at least be consistent about it so the final
explanations actually make sense -- it will save everyone a lot of confusion
and help us help you)

the biggest factor in your scores seems to be the fieldNorms for your
name, title and alias fields ... they are so high, that tf and idf are
pretty much irrelevant.
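
To make the "norms swamp tf and idf" point concrete, here is a rough Python
sketch of the per-term factors in Lucene's default similarity as I understand
them: tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), and
fieldNorm = boost / sqrt(numTerms). The boost of 50, the collection size and
the term counts are invented for illustration and are not taken from the
explain output in this thread; coord, queryNorm and the lossy one-byte norm
encoding are deliberately left out.

```python
import math

def tf(freq):
    # term-frequency factor: square root of the raw count
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    # inverse document frequency: rarer terms get a larger value
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_terms, boost=1.0):
    # index-time boost multiplied by the length norm 1/sqrt(numTerms)
    return boost / math.sqrt(num_terms)

def term_score(freq, num_docs, doc_freq, num_terms, boost=1.0):
    # idf enters twice (query weight and field weight), hence the square
    return tf(freq) * idf(num_docs, doc_freq) ** 2 * field_norm(num_terms, boost)

NUM_DOCS = 1_000_000  # hypothetical collection size

# A rare term matched 4 times in a 1000-term, unboosted body field.
body = term_score(freq=4, num_docs=NUM_DOCS, doc_freq=99, num_terms=1000)

# A very common term matched once in a 4-term title field.
title_unboosted = term_score(freq=1, num_docs=NUM_DOCS, doc_freq=99_999,
                             num_terms=4)
title_boosted = term_score(freq=1, num_docs=NUM_DOCS, doc_freq=99_999,
                           num_terms=4, boost=50.0)  # hypothetical boost

print(f"body match (rare term):        {body:.2f}")
print(f"title match, no boost:         {title_unboosted:.2f}")
print(f"title match, index boost = 50: {title_boosted:.2f}")
```

With these made-up numbers the rare-term body match outscores the unboosted
common-term title match, but once the boost is baked into the norm the title
match wins by a wide margin no matter what tf and idf say.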

By the looks of it, when you were indexing your docs, you used a
consistent field boost per field on every instance of that field for every
document ... this is really not a use case where index time field (or
document) boosts make sense.  in my opinion the number one thing you can
do to improve your relevancy right now is to stop using index time
boosts and use query boosts instead.
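
A boost that is the same for a given field in every document multiplies every
score contribution from that field by one constant, which is also what a
query-time boost on that clause does. The Python sketch below checks that
equivalence on a simplified score (tf * idf^2 * boost * norm per clause,
summed; no coord or queryNorm, and ignoring the precision lost when norms are
stored in one byte). The field names come from this thread; the boost values
and per-document statistics are hypothetical.

```python
import math

# Hypothetical per-field boosts, identical for every document.
FIELD_BOOST = {"name": 8.0, "title": 4.0, "body": 1.0}

def clause_score(freq, idf, num_terms, index_boost=1.0, query_boost=1.0):
    # An index-time boost is folded into the stored norm;
    # a query-time boost multiplies the clause weight instead.
    norm = index_boost / math.sqrt(num_terms)
    return math.sqrt(freq) * idf ** 2 * query_boost * norm

def doc_score(fields, boost_at):
    # fields maps field name -> (freq, idf, num_terms) for the matched term
    total = 0.0
    for field, (freq, idf, num_terms) in fields.items():
        boost = FIELD_BOOST[field]
        if boost_at == "index":
            total += clause_score(freq, idf, num_terms, index_boost=boost)
        else:
            total += clause_score(freq, idf, num_terms, query_boost=boost)
    return total

# Made-up match statistics for two documents.
docs = {
    "doc_a": {"name": (1, 5.0, 2), "body": (3, 5.0, 900)},
    "doc_b": {"title": (1, 5.0, 5), "body": (10, 5.0, 4000)},
}

index_scores = {d: doc_score(f, "index") for d, f in docs.items()}
query_scores = {d: doc_score(f, "query") for d, f in docs.items()}

for d in docs:
    print(d, round(index_scores[d], 3), round(query_scores[d], 3))
```

Under these assumptions both variants give the same scores up to rounding.
The practical difference is that the query-time values can be changed per
query without touching the index.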

If you don't want to reindex completely, the LengthNormModifier class (in
the misc contrib) can update all of your norms in place without reindexing
and throw away any index time boosts you had.


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


