Re: short documents = help me tweak Similarity??

Andrew Hudson Thu, 05 Apr 2007 13:03:58 -0700

The problem comes when your float value is encoded into that 8 bit
field norm, the 3 length and 4 length both become the same 8 bit
value.  Call Similarity.encodeNorm on the values you calculate for the
different numbers of terms and make sure they return different byte
values.


Andrew

On 4/5/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

As far as I know, this is the case where you want your custom Similarity that 
knows how to deal with a small number of terms.

  public float lengthNorm(String fieldName, int numTerms) {
    if (numTerms < N)
      // return something smart
    return (float)(1.0 / Math.sqrt(numTerms));
  }

I think the rest of what you said is correct.  Look at this piece of Similarity 
javadoc:

 *      However the resulted <i>norm</i> value is [EMAIL PROTECTED] 
#encodeNorm(float) encoded} as a single byte
 *      before being stored.
 *      At search time, the norm byte value is read from the index
 *      [EMAIL PROTECTED] org.apache.lucene.store.Directory directory} and
 *      [EMAIL PROTECTED] #decodeNorm(byte) decoded} back to a float 
<i>norm</i> value.
 *      This encoding/decoding, while reducing index size, comes with the price 
of
 *      precision loss - it is not guaranteed that decode(encode(x)) = x.
 *      For instance, decode(encode(0.89)) = 0.75.
 *      Also notice that search time is too late to modify this <i>norm</i> 
part of scoring, e.g. by
 *      using a different [EMAIL PROTECTED] Similarity} for search.

If you come up with a more generic lengthNorm that dels with "overly short" 
documents/fields well, I'd love to know! :)

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message ----
From: John Kleven <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, April 5, 2007 1:45:34 PM
Subject: Re: short documents = help me tweak Similarity??

Sorry to re-post -- is this the correct forum for questions like this?  I
think that writing a new encode/decode operation should help alleviate my
problem, but thought that this must be fairly widespread issue for people
using lucene for "non-web-page" searches (i.e., shorter documents)

Thanks again,
John

On 4/2/07, John Kleven <[EMAIL PROTECTED]> wrote:
>
> My documents are cars...
> i.e.,
> Nissan Altima Sports Package
> Nissan Altima Standard
>
> The problem I have is when i search "Nissan Altima", I want to get the 2nd
> hit back first, i.e. "Nissan Altima Standard", because it is shorter.
> However, this doesn't happen.  They are both scored the exact same.
>
> I know that the lengthNorm in Similarity is using 1/sqrt(numTerms), and
> you would think that would be enuff to make sure the order is correct.
> However, it is not, and I assume this is because of the encode/decode
> functions that pack this value into a single byte do not have the
> granularity to represent differences between numbers like 1/sqrt(3) vs
> 1/sqrt(4)??
>
> Is the suggested approach here to re-write the encode/decode operations,
> or is there any easier way?
>
> Thanks kindly -
> John

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: short documents = help me tweak Similarity??

Reply via email to