Thank you kindly for the responses.

This was the solution I dreamed up initially as well: overriding
lengthNorm and making the returned values for small numTerms values (e.g. 3
and 4) more distinct from one another.

So I did that in multiple ways, and I ran into a different problem.  If
lengthNorm returns something like this...
numTerms    return from lengthNorm()
1           2.4
2           2.0
3           1.6
4           1.2
5           0.8
>5          1/sqrt(numTerms)

...you wind up with a weird situation.  Take a document like
"Nissan Altima SE Sports Package Sunroof CD 2003" -- if someone searches for
that exact document with the exact search string (i.e., they literally search
"Nissan Altima SE Sports Package Sunroof CD 2003"), then the *wrong* hits
bubble to the top: e.g., "Nissan Altima Sports Package" will be the #1 hit
even though there is a document matching every term exactly.

So then I tried a bunch of different lengthNorm implementations (since the
example lengthNorm above is obviously drastic): some drastic, a la the
example above, and some less drastic, but still hopefully spreading the
lengthNorm values enough that encode/decode would still see a difference
between field lengths.
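
To be concrete, the overrides I tried looked roughly like this (just a
sketch against the Lucene 2.x Similarity API; the return values are the
example numbers from the table above, not a recommendation):

import org.apache.lucene.search.DefaultSimilarity;

// Sketch: spread lengthNorm out for very short fields so the differences
// between 3, 4, 5 terms survive the 8-bit norm encoding.
public class ShortFieldSimilarity extends DefaultSimilarity {
    public float lengthNorm(String fieldName, int numTerms) {
        switch (numTerms) {
            case 1: return 2.4f;
            case 2: return 2.0f;
            case 3: return 1.6f;
            case 4: return 1.2f;
            case 5: return 0.8f;
            // Fall back to the default 1/sqrt(numTerms) for longer fields.
            default: return (float) (1.0 / Math.sqrt(numTerms));
        }
    }
}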

And I've yet to find a happy medium that works.  I know I'm doing something
stupid, or not understanding the scoring algorithm well enough (i.e., there's
probably another part of the Similarity class that should account for the
case where you have lots of search terms and corresponding matches), but I
just couldn't find a way to make it happen.

This is why I was wondering if I could just override encode/decode in both my
indexer and searcher, and thereby avoid this whole thing entirely.  But
apparently that isn't a solution either?
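
In case the wiring matters: the custom Similarity gets set on both sides,
roughly like this (index path and class names are just placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Similarity;

public class SimilarityWiring {
    public static void main(String[] args) throws Exception {
        // Same custom Similarity at index time and at search time.
        Similarity sim = new ShortFieldSimilarity();

        IndexWriter writer =
            new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        writer.setSimilarity(sim);
        // ... add documents ...
        writer.close();

        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        searcher.setSimilarity(sim);
        // ... run queries ...
        searcher.close();
    }
}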

Also, I don't understand why the encode/decode functions have a range of
7x10^9 to 2x10^-9, when it seems to me the most common values (with boosts
set to 1.0) are somewhere between 0 and 1.0.  When would somebody have a
monstrously huge value like 7x10^9?  Even with a huge index-time boost of
20.0 or something, why would encode/decode need a range as huge as the
current implementation's?
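
To make the range question concrete, this is the kind of round-trip I mean
(a throwaway sketch using the static encodeNorm/decodeNorm; the sample
values are arbitrary):

import org.apache.lucene.search.Similarity;

public class NormRangeCheck {
    public static void main(String[] args) {
        // Round-trip some "ordinary" norms (no boost, or a modest boost)
        // through the 8-bit encoding to see how coarsely they come back.
        float[] values = { 0.25f, 0.5f, 0.707f, 1.0f, 2.4f, 20.0f };
        for (int i = 0; i < values.length; i++) {
            byte b = Similarity.encodeNorm(values[i]);
            System.out.println(values[i] + " -> byte " + b
                    + " -> " + Similarity.decodeNorm(b));
        }
    }
}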

I know I am missing something.  Thanks so much for the responses; I've been
playing with this for so long that I knew it was time to post a message and
get to the bottom of it.  Thank you again -
J



On 4/5/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:


: The problem comes when your float value is encoded into that 8 bit
: field norm, the 3 length and 4 length both become the same 8 bit
: value.  Call Similarity.encodeNorm on the values you calculate for the
: different numbers of terms and make sure they return different byte
: values.

bingo.  You can't change encodeNorm and decodeNorm (because they are
static -- they need to work at query time regardless of what Similarity
was used at index time) but you can change the function of your length
norm to make small deviations in short lengths significant enough to
register even with the encoding.
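
A minimal version of that check might look like this (assuming a custom
Similarity like the one sketched earlier, and "title" as a stand-in field
name):

import org.apache.lucene.search.Similarity;

public class NormDistinctnessCheck {
    public static void main(String[] args) {
        Similarity sim = new ShortFieldSimilarity();
        for (int numTerms = 1; numTerms <= 8; numTerms++) {
            float norm = sim.lengthNorm("title", numTerms);
            byte encoded = Similarity.encodeNorm(norm);
            // If two term counts map to the same byte, their length
            // normalization will be identical after indexing.
            System.out.println(numTerms + " terms -> " + norm
                    + " -> byte " + encoded);
        }
    }
}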




-Hoss

