Thank you kindly for the responses. This was the solution I initially dreamed up as well: overriding lengthNorm and making the returned values for small numTerms values (e.g. 3 and 4) more distinct.
So I did that in multiple ways, and I ran into a different problem. Suppose lengthNorm returns something like this:

  numTerms    return from lengthNorm()
  1           2.4
  2           2.0
  3           1.6
  4           1.2
  5           0.8
  >5          1/sqrt(numTerms)
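For concreteness, the kind of Similarity override I mean looks roughly like this (just a sketch built on DefaultSimilarity; the step values are simply the ones from the table above and purely illustrative):

  import org.apache.lucene.search.DefaultSimilarity;

  // Sketch of a lengthNorm override that keeps short field lengths
  // far enough apart to survive the 8-bit norm encoding.
  public class StepLengthNormSimilarity extends DefaultSimilarity {
      public float lengthNorm(String fieldName, int numTerms) {
          switch (numTerms) {
              case 1: return 2.4f;
              case 2: return 2.0f;
              case 3: return 1.6f;
              case 4: return 1.2f;
              case 5: return 0.8f;
              default: return (float) (1.0 / Math.sqrt(numTerms));
          }
      }
  }

It gets set on both the IndexWriter and the searcher via setSimilarity(), and the index has to be rebuilt so the stored norms reflect the new values.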
With a lengthNorm like that, you wind up with a weird situation. For a document like "Nissan Altima SE Sports Package Sunroof CD 2003" -- imagine someone searching for that exact document with the exact search string (i.e., they literally search "Nissan Altima SE Sports Package Sunroof CD 2003") -- the *wrong* hits bubble to the top: "Nissan Altima Sports Package" will be the #1 hit even though there was an exact document matching every term.

So then I tried a bunch of different lengthNorm implementations (obviously the example above is drastic) -- some drastic, some less so, but all hopefully spreading the lengthNorm values far enough apart that encode/decode would still see a difference in field length. I've yet to find a happy medium that works. I know I'm doing something stupid, or not understanding the scoring algorithm well enough (i.e., maybe there's another part of the Similarity class that should take into account when you have lots of search terms and corresponding matches), but I just couldn't find a way to make it happen. This is why I was wondering if I could just overload encode/decode in both my indexer and searcher, and thereby avoid this whole thing entirely. But apparently that isn't a solution either?

Also, I don't understand why the encode/decode functions have a range of 7x10^9 to 2x10^-9, when it seems to me the most common values (with boosts set to 1.0) are somewhere between 0 and 1.0. When would somebody have a monster huge value like 7x10^9? Even with a huge index-time boost of 20.0 or something, why would encode/decode need a range as huge as the current implementation's? I know I am missing something.
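(A quick way to see what the encoding actually covers is to dump the whole decode table -- this is just a throwaway sketch I'd run, nothing official:)

  import org.apache.lucene.search.Similarity;

  // Print all 256 values the 8-bit norm byte can decode to, to see the
  // actual range and granularity of the encoding.
  public class DumpNormTable {
      public static void main(String[] args) {
          for (int i = 0; i < 256; i++) {
              float decoded = Similarity.decodeNorm((byte) i);
              System.out.println("byte " + i + " -> " + decoded);
          }
      }
  }

Since only 256 distinct values get spread over that whole range, the steps between neighboring values end up fairly coarse, which is presumably why nearby lengthNorm values collapse to the same byte.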
Thanks so much for the responses. I've been playing with this for so long I knew it was time to post a message and get to the bottom of it. Thank you again - J

On 4/5/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:

> : The problem comes when your float value is encoded into that 8 bit
> : field norm, the 3 length and 4 length both become the same 8 bit
> : value. Call Similarity.encodeNorm on the values you calculate for the
> : different numbers of terms and make sure they return different byte
> : values.
>
> bingo.
>
> You can't change encodeNorm and decodeNorm (because they are static --
> they need to work at query time regardless of what Similarity was used
> at index time) but you can change the function of your length norm to
> make small deviations in short lengths significant enough to register
> even with the encoding.
>
> -Hoss
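For reference, the check Hoss suggests can look something like this (again just a sketch; the "title" field name and the StepLengthNormSimilarity class from the example above are placeholders):

  import org.apache.lucene.search.Similarity;

  // Push each lengthNorm value through encodeNorm and verify that short
  // field lengths still come out as distinct bytes after encoding.
  public class NormCollisionCheck {
      public static void main(String[] args) {
          StepLengthNormSimilarity sim = new StepLengthNormSimilarity();
          for (int n = 1; n <= 10; n++) {
              float norm = sim.lengthNorm("title", n);
              byte encoded = Similarity.encodeNorm(norm);
              System.out.println(n + " terms: lengthNorm=" + norm
                      + ", byte=" + encoded
                      + ", decoded=" + Similarity.decodeNorm(encoded));
          }
      }
  }

If two lengths you care about print the same byte, the lengthNorm curve needs a bigger spread at those lengths.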