Hi All,
Lucene scoring can favour documents with shorter fields. In the past we've
worked around this by padding 'unreasonably' short fields up to a set minimum
length (with what are basically nonsense words). I'm wondering how others
have dealt with this issue.
Another option would be a custom Similarity class with an altered
lengthNorm method. Has anyone tried that?
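To make the second option concrete, here is a rough standalone sketch of what
I have in mind. It does not depend on the Lucene jar; it only mirrors the
1/sqrt(num_terms_in_field) formula and treats any field shorter than a floor
as if it had that many terms. The class name, method names and MIN_TERMS = 50
are made up for illustration. In Lucene itself I'd expect this logic to live
in a DefaultSimilarity subclass overriding lengthNorm(String fieldName, int
numTerms), but that signature should be checked against the version in use,
and since norms are computed at index time the index would presumably need
rebuilding with the new Similarity in place.

```java
// Standalone sketch only: no Lucene dependency. All names here are
// hypothetical; MIN_TERMS is an assumed tuning value, not a Lucene setting.
final class FlooredLengthNorm {

    // Fields shorter than this are scored as if they had this many terms.
    static final int MIN_TERMS = 50;

    // The default formula quoted in this mail: 1/sqrt(num_terms_in_field).
    static float defaultLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Altered version: clamp the term count to MIN_TERMS before applying
    // the default formula, so very short fields stop getting an extra boost.
    static float flooredLengthNorm(int numTerms) {
        return defaultLengthNorm(Math.max(numTerms, MIN_TERMS));
    }

    public static void main(String[] args) {
        for (int n : new int[] {5, 50, 300}) {
            System.out.printf("numTerms=%d default=%.4f floored=%.4f%n",
                    n, defaultLengthNorm(n), flooredLengthNorm(n));
        }
    }
}
```

Compared with padding, this has the same effect on the norm for short fields
without putting nonsense words into the index.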
Cheers,
Dan
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
From:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html
score(q,d) = coord(q,d) · queryNorm(q) · SUM( tf(t in d) ·
idf(t)^2 · t.getBoost() · norm(t,d) )
Given a one-term query whose term is found in two documents, doc{a} and
doc{b} (with no boost on field, doc or query term):
score(q,d) =~ SUM ( tf(t in d) · norm(t,d) )
and for one term:
score(q,d) =~ tf(t in d) · norm(t,d)
also:
norm(t,d) =~ lengthNorm(field)
lengthNorm(field) :
computed when the document is added to the index in accordance with the
number of tokens of this field in the document, so that shorter fields
contribute more to the score
in DefaultSimilarity.java
lengthNorm(field) = 1/sqrt(num_terms_in_field)
doc{a} field{a} num_terms_in_field = 100, term appears 10 times in
field{a},doc{a}
score =~ 10/sqrt(100) = 1
doc{b} field{a} num_terms_in_field = 300, term appears 10 times in
field{a},doc{b}
score =~ 10/sqrt(300) = 0.577350269
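
The two figures above can be reproduced with a few lines of plain Java. This
is only the simplified single-term form used in this mail (raw term count
times 1/sqrt(num_terms_in_field)); the class and method names are made up.
As far as I know DefaultSimilarity actually applies sqrt(freq) for tf, which
changes the absolute values but not the fact that the shorter field wins.

```java
// Standalone sketch of the simplified single-term score used above.
// Names are hypothetical; this is not Lucene's full scoring formula.
final class SimplifiedScore {

    // score =~ tf(t in d) * lengthNorm(field), with tf taken as the raw
    // term count and lengthNorm = 1/sqrt(num_terms_in_field).
    static double score(int termFreq, int numTermsInField) {
        return termFreq / Math.sqrt(numTermsInField);
    }

    public static void main(String[] args) {
        System.out.println("doc{a}: " + score(10, 100)); // 100-term field
        System.out.println("doc{b}: " + score(10, 300)); // 300-term field
    }
}
```

So with the same term count, tripling the field length cuts the approximate
score by a factor of sqrt(3).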
Daniel Rosher
Developer