: > q=david bowie changes
: >
: > Problem : If a record mentions david bowie a lot, it beats out something
: > more relevant (more unique matches) ...
: >
: > A. (now appearing david bowie at the cineplex 7pm david bowie goes on stage,
: > then mr. bowie will sign autographs)
: > B. song :david bowie - changes
: >
: > (A) ends up more relevant because of the frequency or number of words in
: > it.. not cool...
: > I want it so the number of words matching will trump density/weight....

debugQuery=true is your friend: it will show you exactly how the scores are being computed. The key factors in something like this are fieldNorm, tf, and the coord factor.

The fieldNorm includes the length of the field as a factor, so as long as you have omitNorms=false configured for this field, doc #A should be penalized relative to doc #B for being longer. If you omit norms, that won't help you, so start by checking that.

The coord factor penalizes documents that don't match all of the clauses of a boolean query (ie: doc #A only matches 2/3 clauses because it doesn't match the word "changes"), so you could customize your Similarity implementation to make that coord penalty higher, but that requires some custom Java code.

As an extreme option, you could use omitTf to completely eliminate term frequency as a factor in scoring (so the number of times "bowie" appears won't affect the score, just that it appears at least once), but that probably isn't what you want: "david bowie changes some stuff" would get the same score as "david bowie changes david bowie".

In general, the simplest way to deal with a lot of this type of thing is to think about how you are structuring your query. Something as simple as using the dismax parser with your field in both the "qf" and "pf" params (and a little bit of slop in the "ps" param) may give you exactly what you want, since it will reward docs where the whole query string appears in the field...

https://wiki.apache.org/solr/DisMaxQParserPlugin

-Hoss
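
For illustration, the dismax suggestion above could be sketched as a set of request parameters. This is only a sketch under assumptions: the field name "title" and the slop value of 2 are hypothetical placeholders, and defType=dismax assumes a Solr version that selects the parser that way (otherwise a request handler configured for dismax would be needed).

```python
from urllib.parse import urlencode


def build_dismax_params(user_query, field, phrase_slop=2):
    """Build Solr request params that use the same field for qf and pf.

    `field` and `phrase_slop` are illustrative values, not taken from the thread.
    """
    return {
        "defType": "dismax",      # assumes the parser can be selected via defType
        "q": user_query,          # raw user query string
        "qf": field,              # field(s) searched for the individual terms
        "pf": field,              # field(s) where a whole-phrase match adds a boost
        "ps": str(phrase_slop),   # slop allowed when matching the phrase in pf
        "debugQuery": "true",     # include the score explanation in the response
    }


# "title" is a hypothetical field name; substitute the field from your schema.
params = build_dismax_params("david bowie changes", "title")
query_string = urlencode(params)
print(query_string)
```

The resulting query string is appended to the select URL of your own Solr instance. With debugQuery=true the response should include the score explanation for each doc, so you can check whether the phrase match from "pf" is pushing doc #B above doc #A.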