Re: Title Search scoring issues with multivalued field & norm

Erick Erickson Wed, 31 Jan 2018 09:10:29 -0800

Or use a boost for the phrase, something like
"beauty and the beast"^5
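
A minimal sketch of how such a boosted phrase could be sent to Solr with edismax. The field names `title` and `title_alias` are hypothetical (the second follows the separate-alias-field idea quoted below), and appending the boosted phrase to the raw user query is an assumption about how the clause would be combined, not something the thread prescribes.

```python
from urllib.parse import urlencode


def build_title_params(user_query, phrase_boost=5):
    """Build edismax request parameters with an explicitly boosted phrase clause."""
    # Strip embedded double quotes so the phrase clause stays well-formed.
    cleaned = user_query.replace('"', "")
    boosted_phrase = '"{}"^{}'.format(cleaned, phrase_boost)
    return {
        "defType": "edismax",
        # Plain terms for recall, plus the boosted phrase for exact-order matches.
        "q": "{} {}".format(cleaned, boosted_phrase),
        # Hypothetical field names; replace with the fields in your schema.
        "qf": "title title_alias",
    }


params = build_title_params("beauty and the beast")
query_string = urlencode(params)
print(query_string)
```

The resulting query string can be appended to the select handler URL of whichever collection is being queried.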


On Wed, Jan 31, 2018 at 8:43 AM, Walter Underwood <wun...@wunderwood.org> wrote:
> You can use a separate field for title aliases. That is what I did for 
> Netflix search.
>
> Why disable idf? Disabling tf for titles can be a good idea. For example, the 
> movie “New York, New York” is not twice as much about New York as some other 
> film that mentions it only once.
>
> Also, consider using a popularity score as a boost.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>> On Jan 31, 2018, at 4:38 AM, Sravan Kumar <sra...@caavo.com> wrote:
>>
>> Hi,
>> We are using Solr for our movie title search.
>>
>>
>> Because this is a "title search", it should be treated differently from
>> normal document search.
>> Hence, we use a modified version of TFIDFSimilarity with the following
>> changes:
>> -  disabled TF & IDF, so both always have the value 1.
>> -  disabled norms by setting omitNorms to true for all the fields.
>>
>> There are 6 fields with different analyzers and we make use of different
>> weights in edismax's qf & pf parameters to match tokens & boost phrases.
>>
>> But, movies could have aliases and have multiple titles. So, we made the
>> fields multivalued.
>>
>> Now, consider the following four documents
>> 1>  "Beauty and the Beast"
>> 2>  "The Real Beauty and the Beast"
>> 3>  "Beauty and the Beast", "La bella y la bestia"
>> 4>  "Beauty and the Beast"
>>
>> Note: Document 3 has two titles in it.
>>
>> So, for the query "Beauty and the Beast" with the above configuration, all
>> the documents receive the same score. But documents 1, 3 and 4 should get
>> the same score, and document 2 should score lower than the others.
>>
>> To solve this, we followed what is suggested in the following thread:
>> http://lucene.472066.n3.nabble.com/Influencing-scores-on-values-in-multiValue-fields-td1791651.html
>>
>> Now, the fields used for boosting have norms enabled, and the fields used
>> for matching have norms disabled. This is to make sure that exact and
>> near-exact matches are rewarded.
>>
>> But, for the same query, we get the following results.
>> query: "Beauty & the Beast"
>> Search Results:
>> 1>  "Beauty and the Beast"
>> 4>  "Beauty and the Beast"
>> 2>  "The Real Beauty and the Beast"
>> 3>  "Beauty and the Beast", "La bella y la bestia"
>>
>> Clearly, the changes have solved only part of the problem. Document 3
>> should be ranked/scored higher than document 2.
>>
>> This is because Lucene uses the total field length across all the
>> values in a multivalued field for normalization.
>>
>> How do we handle this scenario and make sure that in multivalued fields the
>> normalization is taken care of?
>>
>>
>> --
>> Regards,
>> Sravan
>
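
The length-normalization effect described in the original question can be illustrated with a small sketch. It assumes the classic TF-IDF length norm of 1/sqrt(number of terms), whitespace tokenization with no stop-word removal, and that all values of a multivalued field are counted together; it ignores the lossy encoding Lucene applies when storing norms, so the numbers are illustrative only.

```python
import math


def length_norm(values):
    """Classic-style length norm over all values of a multivalued field."""
    # All values contribute to a single field length (assumed whitespace tokens).
    num_terms = sum(len(value.split()) for value in values)
    return 1.0 / math.sqrt(num_terms)


# The four documents from the question, keyed by document number.
docs = {
    1: ["Beauty and the Beast"],
    2: ["The Real Beauty and the Beast"],
    3: ["Beauty and the Beast", "La bella y la bestia"],
    4: ["Beauty and the Beast"],
}

norms = {doc_id: length_norm(titles) for doc_id, titles in docs.items()}

# Higher norm ranks first; ties are broken by document number.
ranking = sorted(norms, key=lambda doc_id: (-norms[doc_id], doc_id))

for doc_id in ranking:
    print(doc_id, round(norms[doc_id], 3))
```

Under these assumptions document 3 is charged for nine terms (four plus five), so its norm falls below that of document 2 with five terms, which is consistent with the ordering reported above.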
