Sure, here are some real-world examples from my time at Netflix.

Is this movie twice as much about “new york”?

* New York, New York

Which one of these is the best match for “blade runner”:

* Blade Runner: The Final Cut
* Blade Runner: Theatrical & Director’s Cut
* Blade Runner: Workprint

http://dvd.netflix.com/Search?v1=blade+runner 

At Netflix (when I was there), those were shown in popularity order with a 
boost function.

And for stemming, should the movie “Saw” match “see”? Maybe not.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 20, 2016, at 5:28 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
> 
> Maybe it's a cultural difference, but I can't imagine why on a query for
> "John", any of those titles would be treated as anything other than equals
> - namely, that they are all about John. Maybe the issue is that this seems
> like a contrived example, and I'm asking for a realistic example. Or, maybe
> you have some rule of relevance that you haven't yet shared - and I mean
> rule that a user would comprehend and consider valuable, not simply a
> mechanical rule.
> 
> 
> 
> -- Jack Krupansky
> 
> On Wed, Apr 20, 2016 at 8:10 PM, <jimi.hulleg...@svensktnaringsliv.se>
> wrote:
> 
>> Ok sure, I can try and give some examples :)
>> 
>> Let's say that we have the following documents:
>> 
>> Id: 1
>> Title: John Doe
>> 
>> Id: 2
>> Title: John Doe Jr.
>> 
>> Id: 3
>> Title: John Lennon: The Life
>> 
>> Id: 4
>> Title: John Thompson's Modern Course for the Piano: First Grade Book
>> 
>> Id: 5
>> Title: I Rode With Stonewall: Being Chiefly The War Experiences of the
>> Youngest Member of Jackson's Staff from John Brown's Raid to the Hanging of
>> Mrs. Surratt
>> 
>> 
>> And in general, when a search word matches the title, I would like to have
>> the length of the title field influence the score, so that matching
>> documents with shorter titles get a higher score than documents with longer
>> titles, all else being equal.
>> 
>> So, when a user searches for "John", I would like the results to be pretty
>> much in the order presented above. Though, it is not crucial that for
>> example document 1 comes before document 2. But I would surely want
>> documents 1-3 to come before documents 4 and 5.
>> 
>> In my mind, the fieldNorm is a perfect solution for this. At least in
>> theory. In practice, the encoding of the fieldNorm seems to make this
>> function much less useful for this use case. Unless I have missed something.
>> 
>> Is there another way to achieve something like this? Note that I don't want
>> a general boost on documents with short titles, I only want to boost them
>> if the title field actually matched the query.
>> 
>> /Jimi
>> 
>> ________________________________________
>> From: Jack Krupansky <jack.krupan...@gmail.com>
>> Sent: Thursday, April 21, 2016 1:28 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Is it possible to configure a minimum field length for the
>> fieldNorm value?
>> 
>> I'm not sure I fully follow what distinction you're trying to focus on. I
>> mean, traditionally length normalization has simply tried to distinguish a
>> title field (rarely more than a dozen words) from a full body of text, or
>> maybe an abstract, not things like exactly how many words were in a title.
>> Or, as another example, a short newswire article of a few paragraphs vs. a
>> feature-length article, paper, or even book. IOW, traditionally it was more
>> of a boolean than a broad range of values. Sure, yes, you absolutely can
>> define a custom similarity with a custom norm that supports a wide range of
>> lengths, but you'll have to decide what you really want to achieve to tune
>> it.
>> 
>> Maybe you could give a couple examples of field values that you feel should
>> be scored differently based on length.
>> 
>> -- Jack Krupansky
>> 
>> On Wed, Apr 20, 2016 at 7:17 PM, <jimi.hulleg...@svensktnaringsliv.se>
>> wrote:
>> 
>>> I am talking about the title field. And for the title field, a sweetspot
>>> interval of 1 to 50 makes very little sense. I want to have a fieldNorm
>>> value that differentiates between for example 2, 3, 4 and 5 terms in the
>>> title, but only very little.
>>> 
>>> The 20% number I got by simply calculating the difference in the title
>>> fieldNorm of two documents, where one title was one word longer than the
>>> other title. And one fieldNorm value was 20% larger than the other as a
>>> result of that. And since we use multiplicative scoring calculation, a 20%
>>> increase in the fieldNorm results in a 20% increase in the final score.
>>> 
>>> I'm not talking about "scores as percentages". I'm simply noting that this
>>> minor change in the text data (adding or removing one single word) causes
>>> the score to change by almost 20%. I noted this when I renamed a
>>> document, removing a word from the title, and that single change caused the
>>> document to move up several positions in the result list. We don't want
>>> such minor modifications to have such a big impact on the resulting score.
>>> 
>>> I'm not sure I can agree with you that "the effect of document length
>>> normalization factor is minimal". Then why does it impact our result in
>>> such a big way? And as I said, we don't want to disable it completely, we
>>> just want it to have a much lesser effect, even on really short texts.
>>> 
>>> /Jimi
>>> 
>>> ________________________________________
>>> From: Ahmet Arslan <iori...@yahoo.com.INVALID>
>>> Sent: Thursday, April 21, 2016 12:10 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Is it possible to configure a minimum field length for the
>>> fieldNorm value?
>>> 
>>> Hi Jimi,
>>> 
>>> Please define a meaningful document-length range like min=1 max=50.
>>> By the way you need to reindex every time you change something.
>>> 
>>> Regarding the 20% score change, I am not sure how you calculated that
>>> number, but I assume it is correct.
>>> What really matters is the relative order of documents. It doesn't mean
>>> anything that the addition of a word decreases the initial score by x%.
>>> Please see:
>>> https://wiki.apache.org/lucene-java/ScoresAsPercentages
>>> 
>>> There is an information retrieval heuristic which says that the addition
>>> of a non-query term should decrease the score.
>>> 
>>> Lucene's default document length normalization may favor short documents
>>> too much. But folks blend the score with other structural fields
>>> (popularity), or even completely bypass the relevancy score and order by
>>> price, production date, etc. I mean there are many use cases where the
>>> effect of the document length normalization factor is minimal.
>>> 
>>> Lucene/Solr is highly pluggable, very easy to customize.
>>> 
>>> Ahmet
>>> 
>>> 
>>> On Wednesday, April 20, 2016 11:05 PM, "jimi.hulleg...@svensktnaringsliv.se"
>>> <jimi.hulleg...@svensktnaringsliv.se> wrote:
>>> Hi Ahmet,
>>> 
>>> SweetSpotSimilarity seems quite nice. Some simple testing by throwing some
>>> different values at the class gives quite good results. Setting ln_min=1,
>>> ln_max=2, steepness=0.1 and discountOverlaps=true should give me more or
>>> less what I want. At least for the title field. I'm not sure what the
>>> actual effect of those settings would be on longer text fields, so maybe I
>>> will use the SweetSpotSimilarity only for the title field to start with.
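
A minimal sketch of what those settings do to the length norm. The formula is an assumption, written from memory of how SweetSpotSimilarity computes its length norm, and the class and method names are hypothetical. It only shows the plateau between ln_min and ln_max and the gentler falloff outside it, next to the classic inverse square root.

```java
// Sketch of a sweet-spot length norm. The formula is an assumption recalled
// from Lucene's SweetSpotSimilarity; class and method names are hypothetical.
class SweetSpotNormDemo {

    // Lengths inside [min, max] get a norm of exactly 1.0. Outside that
    // plateau the norm falls off at a rate controlled by steepness.
    static float lengthNorm(int numTerms, int min, int max, float steepness) {
        int distance = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return (float) (1.0 / Math.sqrt(steepness * distance + 1.0f));
    }

    public static void main(String[] args) {
        // Settings quoted in the message: ln_min=1, ln_max=2, steepness=0.1
        for (int n = 1; n <= 6; n++) {
            System.out.printf("%d terms: sweet spot=%.4f classic=%.4f%n",
                    n, lengthNorm(n, 1, 2, 0.1f), 1.0 / Math.sqrt(n));
        }
    }
}
```

These are the values before the norm is stored. If the norm is still packed into a single byte at index time, as it is for the classic similarity, the stored values will be coarser than what this prints.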
>>> 
>>> Of course I understand that there are many things that can be considered
>>> domain specific requirements, like whether to favor/punish short/medium/long
>>> texts, and how. I was just wondering how many actual use cases there are
>>> where one wants a ~20% difference in score between two documents, where
>>> the only difference is that one of the documents has one extra word in one
>>> field. (And now I'm talking about an extra word that doesn't affect
>>> anything else except the fieldNorm value). I for one find it hard to find
>>> such a use case, and would consider it a very special use case, and would
>>> consider a more lenient calculation a better fit for most use cases (and
>>> therefore most domains). :)
>>> 
>>> /Jimi
>>> 
>>> 
>>> -----Original Message-----
>>> From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID]
>>> Sent: Wednesday, April 20, 2016 8:14 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Is it possible to configure a minimum field length for the
>>> fieldNorm value?
>>> 
>>> Hi Jimi,
>>> 
>>> SweetSpotSimilarity allows you to define a document length range, so that
>>> all documents in that range will get the same fieldNorm value.
>>> In your case, you can say that documents from 1 word up to 100 words get
>>> no document length punishment. If a document is longer than 100 words,
>>> apply some punishment.
>>> 
>>> By the way, favoring/punishing short, middle, or long documents is a
>>> domain-specific thing. You are free to decide what to do.
>>> 
>>> Ahmet
>>> 
>>> 
>>> 
>>> On Wednesday, April 20, 2016 7:46 PM, "jimi.hulleg...@svensktnaringsliv.se"
>>> <jimi.hulleg...@svensktnaringsliv.se> wrote:
>>> OK. Well, still, the fact that the score increases almost 20% because of
>>> just one extra term in the field, is not really reasonable if you ask me.
>>> But you seem to say that this is expected, reasonable and wanted behavior
>>> for most use cases?
>>> 
>>> I'm not sure that I feel comfortable replacing the default Similarity
>>> implementation with a custom one. That would just increase the complexity
>>> of our setup and would make future upgrades harder (we would for example
>>> have to remember to check if the default similarity configuration or
>>> implementation changes).
>>> 
>>> No, if it really is the case that most people like and want this, and
>>> there is no way to configure Solr/Lucene to calculate fieldNorm in a more
>>> reasonable way (in my book) for short field values, then I just think we
>>> are forced to set omitNorms="true", maybe in combination with a simple
>>> field boost for shorter fields.
>>> 
>>> /Jimi
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
>>> Sent: Wednesday, April 20, 2016 5:18 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Is it possible to configure a minimum field length for the
>>> fieldNorm value?
>>> 
>>> FWIW, length for normalization is measured in terms (tokens), not
>>> characters.
>>> 
>>> With TF-IDF similarity (the default before 6.0), the normalization is based
>>> on the inverse square root of the number of terms in the field:
>>> 
>>> return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
>>> 
>>> That code is in ClassicSimilarity:
>>> 
>>> 
>>> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115
>>> 
>>> You can always write your own custom Similarity class to override that
>>> calculation.
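
A minimal sketch of why one extra term can move the stored norm so much. It assumes a boost of 1 and assumes the norm is packed into one byte with 3 mantissa bits, which is how Lucene's SmallFloat encoding is recalled here; the bit layout below is reconstructed from memory, not copied from the Lucene source, and the class name is hypothetical. Under those assumptions, 5 terms store as 0.4375 and 6 or 7 terms both store as 0.375, the same two values reported in the original question.

```java
// Sketch of the classic length norm and a one-byte float encoding.
// The bit layout (3 mantissa bits, zero exponent 15) is an assumption
// recalled from Lucene's SmallFloat; the class name is hypothetical.
class FieldNormDemo {

    // Raw norm: inverse square root of the number of terms (boost fixed at 1).
    static float rawNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Pack a float into one byte, keeping only the top 3 mantissa bits.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        int zero = (63 - 15) << 3;
        if (small <= zero) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // zero, negative, or underflow
        }
        if (small >= zero + 0x100) {
            return -1; // overflow: largest representable value
        }
        return (byte) (small - zero);
    }

    // Unpack the byte back into a float; the dropped mantissa bits are gone.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // The norm as it would be seen at query time after the round trip.
    static float storedNorm(int numTerms) {
        return decode(encode(rawNorm(numTerms)));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 8; n++) {
            System.out.printf("%d terms: raw=%.4f stored=%.4f%n",
                    n, rawNorm(n), storedNorm(n));
        }
    }
}
```

The step from 0.375 to 0.4375 is a factor of about 1.17, which would explain the "almost 20%" jump from a single word. With only 3 mantissa bits, neighbouring lengths either collapse to the same value or land a full step apart.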
>>> 
>>> -- Jack Krupansky
>>> 
>>> On Wed, Apr 20, 2016 at 10:43 AM, <jimi.hulleg...@svensktnaringsliv.se>
>>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> In general I think that the fieldNorm factor in the score calculation
>>>> is quite good. But when the text is short I think that the effect is too
>>>> big.
>>>>
>>>> I.e. with two documents that have a short text in the same field, just a
>>>> few extra characters in one of the documents lower the fieldNorm factor
>>>> too much.
>>>> In one test the text in document 1 is 30 characters long and has
>>>> fieldNorm 0.4375, and in document 2 the text is 37 characters long and
>>>> has fieldNorm 0.375. That means that the first document gets almost a
>>>> 20% higher score simply because of the 7 character difference.
>>>> 
>>>> What are my options if I want to change this behavior? Can I set a
>>>> lower character limit, meaning that all fields with a length below
>>>> this limit get the same fieldNorm value?
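
A minimal sketch of the kind of lower limit being asked about, expressed in terms rather than characters since that is what the norm counts. This is not an existing Solr or Lucene option as far as this thread shows; it is a hypothetical calculation that would have to live in a custom Similarity, and the names clampedNorm and minTerms are made up for illustration.

```java
// Hypothetical sketch: treat every field shorter than minTerms as if it had
// exactly minTerms terms, so all short fields share one norm value.
class ClampedNormDemo {

    static float clampedNorm(int numTerms, int minTerms) {
        int effective = Math.max(numTerms, minTerms);
        return (float) (1.0 / Math.sqrt(effective));
    }

    public static void main(String[] args) {
        int minTerms = 5; // assumed threshold, chosen only for the demo
        for (int n = 1; n <= 8; n++) {
            System.out.printf("%d terms: clamped=%.4f classic=%.4f%n",
                    n, clampedNorm(n, minTerms), 1.0 / Math.sqrt(n));
        }
    }
}
```

Fields at or below the threshold all get the same value, and longer fields are still penalized as before.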
>>>> 
>>>> I know I can force fieldNorm to be 1 by setting omitNorms="true" for
>>>> that field, but I would prefer to still have it, just limit its effect
>>>> on short texts.
>>>> 
>>>> Regards
>>>> /Jimi
>>>> 
>>>> 
>>>> 
>>> 
>> 
