Re: Relevancy sorting of result returned

Ivan Brusic Mon, 07 Apr 2014 22:10:23 -0700

I do not think most users would expect the results in that order.
Character length does not indicate relevance in most cases. Why is a
shorter word more relevant? I would say that most would rank "Happy
Together" higher since word proximity is a helpful metric. "Happy" should
rank first due to the length norm, since its field contains a single token.
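
For reference, the arithmetic behind the explain output quoted further down
in this thread can be reproduced by hand. The sketch below assumes Lucene's
classic TF-IDF similarity (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)),
length norm = 1 / sqrt(number of tokens)). The function names are made up for
illustration, and 0.625 is the stored fieldNorm that the explain output reports
for the two-token fields, not something the code derives.

```python
import math


def tf(freq):
    # Classic TF-IDF term-frequency factor: sqrt of the raw frequency.
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    # Classic TF-IDF inverse document frequency.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def raw_length_norm(num_tokens):
    # Length norm before it is stored with reduced precision in the index.
    return 1.0 / math.sqrt(num_tokens)


# Values taken from the explain output in this thread.
IDF = idf(doc_freq=93, max_docs=1364306)
STORED_NORM_TWO_TOKENS = 0.625  # reported fieldNorm for "Be Happy" / "Happy Ways"

# Two-token fields: tf * idf * stored fieldNorm.
score_two_tokens = tf(1) * IDF * STORED_NORM_TWO_TOKENS

# One-token field such as "Happy": the raw norm is 1 / sqrt(1) = 1.0
# (assumption: the stored norm for a single token is still 1.0).
score_one_token = tf(1) * IDF * raw_length_norm(1)

print(round(IDF, 6), round(score_two_tokens, 6), round(score_one_token, 6))
```

Under these assumptions every two-token field gets the same score, which is
the tie reported in the thread, while a one-token field scores higher.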


You can always play around with the function score, but I would rather deal
with non-dynamic metrics at indexing time.
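
A minimal sketch of the index-time approach suggested earlier in the thread:
store the character count in its own field, then use it as a secondary sort
after the score. DISPLAY_NAME_LENGTH and the helper names are hypothetical,
and the rank function only simulates locally what the sort clause asks the
search to do.

```python
def with_name_length(doc, field="DISPLAY_NAME",
                     length_field="DISPLAY_NAME_LENGTH"):
    # Return a copy of the document with the character count of `field`
    # added, to be computed once before the document is indexed.
    enriched = dict(doc)
    enriched[length_field] = len(doc[field])
    return enriched


# Search body: primary sort on relevance, ties broken by the shorter name.
# DISPLAY_NAME_LENGTH is an assumed field name, not part of the original mapping.
search_body = {
    "query": {"match": {"DISPLAY_NAME": "happy"}},
    "sort": [
        {"_score": {"order": "desc"}},
        {"DISPLAY_NAME_LENGTH": {"order": "asc"}},
    ],
}


def rank(hits):
    # Local stand-in for the sort clause above: score descending,
    # then character length ascending.
    return sorted(hits, key=lambda h: (-h["_score"], h["DISPLAY_NAME_LENGTH"]))


names = ["Happy Together", "Be Happy", "Happy Ways"]
hits = [dict(with_name_length({"DISPLAY_NAME": n}), _score=6.614289)
        for n in names]
for hit in rank(hits):
    print(hit["DISPLAY_NAME"], hit["DISPLAY_NAME_LENGTH"])
```

Because the length is computed once per document, nothing has to be
evaluated per query, which is the appeal of a non-dynamic metric.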

-- 
Ivan


On Mon, Apr 7, 2014 at 8:23 AM, chee hoo lum <cheeho...@gmail.com> wrote:

> Hi Ivan,
>
> Hmm... This seems like a viable workaround, but I wanted to know whether
> there are any other ways to do it.
> This doesn't seem like a unique problem, I guess, since most users
> will expect search results to be sorted by similarity in
> the following order:
>
> 1.Happy
> 2.Be Happy
> 3.Be Happy
> 4.Happy Together
>
> This is live data in production. I have 180k documents residing in 5 shards
> across 5 nodes, with one replica each. Even with 180k documents I am still
> seeing this similarity ordering issue, coupled with an inconsistency issue
> because results are fetched from the primary and the replica intermittently.
> Therefore I need to use
> /media/_search?pretty=&search_type=dfs_query_then_fetch&preference=_primary
> to solve the inconsistency, and I am now left with this sorting issue to solve.
>
> Thanks.
>
>
>
> On Mon, Apr 7, 2014 at 7:13 AM, Ivan Brusic <i...@brusic.com> wrote:
>
>> You can index the number of characters in your string into a new field
>> and then do a secondary sort on this field.
>>
>> Are you testing against real data or only against some test set? The
>> Lucene scoring model will improve with the addition of more documents. As
>> more documents are added, the term frequencies and inverse document
>> frequencies start to diverge and contribute more to the scoring. You will
>> not have many documents with the same score.
>>
>> --
>> Ivan
>>
>>
>> On Sun, Apr 6, 2014 at 12:38 AM, <cheeho...@gmail.com> wrote:
>>
>>>
>>> Hi Ivan,
>>>
>>> Because I wanted the similar results sorted in this way:
>>>
>>> 1. Be happy
>>> 2. Be happy
>>> 3. Happy ways
>>>
>>> Currently they are sorted as:
>>> 1. Be happy
>>> 2. Happy ways
>>> 3. Be happy
>>>
>>> This is because they all return the same score. Any suggestions?
>>>
>>> Thanks
>>>
>>> On 6 Apr, 2014, at 4:24 am, Ivan Brusic <i...@brusic.com> wrote:
>>>
>>> Lucene will indeed, by default, give a higher score to shorter text, but
>>> the "shortness" is the number of tokens, not the number of characters. In
>>> your last example, each field has two tokens, so the length is the same.
>>> The term frequency is also the same for each document ("happy" appears
>>> once) and the inverse document frequency is the same (always the case with
>>> single term queries), so the score will be exactly the same for every
>>> document. Why should the scoring be any different?
>>>
>>> Cheers,
>>>
>>> Ivan
>>>
>>>
>>>
>>> On Fri, Apr 4, 2014 at 10:31 PM, chee hoo lum <cheeho...@gmail.com> wrote:
>>>
>>>> Hi Ivan,
>>>>
>>>> Since I am not sure how an analyzer with stopwords can be set in the query
>>>> itself, I tried to set stopwords="_none_" via the
>>>> index settings and its mapping:
>>>>
>>>> *Index settings: *
>>>>
>>>> {
>>>>     "jdbc_dev": {
>>>>         "settings": {
>>>>             "index.analysis.analyzer.string_lowercase.filter": "lowercase",
>>>>             "index.number_of_replicas": "1",
>>>>             "index.analysis.analyzer.string_lowercase.tokenizer": "keyword",
>>>>             "index.number_of_shards": "5",
>>>>             "index.version.created": "900199",
>>>>             "index.analysis.analyzer.standard.type": "standard",
>>>>             "index.analysis.analyzer.standard.stopwords": "_none_"
>>>>         }
>>>>     }
>>>> }
>>>>
>>>>
>>>> *Type Mapping :*
>>>>
>>>> {
>>>>     "media": {
>>>>         "properties": {
>>>>             "AUDIO": {
>>>>                 "type": "string"
>>>>             },
>>>>             ....
>>>>             "DISPLAY_NAME": {
>>>>                 "type": "string",
>>>>                 "analyzer": "standard"
>>>>             },
>>>>             ....
>>>>         }
>>>>     }
>>>> }
>>>>
>>>>
>>>> *Query : *
>>>>
>>>> /media/_search?pretty=&search_type=dfs_query_then_fetch&preference=_primary
>>>>
>>>> {
>>>>   "from": 0,
>>>>   "size": 100,
>>>>   "explain": true,
>>>>   "query": {
>>>>     "filtered": {
>>>>       "query": {
>>>>         "multi_match": {
>>>>           "query": "happy",
>>>>           "fields": [ "DISPLAY_NAME" ]
>>>>         }
>>>>       },
>>>>       "filter": {
>>>>         "query": {
>>>>           "bool": {
>>>>             "must": {
>>>>               "term": {
>>>>                 "CHANNEL_ID": "1"
>>>>               }
>>>>             }
>>>>           }
>>>>         }
>>>>       }
>>>>     }
>>>>   }
>>>> }
>>>>
>>>>
>>>> *Result : *
>>>>
>>>> 1)
>>>> "_shard": 4,
>>>> "_node": "xsGVhtTnThaG57_mJdMtxg",
>>>> "_index": "jdbc_dev",
>>>> "_type": "media",
>>>> "_id": "127413",
>>>> "_score": 6.614289,
>>>> "_source": {
>>>>     "DISPLAY_NAME": "Be Happy",
>>>>     ....
>>>> },
>>>> "_explanation": {
>>>>     "value": 6.614289,
>>>>     "description": "weight(DISPLAY_NAME:happy in 6485) [PerFieldSimilarity], result of:",
>>>>     "details": [
>>>>         {
>>>>             "value": 6.614289,
>>>>             "description": "fieldWeight in 6485, product of:",
>>>>             "details": [
>>>>                 {
>>>>                     "value": 1,
>>>>                     "description": "tf(freq=1.0), with freq of:",
>>>>                     "details": [
>>>>                         {
>>>>                             "value": 1,
>>>>                             "description": "termFreq=1.0"
>>>>                         }
>>>>                     ]
>>>>                 },
>>>>                 {
>>>>                     "value": 10.582862,
>>>>                     "description": "idf(docFreq=93, maxDocs=1364306)"
>>>>                 },
>>>>                 {
>>>>                     "value": 0.625,
>>>>                     "description": "fieldNorm(doc=6485)"
>>>>                 }
>>>>             ]
>>>>         }
>>>>     ]
>>>> }
>>>>
>>>>
>>>> 2)
>>>> "_shard": 4,
>>>> "_node": "UOjX2lxhR6mzfjHHmTm3cQ",
>>>> "_index": "jdbc_dev",
>>>> "_type": "media",
>>>> "_id": "72253",
>>>> "_score": 6.614289,
>>>> "_source": {
>>>>     "DISPLAY_NAME": "Happy Ways",
>>>>     ....
>>>> },
>>>> "_explanation": {
>>>>     "value": 6.614289,
>>>>     "description": "weight(DISPLAY_NAME:happy in 1102) [PerFieldSimilarity], result of:",
>>>>     "details": [
>>>>         {
>>>>             "value": 6.614289,
>>>>             "description": "fieldWeight in 1102, product of:",
>>>>             "details": [
>>>>                 {
>>>>                     "value": 1,
>>>>                     "description": "tf(freq=1.0), with freq of:",
>>>>                     "details": [
>>>>                         {
>>>>                             "value": 1,
>>>>                             "description": "termFreq=1.0"
>>>>                         }
>>>>                     ]
>>>>                 },
>>>>                 {
>>>>                     "value": 10.582862,
>>>>                     "description": "idf(docFreq=93, maxDocs=1364306)"
>>>>                 },
>>>>                 {
>>>>                     "value": 0.625,
>>>>                     "description": "fieldNorm(doc=1102)"
>>>>                 }
>>>>             ]
>>>>         }
>>>>     ]
>>>> }
>>>>
>>>>
>>>> 3)
>>>> "_shard": 4,
>>>> "_node": "UOjX2lxhR6mzfjHHmTm3cQ",
>>>> "_index": "jdbc_dev",
>>>> "_type": "media",
>>>> "_id": "127413",
>>>> "_score": 6.614289,
>>>> "_source": {
>>>>     "DISPLAY_NAME": "Be Happy",
>>>>     ....
>>>> },
>>>> "_explanation": {
>>>>     "value": 6.614289,
>>>>     "description": "weight(DISPLAY_NAME:happy in 7277) [PerFieldSimilarity], result of:",
>>>>     "details": [
>>>>         {
>>>>             "value": 6.614289,
>>>>             "description": "fieldWeight in 7277, product of:",
>>>>             "details": [
>>>>                 {
>>>>                     "value": 1,
>>>>                     "description": "tf(freq=1.0), with freq of:",
>>>>                     "details": [
>>>>                         {
>>>>                             "value": 1,
>>>>                             "description": "termFreq=1.0"
>>>>                         }
>>>>                     ]
>>>>                 },
>>>>                 {
>>>>                     "value": 10.582862,
>>>>                     "description": "idf(docFreq=93, maxDocs=1364306)"
>>>>                 },
>>>>                 {
>>>>                     "value": 0.625,
>>>>                     "description": "fieldNorm(doc=7277)"
>>>>                 }
>>>>             ]
>>>>         }
>>>>     ]
>>>> }
>>>>
>>>>
>>>> Notice that items 1, 2 and 3 all have the same score, *6.614289*, even
>>>> though the DISPLAY_NAME is different:
>>>> 1) Be Happy
>>>> 2) Happy Ways
>>>> 3) Be Happy
>>>>
>>>> It looks like the scoring doesn't take the number of characters
>>>> (length) into consideration. I remember reading somewhere in the
>>>> documentation that, by default, the algorithm should give a higher score
>>>> to documents that have shorter text in the searched field; however, this
>>>> doesn't seem to be the case. Also, I didn't manually disable norms.
>>>>
>>>> Any suggestions on how I could work around this issue?
>>>>
>>>>
>>>>
>>>>  --
>>> You received this message because you are subscribed to a topic in the
>>> Google Groups "elasticsearch" group.
>>> To unsubscribe from this topic, visit
>>> https://groups.google.com/d/topic/elasticsearch/RXuuSlkDSyA/unsubscribe.
>>> To unsubscribe from this group and all its topics, send an email to
>>> elasticsearch+unsubscr...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/elasticsearch/CALY%3DcQCUB82B31DijLb9PNdrHmEzXP5JUWUepUp%3DDwSES9t%3DcQ%40mail.gmail.com.
>>>
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>
>
>
> --
> Regards,
>
> Chee Hoo
>

