Re: Relevancy sorting of result returned

chee hoo lum Mon, 07 Apr 2014 08:24:12 -0700

Hi Ivan,

Hmm... This seems like a viable workaround, but I wanted to know whether
there are any other ways to do it.
I guess this isn't a unique problem, since most users will expect search
results to be sorted by similarity in the following order:


1. Happy
2. Be Happy
3. Be Happy
4. Happy Together
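
To see why only part of this order falls out of the score itself, here is a
minimal Python sketch (an illustration, not Elasticsearch code). It assumes
the classic Lucene length norm of 1/sqrt(number of tokens), uses whitespace
splitting as a stand-in for the standard analyzer, and ignores the lossy
encoding Lucene applies when it stores norms. The function names length_norm
and rank are hypothetical.

```python
import math

def length_norm(text):
    """Raw length norm under the assumed formula 1/sqrt(token count)."""
    return 1.0 / math.sqrt(len(text.split()))

def rank(titles):
    """Order by norm (higher first), then by character count (shorter first)."""
    return sorted(titles, key=lambda t: (-length_norm(t), len(t)))

titles = ["Happy Together", "Be Happy", "Happy", "Be Happy"]
ranked = rank(titles)
for position, title in enumerate(ranked, start=1):
    print(position, title, round(length_norm(title), 4))
```

Under these assumptions "Happy" ranks first on the norm alone, while the
two-token titles share one norm (the explain output quoted further down
reports the same fieldNorm for "Be Happy" and "Happy Ways") and are separated
only by the character-count tie-break.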

This is live data in production. I have 180k documents residing in 5 shards
across 5 nodes, with one replica each. Even with 180k documents I still
have this similarity ordering issue, coupled with an inconsistency issue
because results are fetched from primaries and replicas intermittently.
I therefore need to use
/media/_search?pretty=&search_type=dfs_query_then_fetch&preference=_primary
to solve the inconsistency, and am now left with this sorting problem to solve.
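
A minimal sketch of the secondary-sort workaround from the quoted message
below, written in Python as plain dictionaries. The field name
DISPLAY_NAME_LENGTH and the helper names are hypothetical; the field would
have to be added to every document at index time. The query mirrors the one
quoted further down, with the filter simplified to a plain term filter, and
it has not been run against this cluster.

```python
import json

LENGTH_FIELD = "DISPLAY_NAME_LENGTH"  # hypothetical field, not in the current mapping

def with_length(doc):
    """Return a copy of doc with the character count of DISPLAY_NAME added."""
    enriched = dict(doc)
    enriched[LENGTH_FIELD] = len(doc["DISPLAY_NAME"])
    return enriched

def build_search_body(text, channel_id):
    """Search body sorted by score first, then by shortest DISPLAY_NAME."""
    return {
        "from": 0,
        "size": 100,
        "query": {
            "filtered": {
                "query": {"multi_match": {"query": text, "fields": ["DISPLAY_NAME"]}},
                "filter": {"term": {"CHANNEL_ID": channel_id}},
            }
        },
        # "_score" sorts descending by default; the length field breaks ties.
        "sort": ["_score", {LENGTH_FIELD: {"order": "asc"}}],
    }

SEARCH_PATH = "/media/_search?search_type=dfs_query_then_fetch&preference=_primary"

doc = with_length({"DISPLAY_NAME": "Be Happy", "CHANNEL_ID": "1"})
body = build_search_body("happy", "1")
print(SEARCH_PATH)
print(json.dumps(body, indent=2))
```

Existing documents would need to be reindexed so that they carry the new
field before the sort has any effect.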

Thanks.



On Mon, Apr 7, 2014 at 7:13 AM, Ivan Brusic <i...@brusic.com> wrote:

> You can index the number of characters in your string into a new field and
> then do a secondary sort on this field.
>
> Are you testing against real data or only against some test set? The
> Lucene scoring model will improve with the addition of more documents. As
> more documents are added, the term frequencies and inverse document
> frequencies start to diverge and contribute more to the scoring. You will
> not have many documents with the same score.
>
> --
> Ivan
>
>
> On Sun, Apr 6, 2014 at 12:38 AM, <cheeho...@gmail.com> wrote:
>
>>
>> Hi Ivan,
>>
>> Because I wanted the similar results sorted in this way:
>>
>> 1. Be happy
>> 2. Be happy
>> 3. Happy ways
>>
>> Currently they are sorted as:
>> 1. Be happy
>> 2. Happy ways
>> 3. Be happy
>>
>> This is because they all get the same score. Any suggestions?
>>
>> Thanks
>>
>> On 6 Apr, 2014, at 4:24 am, Ivan Brusic <i...@brusic.com> wrote:
>>
>> Lucene will indeed, by default, give a higher score to shorter text, but
>> the "shortness" is the number of tokens, not the number of characters. In
>> your last example, each field has two tokens, so the length is the same.
>> The term frequency is also the same for each document ("happy" appears
>> once) and the inverse document frequency is the same (always the case with
>> single term queries), so the score will be exactly the same for every
>> document. Why should the scoring be any different?
>>
>> Cheers,
>>
>> Ivan
>>
>>
>>
>> On Fri, Apr 4, 2014 at 10:31 PM, chee hoo lum <cheeho...@gmail.com> wrote:
>>
>>> Hi Ivan,
>>>
>>> Since I am not sure how an analyzer with stopwords can be set in the
>>> query itself, I tried to set stopwords="_none_" via the index settings
>>> and its mapping:
>>>
>>> *Index settings:*
>>>
>>> {
>>>     "jdbc_dev": {
>>>         "settings": {
>>>             "index.analysis.analyzer.string_lowercase.filter": "lowercase",
>>>             "index.number_of_replicas": "1",
>>>             "index.analysis.analyzer.string_lowercase.tokenizer": "keyword",
>>>             "index.number_of_shards": "5",
>>>             "index.version.created": "900199",
>>>             "index.analysis.analyzer.standard.type": "standard",
>>>             "index.analysis.analyzer.standard.stopwords": "_none_"
>>>         }
>>>     }
>>> }
>>>
>>>
>>> *Type Mapping:*
>>>
>>> {
>>>     "media": {
>>>         "properties": {
>>>             "AUDIO": {
>>>                 "type": "string"
>>>             },
>>>             ....
>>>             "DISPLAY_NAME": {
>>>                 "type": "string",
>>>                 "analyzer": "standard"
>>>             },
>>>             ....
>>>         }
>>>     }
>>> }
>>>
>>>
>>> *Query:*
>>>
>>> /media/_search?pretty=&search_type=dfs_query_then_fetch&preference=_primary
>>>
>>> {
>>>   "from": 0,
>>>   "size": 100,
>>>   "explain": true,
>>>   "query": {
>>>     "filtered": {
>>>       "query": {
>>>         "multi_match": {
>>>           "query": "happy",
>>>           "fields": ["DISPLAY_NAME"]
>>>         }
>>>       },
>>>       "filter": {
>>>         "query": {
>>>           "bool": {
>>>             "must": {
>>>               "term": {
>>>                 "CHANNEL_ID": "1"
>>>               }
>>>             }
>>>           }
>>>         }
>>>       }
>>>     }
>>>   }
>>> }
>>>
>>>
>>> *Result:*
>>>
>>> 1)
>>> "_shard": 4,
>>> "_node": "xsGVhtTnThaG57_mJdMtxg",
>>> "_index": "jdbc_dev",
>>> "_type": "media",
>>> "_id": "127413",
>>> "_score": 6.614289,
>>> "_source": {
>>>     "DISPLAY_NAME": "Be Happy",
>>>     ....
>>> },
>>> "_explanation": {
>>>     "value": 6.614289,
>>>     "description": "weight(DISPLAY_NAME:happy in 6485) [PerFieldSimilarity], result of:",
>>>     "details": [
>>>         {
>>>             "value": 6.614289,
>>>             "description": "fieldWeight in 6485, product of:",
>>>             "details": [
>>>                 {
>>>                     "value": 1,
>>>                     "description": "tf(freq=1.0), with freq of:",
>>>                     "details": [
>>>                         {
>>>                             "value": 1,
>>>                             "description": "termFreq=1.0"
>>>                         }
>>>                     ]
>>>                 },
>>>                 {
>>>                     "value": 10.582862,
>>>                     "description": "idf(docFreq=93, maxDocs=1364306)"
>>>                 },
>>>                 {
>>>                     "value": 0.625,
>>>                     "description": "fieldNorm(doc=6485)"
>>>                 }
>>>             ]
>>>         }
>>>     ]
>>> }
>>>
>>>
>>> 2)
>>> "_shard": 4,
>>> "_node": "UOjX2lxhR6mzfjHHmTm3cQ",
>>> "_index": "jdbc_dev",
>>> "_type": "media",
>>> "_id": "72253",
>>> "_score": 6.614289,
>>> "_source": {
>>>     "DISPLAY_NAME": "Happy Ways",
>>>     ....
>>> },
>>> "_explanation": {
>>>     "value": 6.614289,
>>>     "description": "weight(DISPLAY_NAME:happy in 1102) [PerFieldSimilarity], result of:",
>>>     "details": [
>>>         {
>>>             "value": 6.614289,
>>>             "description": "fieldWeight in 1102, product of:",
>>>             "details": [
>>>                 {
>>>                     "value": 1,
>>>                     "description": "tf(freq=1.0), with freq of:",
>>>                     "details": [
>>>                         {
>>>                             "value": 1,
>>>                             "description": "termFreq=1.0"
>>>                         }
>>>                     ]
>>>                 },
>>>                 {
>>>                     "value": 10.582862,
>>>                     "description": "idf(docFreq=93, maxDocs=1364306)"
>>>                 },
>>>                 {
>>>                     "value": 0.625,
>>>                     "description": "fieldNorm(doc=1102)"
>>>                 }
>>>             ]
>>>         }
>>>     ]
>>> }
>>>
>>>
>>> 3)
>>> "_shard": 4,
>>> "_node": "UOjX2lxhR6mzfjHHmTm3cQ",
>>> "_index": "jdbc_dev",
>>> "_type": "media",
>>> "_id": "127413",
>>> "_score": 6.614289,
>>> "_source": {
>>>     "DISPLAY_NAME": "Be Happy",
>>>     ....
>>> },
>>> "_explanation": {
>>>     "value": 6.614289,
>>>     "description": "weight(DISPLAY_NAME:happy in 7277) [PerFieldSimilarity], result of:",
>>>     "details": [
>>>         {
>>>             "value": 6.614289,
>>>             "description": "fieldWeight in 7277, product of:",
>>>             "details": [
>>>                 {
>>>                     "value": 1,
>>>                     "description": "tf(freq=1.0), with freq of:",
>>>                     "details": [
>>>                         {
>>>                             "value": 1,
>>>                             "description": "termFreq=1.0"
>>>                         }
>>>                     ]
>>>                 },
>>>                 {
>>>                     "value": 10.582862,
>>>                     "description": "idf(docFreq=93, maxDocs=1364306)"
>>>                 },
>>>                 {
>>>                     "value": 0.625,
>>>                     "description": "fieldNorm(doc=7277)"
>>>                 }
>>>             ]
>>>         }
>>>     ]
>>> }
>>>
>>>
>>> Notice that for items 1, 2 and 3 the scores are the same (*6.614289*)
>>> even though the DISPLAY_NAME is different:
>>> 1) Be Happy
>>> 2) Happy Ways
>>> 3) Be Happy
>>>
>>> It looks like the number of characters (the length) is not taken into
>>> consideration when the score is computed. I remember reading somewhere in
>>> the documentation that by default the algorithm should give a higher score
>>> to documents that have shorter text in the searched field; however, this
>>> doesn't seem to be the case. Also, I didn't manually disable norms.
>>>
>>> Any suggestions on how I could circumvent this issue?
>>>
>>>
>>>
>>>  --
>> You received this message because you are subscribed to a topic in the
>> Google Groups "elasticsearch" group.
>> To unsubscribe from this topic, visit
>> https://groups.google.com/d/topic/elasticsearch/RXuuSlkDSyA/unsubscribe.
>> To unsubscribe from this group and all its topics, send an email to
>> elasticsearch+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/CALY%3DcQCUB82B31DijLb9PNdrHmEzXP5JUWUepUp%3DDwSES9t%3DcQ%40mail.gmail.com.
>>
>> For more options, visit https://groups.google.com/d/optout.



-- 
Regards,

Chee Hoo

