I do not think most users would expect the results in that order. Character length is not a relevance signal in most cases. Why would a shorter word be more relevant? I would say that most would rank "Happy Together" higher, since word proximity is a helpful metric. "Happy" should rank first due to the length norm.
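A rough sketch of the length norm argument, assuming the classic Lucene TF-IDF formulas (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), length norm = 1 / sqrt(number of tokens)). The function names are made up for illustration, and the numbers are taken from the explain output quoted further down the thread. It only reproduces the fieldWeight part of the explanation, and it ignores the lossy one-byte encoding of the norm, which is presumably why the explain output shows 0.625 rather than 0.707 for a two-token field.

```python
import math


def tf(freq: float) -> float:
    """Term frequency factor of the classic similarity: sqrt(freq)."""
    return math.sqrt(freq)


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def length_norm(num_tokens: int) -> float:
    """Un-encoded length norm; depends on the token count, not the character count."""
    return 1.0 / math.sqrt(num_tokens)


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """Product shown as 'fieldWeight' in the explain output: tf * idf * fieldNorm."""
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


# Values copied from the explain output in this thread.
score = field_weight(freq=1.0, doc_freq=93, max_docs=1364306, field_norm=0.625)
print(round(score, 4))

# "Be Happy", "Happy Ways" and "Happy Together" all have two tokens, so they tie;
# a one-token title such as "Happy" gets a larger norm and therefore a higher score.
for title in ["Happy", "Be Happy", "Happy Ways", "Happy Together"]:
    print(title, length_norm(len(title.split())))
```

Since every two-token title gets the same norm, the same tf and the same idf, nothing in this formula can separate "Be Happy" from "Happy Ways".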
You can always play around with the function score, but I would rather deal with non-dynamic metrics at indexing time.

--
Ivan

On Mon, Apr 7, 2014 at 8:23 AM, chee hoo lum <cheeho...@gmail.com> wrote:

> Hi Ivan,
>
> Hmm... This seems like a viable workaround, but I wanted to know whether
> there are any other ways to do it.
> This doesn't seem like a unique problem, I guess, as most users
> will expect search results to be sorted by similarity in
> the following order:
>
> 1. Happy
> 2. Be Happy
> 3. Be Happy
> 4. Happy Together
>
> It is live data in production. I have 180k documents residing in 5 shards
> within 5 nodes with one replica each. Even with 180k documents I still
> have this similarity order issue, coupled with an inconsistency issue because
> results are fetched from primary and replica intermittently. Therefore I need to use
> /media/_search?pretty=&search_type=dfs_query_then_fetch&preference=_primary
> to solve the inconsistency, and I am now left with this sorting to solve.
>
> Thanks.
>
> On Mon, Apr 7, 2014 at 7:13 AM, Ivan Brusic <i...@brusic.com> wrote:
>
>> You can index the number of characters in your string into a new field
>> and then do a secondary sort on this field.
>>
>> Are you testing against real data or only against some test set? The
>> Lucene scoring model will improve with the addition of more documents. As
>> more documents are added, the term frequencies and inverse document
>> frequencies start to diverge and contribute more to the scoring. You will
>> not have many documents with the same score.
>>
>> --
>> Ivan
>>
>> On Sun, Apr 6, 2014 at 12:38 AM, <cheeho...@gmail.com> wrote:
>>
>>> Hi Ivan,
>>>
>>> Because I wanted the similar results sorted in this way:
>>>
>>> 1. Be happy
>>> 2. Be happy
>>> 3. Happy ways
>>>
>>> Currently it is sorted:
>>> 1. Be happy
>>> 2. Happy ways
>>> 3. Be happy
>>>
>>> This is because they return the same score. Any suggestions?
>>>
>>> Thanks
>>>
>>> On 6 Apr, 2014, at 4:24 am, Ivan Brusic <i...@brusic.com> wrote:
>>>
>>> Lucene will indeed, by default, give a higher score to shorter text, but
>>> the "shortness" is the number of tokens, not the number of characters. In
>>> your last example, each field has two tokens, so the length is the same.
>>> The term frequency is also the same for each document ("happy" appears
>>> once) and the inverse document frequency is the same (always the case with
>>> single term queries), so the score will be exactly the same for every
>>> document. Why should the scoring be any different?
>>>
>>> Cheers,
>>>
>>> Ivan
>>>
>>> On Fri, Apr 4, 2014 at 10:31 PM, chee hoo lum <cheeho...@gmail.com> wrote:
>>>
>>>> Hi Ivan,
>>>>
>>>> Since I am not sure how an analyzer with stopwords can be set in the query
>>>> itself, I tried to set stopwords="_none_" via the
>>>> index settings and its mapping:
>>>>
>>>> *Index settings:*
>>>>
>>>> {
>>>>   "jdbc_dev": {
>>>>     "settings": {
>>>>       "index.analysis.analyzer.string_lowercase.filter": "lowercase",
>>>>       "index.number_of_replicas": "1",
>>>>       "index.analysis.analyzer.string_lowercase.tokenizer": "keyword",
>>>>       "index.number_of_shards": "5",
>>>>       "index.version.created": "900199",
>>>>       "index.analysis.analyzer.standard.type": "standard",
>>>>       "index.analysis.analyzer.standard.stopwords": "_none_"
>>>>     }
>>>>   }
>>>> }
>>>>
>>>> *Type Mapping:*
>>>>
>>>> {
>>>>   "media": {
>>>>     "properties": {
>>>>       "AUDIO": {
>>>>         "type": "string"
>>>>       },
>>>>       ....
>>>>       "DISPLAY_NAME": {
>>>>         "type": "string",
>>>>         "analyzer": "standard"
>>>>       },
>>>>       ....
>>>>   }
>>>> }
>>>>
>>>> *Query:*
>>>>
>>>> /media/_search?pretty=&search_type=dfs_query_then_fetch&preference=_primary
>>>>
>>>> {
>>>>   "from": 0,
>>>>   "size": 100,
>>>>   "explain": true,
>>>>   "query": {
>>>>     "filtered": {
>>>>       "query": {
>>>>         "multi_match": {
>>>>           "query": "happy",
>>>>           "fields": [ "DISPLAY_NAME" ]
>>>>         }
>>>>       },
>>>>       "filter": {
>>>>         "query": {
>>>>           "bool": {
>>>>             "must": {
>>>>               "term": {
>>>>                 "CHANNEL_ID": "1"
>>>>               }
>>>>             }
>>>>           }
>>>>         }
>>>>       }
>>>>     }
>>>>   }
>>>> }
>>>>
>>>> *Result:*
>>>>
>>>> 1)
>>>> "_shard": 4,
>>>> "_node": "xsGVhtTnThaG57_mJdMtxg",
>>>> "_index": "jdbc_dev",
>>>> "_type": "media",
>>>> "_id": "127413",
>>>> "_score": 6.614289,
>>>> "_source": {
>>>>   "DISPLAY_NAME": "Be Happy",
>>>>   ,
>>>> "_explanation": {
>>>>   "value": 6.614289,
>>>>   "description": "weight(DISPLAY_NAME:happy in 6485) [PerFieldSimilarity], result of:",
>>>>   "details": [
>>>>     {
>>>>       "value": 6.614289,
>>>>       "description": "fieldWeight in 6485, product of:",
>>>>       "details": [
>>>>         {
>>>>           "value": 1,
>>>>           "description": "tf(freq=1.0), with freq of:",
>>>>           "details": [
>>>>             {
>>>>               "value": 1,
>>>>               "description": "termFreq=1.0"
>>>>             }
>>>>           ]
>>>>         },
>>>>         {
>>>>           "value": 10.582862,
>>>>           "description": "idf(docFreq=93, maxDocs=1364306)"
>>>>         },
>>>>         {
>>>>           "value": 0.625,
>>>>           "description": "fieldNorm(doc=6485)"
>>>>         }
>>>>       ]
>>>>     }
>>>>   ]
>>>> }
>>>>
>>>> 2)
>>>> "_shard": 4,
>>>> "_node": "UOjX2lxhR6mzfjHHmTm3cQ",
>>>> "_index": "jdbc_dev",
>>>> "_type": "media",
>>>> "_id": "72253",
>>>> "_score": 6.614289,
>>>> "_source": {
>>>>   "DISPLAY_NAME": "Happy Ways",
>>>> "_explanation": {
>>>>   "value": 6.614289,
>>>>   "description": "weight(DISPLAY_NAME:happy in 1102) [PerFieldSimilarity], result of:",
>>>>   "details": [
>>>>     {
>>>>       "value": 6.614289,
>>>>       "description": "fieldWeight in 1102, product of:",
>>>>       "details": [
>>>>         {
>>>>           "value": 1,
>>>>           "description":
>>>>             "tf(freq=1.0), with freq of:",
>>>>           "details": [
>>>>             {
>>>>               "value": 1,
>>>>               "description": "termFreq=1.0"
>>>>             }
>>>>           ]
>>>>         },
>>>>         {
>>>>           "value": 10.582862,
>>>>           "description": "idf(docFreq=93, maxDocs=1364306)"
>>>>         },
>>>>         {
>>>>           "value": 0.625,
>>>>           "description": "fieldNorm(doc=1102)"
>>>>         }
>>>>       ]
>>>>     }
>>>>   ]
>>>> }
>>>>
>>>> 3)
>>>> "_shard": 4,
>>>> "_node": "UOjX2lxhR6mzfjHHmTm3cQ",
>>>> "_index": "jdbc_dev",
>>>> "_type": "media",
>>>> "_id": "127413",
>>>> "_score": 6.614289,
>>>> "_source": {
>>>>   "DISPLAY_NAME": "Be Happy",
>>>> "_explanation": {
>>>>   "value": 6.614289,
>>>>   "description": "weight(DISPLAY_NAME:happy in 7277) [PerFieldSimilarity], result of:",
>>>>   "details": [
>>>>     {
>>>>       "value": 6.614289,
>>>>       "description": "fieldWeight in 7277, product of:",
>>>>       "details": [
>>>>         {
>>>>           "value": 1,
>>>>           "description": "tf(freq=1.0), with freq of:",
>>>>           "details": [
>>>>             {
>>>>               "value": 1,
>>>>               "description": "termFreq=1.0"
>>>>             }
>>>>           ]
>>>>         },
>>>>         {
>>>>           "value": 10.582862,
>>>>           "description": "idf(docFreq=93, maxDocs=1364306)"
>>>>         },
>>>>         {
>>>>           "value": 0.625,
>>>>           "description": "fieldNorm(doc=7277)"
>>>>         }
>>>>       ]
>>>>     }
>>>>   ]
>>>> }
>>>>
>>>> Notice that for items 1, 2 and 3 the scores are the same (*6.614289*) even
>>>> though the DISPLAY_NAME is different:
>>>> 1) Be Happy
>>>> 2) Happy Ways
>>>> 3) Be Happy
>>>>
>>>> It looks like it doesn't take the number of characters/length into
>>>> consideration when it computes the score. I remember that somewhere the
>>>> documentation indicates that by default the algorithm should give a higher
>>>> score to documents that have shorter text in the searched field; however,
>>>> this doesn't seem to be the case. Also, I didn't manually disable the norms.
>>>>
>>>> Any suggestions for how I could circumvent this issue?
>>>>
>>>> --
>>> You received this message because you are subscribed to a topic in the
>>> Google Groups "elasticsearch" group.
>>> To unsubscribe from this topic, visit
>>> https://groups.google.com/d/topic/elasticsearch/RXuuSlkDSyA/unsubscribe.
>>> To unsubscribe from this group and all its topics, send an email to
>>> elasticsearch+unsubscr...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/elasticsearch/CALY%3DcQCUB82B31DijLb9PNdrHmEzXP5JUWUepUp%3DDwSES9t%3DcQ%40mail.gmail.com
>>>
>>> For more options, visit https://groups.google.com/d/optout.
> --
> Regards,
>
> Chee Hoo
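The workaround Ivan suggests earlier in the thread (index the number of characters into a new field, then do a secondary sort on it) might look roughly like the sketch below. The field name DISPLAY_NAME_LENGTH and the helper functions are hypothetical, the request body is simplified to a plain match query without the CHANNEL_ID filter, and the score for "Happy" is an assumed value for a one-token field. The ranking is simulated locally instead of being sent to a cluster.

```python
def with_length(doc: dict) -> dict:
    """Return a copy of the document with the character count of DISPLAY_NAME added.

    DISPLAY_NAME_LENGTH is a hypothetical field name; it would be computed once,
    at indexing time, before the document is sent to Elasticsearch.
    """
    enriched = dict(doc)
    enriched["DISPLAY_NAME_LENGTH"] = len(doc["DISPLAY_NAME"])
    return enriched


# Request body: sort by relevance first, then break ties with the shorter name.
search_body = {
    "query": {"match": {"DISPLAY_NAME": "happy"}},
    "sort": [
        {"_score": {"order": "desc"}},
        {"DISPLAY_NAME_LENGTH": {"order": "asc"}},
    ],
}


def rank(hits: list) -> list:
    """Local stand-in for the sort above: score descending, then length ascending."""
    return sorted(
        hits,
        key=lambda hit: (-hit["_score"], hit["_source"]["DISPLAY_NAME_LENGTH"]),
    )


# Two-token titles share the score from the thread; the score for "Happy" is assumed.
hits = [
    {"_score": 6.614289, "_source": with_length({"DISPLAY_NAME": "Be Happy"})},
    {"_score": 6.614289, "_source": with_length({"DISPLAY_NAME": "Happy Together"})},
    {"_score": 6.614289, "_source": with_length({"DISPLAY_NAME": "Be Happy"})},
    {"_score": 10.582862, "_source": with_length({"DISPLAY_NAME": "Happy"})},
]

ordered_names = [hit["_source"]["DISPLAY_NAME"] for hit in rank(hits)]
print(ordered_names)
```

Because the length only acts as a tie-breaker, documents with different scores keep their relevance order, which matches the point about preferring a fixed, index-time metric over reshaping the score itself.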