Hi Ivan,

Hmm... This seems like a viable workaround, but I wanted to know whether there are any other ways to do it. This doesn't seem like a unique problem: I would guess most users expect search results to be sorted by similarity in the following order:

1. Happy
2. Be Happy
3. Be Happy
4. Happy Together

This is live production data. I have 180k documents residing in 5 shards across 5 nodes, with one replica each. Even with 180k documents I still have this similarity ordering issue, coupled with an inconsistency issue because results are fetched from the primary and the replica intermittently. I therefore need to use

/media/_search?pretty=&search_type=dfs_query_then_fetch&preference=_primary

to solve the inconsistency, which leaves only this sorting issue to be solved.

Thanks.

On Mon, Apr 7, 2014 at 7:13 AM, Ivan Brusic <i...@brusic.com> wrote:

> You can index the number of characters in your string into a new field and
> then do a secondary sort on this field.
>
> Are you testing against real data or only against some test set? The
> Lucene scoring model will improve with the addition of more documents. As
> more documents are added, the term frequencies and inverse document
> frequencies start to diverge and contribute more to the scoring. You will
> not have many documents with the same score.
>
> --
> Ivan
>
> On Sun, Apr 6, 2014 at 12:38 AM, <cheeho...@gmail.com> wrote:
>
>> Hi Ivan,
>>
>> Because I wanted the similar results sorted in this way:
>>
>> 1. Be happy
>> 2. Be happy
>> 3. Happy ways
>>
>> Currently they are sorted:
>>
>> 1. Be happy
>> 2. Happy ways
>> 3. Be happy
>>
>> This is because they all return the same score. Any suggestions?
>>
>> Thanks
>>
>> On 6 Apr, 2014, at 4:24 am, Ivan Brusic <i...@brusic.com> wrote:
>>
>> Lucene will indeed, by default, give a higher score to shorter text, but
>> the "shortness" is the number of tokens, not the number of characters. In
>> your last example, each field has two tokens, so the length is the same.
>> The term frequency is also the same for each document ("happy" appears
>> once) and the inverse document frequency is the same (always the case with
>> single term queries), so the score will be exactly the same for every
>> document. Why should the scoring be any different?
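
To make the explanation above concrete, here is a minimal Python sketch of the classic Lucene TF-IDF arithmetic for a single-term query, using the numbers from the explain output quoted further down in this thread. The whitespace tokenizer is only a rough stand-in for the standard analyzer (an assumption), and the helper names are hypothetical. The field norm is taken as-is from the explain output: Lucene stores it in a single byte, so 0.625 is a coarse approximation of 1/sqrt(2) for a two-token field.

```python
import math


def token_count(text: str) -> int:
    # Assumption: whitespace splitting approximates the standard analyzer
    # for these short titles.
    return len(text.lower().split())


def tf(freq: float) -> float:
    # Classic Lucene term-frequency factor: sqrt(freq).
    return math.sqrt(freq)


def idf(doc_freq: int, max_docs: int) -> float:
    # Classic Lucene inverse document frequency: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# Stored fieldNorm reported by the explain output for these documents.
FIELD_NORM = 0.625

for name in ("Be Happy", "Happy Ways"):
    # Both titles have two tokens, so their length norm is identical,
    # even though "Happy Ways" has more characters.
    print(name, "tokens:", token_count(name), "characters:", len(name))

# idf(docFreq=93, maxDocs=1364306) and tf(freq=1.0) come from the explain output.
score = tf(1.0) * idf(93, 1364306) * FIELD_NORM
print("score is approximately", score)
```

Because every factor (tf, idf, fieldNorm) is the same for both titles, the product is the same, which is why the three hits share the score of roughly 6.614289.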
>>
>> Cheers,
>>
>> Ivan
>>
>> On Fri, Apr 4, 2014 at 10:31 PM, chee hoo lum <cheeho...@gmail.com> wrote:
>>
>>> Hi Ivan,
>>>
>>> Since I am not sure how an analyzer with stopwords can be set in the
>>> query itself, I tried to set stopwords="_none_" via the index settings
>>> and the type mapping:
>>>
>>> Index settings:
>>>
>>> {
>>>   "jdbc_dev": {
>>>     "settings": {
>>>       "index.analysis.analyzer.string_lowercase.filter": "lowercase",
>>>       "index.number_of_replicas": "1",
>>>       "index.analysis.analyzer.string_lowercase.tokenizer": "keyword",
>>>       "index.number_of_shards": "5",
>>>       "index.version.created": "900199",
>>>       "index.analysis.analyzer.standard.type": "standard",
>>>       "index.analysis.analyzer.standard.stopwords": "_none_"
>>>     }
>>>   }
>>> }
>>>
>>> Type mapping:
>>>
>>> {
>>>   "media": {
>>>     "properties": {
>>>       "AUDIO": {
>>>         "type": "string"
>>>       },
>>>       ....
>>>       "DISPLAY_NAME": {
>>>         "type": "string",
>>>         "analyzer": "standard"
>>>       },
>>>       ....
>>>     }
>>>   }
>>> }
>>>
>>> Query:
>>>
>>> /media/_search?pretty=&search_type=dfs_query_then_fetch&preference=_primary
>>>
>>> {
>>>   "from": 0,
>>>   "size": 100,
>>>   "explain": true,
>>>   "query": {
>>>     "filtered": {
>>>       "query": {
>>>         "multi_match": {
>>>           "query": "happy",
>>>           "fields": [ "DISPLAY_NAME" ]
>>>         }
>>>       },
>>>       "filter": {
>>>         "query": {
>>>           "bool": {
>>>             "must": {
>>>               "term": {
>>>                 "CHANNEL_ID": "1"
>>>               }
>>>             }
>>>           }
>>>         }
>>>       }
>>>     }
>>>   }
>>> }
>>>
>>> Result:
>>>
>>> 1)
>>> "_shard": 4,
>>> "_node": "xsGVhtTnThaG57_mJdMtxg",
>>> "_index": "jdbc_dev",
>>> "_type": "media",
>>> "_id": "127413",
>>> "_score": 6.614289,
>>> "_source": {
>>>   "DISPLAY_NAME": "Be Happy",
>>> "_explanation": {
>>>   "value": 6.614289,
>>>   "description": "weight(DISPLAY_NAME:happy in 6485) [PerFieldSimilarity], result of:",
>>>   "details": [
>>>     {
>>>       "value": 6.614289,
>>>       "description": "fieldWeight in 6485, product of:",
>>>       "details": [
>>>         {
>>>           "value": 1,
>>>           "description": "tf(freq=1.0), with freq of:",
>>>           "details": [
>>>             {
>>>               "value": 1,
>>>               "description": "termFreq=1.0"
>>>             }
>>>           ]
>>>         },
>>>         {
>>>           "value": 10.582862,
>>>           "description": "idf(docFreq=93, maxDocs=1364306)"
>>>         },
>>>         {
>>>           "value": 0.625,
>>>           "description": "fieldNorm(doc=6485)"
>>>         }
>>>       ]
>>>     }
>>>   ]
>>> }
>>>
>>> 2)
>>> "_shard": 4,
>>> "_node": "UOjX2lxhR6mzfjHHmTm3cQ",
>>> "_index": "jdbc_dev",
>>> "_type": "media",
>>> "_id": "72253",
>>> "_score": 6.614289,
>>> "_source": {
>>>   "DISPLAY_NAME": "Happy Ways",
>>> "_explanation": {
>>>   "value": 6.614289,
>>>   "description": "weight(DISPLAY_NAME:happy in 1102) [PerFieldSimilarity], result of:",
>>>   "details": [
>>>     {
>>>       "value": 6.614289,
>>>       "description": "fieldWeight in 1102, product of:",
>>>       "details": [
>>>         {
>>>           "value": 1,
>>>           "description": "tf(freq=1.0), with freq of:",
>>>           "details": [
>>>             {
>>>               "value": 1,
>>>               "description": "termFreq=1.0"
>>>             }
>>>           ]
>>>         },
>>>         {
>>>           "value": 10.582862,
>>>           "description": "idf(docFreq=93, maxDocs=1364306)"
>>>         },
>>>         {
>>>           "value": 0.625,
>>>           "description": "fieldNorm(doc=1102)"
>>>         }
>>>       ]
>>>     }
>>>   ]
>>> }
>>>
>>> 3)
>>> "_shard": 4,
>>> "_node": "UOjX2lxhR6mzfjHHmTm3cQ",
>>> "_index": "jdbc_dev",
>>> "_type": "media",
>>> "_id": "127413",
>>> "_score": 6.614289,
>>> "_source": {
>>>   "DISPLAY_NAME": "Be Happy",
>>> "_explanation": {
>>>   "value": 6.614289,
>>>   "description": "weight(DISPLAY_NAME:happy in 7277) [PerFieldSimilarity], result of:",
>>>   "details": [
>>>     {
>>>       "value": 6.614289,
>>>       "description": "fieldWeight in 7277, product of:",
>>>       "details": [
>>>         {
>>>           "value": 1,
>>>           "description": "tf(freq=1.0), with freq of:",
>>>           "details": [
>>>             {
>>>               "value": 1,
>>>               "description": "termFreq=1.0"
>>>             }
>>>           ]
>>>         },
>>>         {
>>>           "value": 10.582862,
>>>           "description": "idf(docFreq=93, maxDocs=1364306)"
>>>         },
>>>         {
>>>           "value": 0.625,
>>>           "description": "fieldNorm(doc=7277)"
>>>         }
>>>       ]
>>>     }
>>>   ]
>>> }
>>>
>>> Notice that items 1, 2 and 3 all have the same score, *6.614289*, even
>>> though the DISPLAY_NAME values differ:
>>>
>>> 1) Be Happy
>>> 2) Happy Ways
>>> 3) Be Happy
>>>
>>> It looks like the number of characters (the length) is not taken into
>>> consideration when the score is computed. I remember the documentation
>>> indicating somewhere that, by default, the algorithm should give a
>>> higher score to documents with shorter text in the searched field, but
>>> this doesn't seem to be the case. Also, I didn't manually disable norms.
>>>
>>> Any suggestions on how I could circumvent this issue?
>>>
>> --
>> You received this message because you are subscribed to a topic in the
>> Google Groups "elasticsearch" group.
>> To unsubscribe from this topic, visit
>> https://groups.google.com/d/topic/elasticsearch/RXuuSlkDSyA/unsubscribe.
>> To unsubscribe from this group and all its topics, send an email to
>> elasticsearch+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/CALY%3DcQCUB82B31DijLb9PNdrHmEzXP5JUWUepUp%3DDwSES9t%3DcQ%40mail.gmail.com.
>> For more options, visit https://groups.google.com/d/optout.

--
Regards,
Chee Hoo
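
For readers following the workaround suggested earlier in the thread (index the character count into a separate field and use it as a secondary sort), the tie-breaking logic can be sketched in Python as below. This is only an illustration under stated assumptions: the field name `DISPLAY_NAME_LENGTH` is hypothetical and would have to be populated at index time, the score for the one-token title "Happy" is illustrative (idf times a norm of 1.0), and the `sort` clause is a sketch of a request body, not a tested production query.

```python
# Hypothetical request body: sort by relevance first, then by the
# (assumed, separately indexed) character-length field as a tie-breaker.
search_body = {
    "query": {
        "multi_match": {"query": "happy", "fields": ["DISPLAY_NAME"]}
    },
    "sort": [
        {"_score": {"order": "desc"}},
        {"DISPLAY_NAME_LENGTH": {"order": "asc"}},  # hypothetical field
    ],
}

# Client-side emulation of the same ordering on a few sample hits.
# Scores for the two-token titles match the thread; the "Happy" score
# is illustrative only.
hits = [
    {"DISPLAY_NAME": "Be Happy", "_score": 6.614289},
    {"DISPLAY_NAME": "Happy Together", "_score": 6.614289},
    {"DISPLAY_NAME": "Happy", "_score": 10.582862},
    {"DISPLAY_NAME": "Be Happy", "_score": 6.614289},
]


def rank(results):
    # Higher score first; among equal scores, shorter titles first.
    return sorted(
        results,
        key=lambda h: (-h["_score"], len(h["DISPLAY_NAME"])),
    )


ranked = rank(hits)
for position, hit in enumerate(ranked, start=1):
    print(position, hit["DISPLAY_NAME"], hit["_score"])
```

The secondary key only matters when scores tie, so relevance still dominates and the length field merely makes the order among equally scored titles deterministic.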