Hello Ravi,

Your approach is wrong.
When you use a synonym filter, it indexes all synonyms of each token, so
any synonym will match against that term.
As a result, a facet on that field aggregates over all of those synonym
terms rather than returning just one.

A better approach would be to store the unique name in a separate field and
take a facet on that field.
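
As a rough illustration of that idea, here is a minimal Python sketch that
resolves each raw company name to one canonical name before indexing and
stores it in a second field. The field name `company_canonical`, the lookup
table, and both helper functions are hypothetical and not part of your setup;
the table simply mirrors the rule in your synonym.txt. The counting helper only
simulates what a terms facet on that field would return. It assumes the new
field is mapped as not_analyzed, so each value stays a single term.

```python
from collections import Counter

# Hypothetical lookup table mirroring the synonym.txt rule from the question.
CANONICAL_NAMES = {
    "ibm": "ibm corp ltd",
    "ibm suisse": "ibm corp ltd",
    "ibm corporation": "ibm corp ltd",
    "ibm business": "ibm corp ltd",
}


def build_document(company):
    """Return the document body to index, with a canonical-name field added."""
    # Lowercase and collapse whitespace so "IBM  Suisse" matches "ibm suisse".
    key = " ".join(company.lower().split())
    return {
        "company": company,
        # Unknown names fall back to their own normalized form.
        "company_canonical": CANONICAL_NAMES.get(key, key),
    }


def facet_counts(docs, field):
    """Simulate a terms facet on a not_analyzed field: count whole values."""
    return dict(Counter(doc[field] for doc in docs))


docs = [build_document(name)
        for name in ["ibm", "ibm corporation", "ibm suisse", "ibm business"]]
print(facet_counts(docs, "company_canonical"))
```

Each body returned by `build_document` would then be indexed in place of the
original one, and the terms facet would point at `company_canonical` instead
of `company`.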

Thanks,
Vineeth


On Mon, Jul 21, 2014 at 11:21 PM, <ravi...@gmail.com> wrote:

> Hi All,
>
> I have a requirement in which I need to find distinct company names. I was
> using the "Keyword" tokenizer for that field, and through a terms facet I was
> able to get distinct company names. However, the terms facet treated company
> names like "ibm suisse", "ibm corporation", "ibm" as different companies.
> The online documentation suggested using the "Synonym filter" to solve this.
> My settings are:
>
> curl -XPUT 'http://localhost:9200/dataindex/' -d '{
>   "settings": {
>     "index": {
>       "analysis": {
>         "analyzer": {
>           "customAnalyzer": {
>             "type": "custom",
>             "tokenizer": "whitespace",
>             "filter": [
>               "lowercase","synonym"
>             ]
>           }
>         },
>         "filter": {
>           "synonym" : {
>               "type" : "synonym",
>               "tokenizer": "keyword",
>               "synonyms_path" : "analysis/synonym.txt"
>           }
>         }
>       }
>     }
>   }
> }'
>
> My mapping is:
>
> curl -XPUT 'http://localhost:9200/dataindex/tweet/_mapping' -d '
> {
>     "tweet" : {
>         "properties" : {
>             "company": {
>                  "type": "string",
>                  "analyzer": "customAnalyzer"
>             }
>         }
>     }
> }'
>
> In the synonym.txt file I have : ibm suisse, ibm corporation, ibm
> business, ibm => ibm corp ltd
>
> Indexed data:
> curl -XPUT 'http://localhost:9200/dataindex/tweet/1' -d '{
>     "company" : "ibm"
> }'
> curl -XPUT 'http://localhost:9200/dataindex/tweet/2' -d '{
>     "company" : "ibm corporation"
> }'
> curl -XPUT 'http://localhost:9200/dataindex/tweet/3' -d '{
>     "company" : "ibm suisse"
> }'
> curl -XPUT 'http://localhost:9200/dataindex/tweet/4' -d '{
>     "company" : "ibm business"
> }'
>
> If I run a terms facet:
> {
>   "facets": {
>     "loc_facet": {
>       "terms": {
>         "field": "company"
>       }
>     }
>   }
> }
> I get 3 terms, i.e. {term: ibm corp ltd, count: 3} {term: suisse, count: 1}
> {term: corporation, count: 1}
> I want the facet result to return only one term: ibm corp ltd with
> count=3. This way I will get distinct company names and also map synonym
> names to a single company name.
> Please correct me if I am using the wrong tokenizer or my approach is not
> correct.
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/1ba32926-7015-4b8a-89ae-bf43a2561b71%40googlegroups.com
> <https://groups.google.com/d/msgid/elasticsearch/1ba32926-7015-4b8a-89ae-bf43a2561b71%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
