Re: Synonym filter results in term facet

2014-07-21 Thread vineeth mohan
Hello Ravi ,

Your approach is wrong.
When you use a synonym filter, it indexes all synonyms of a token, so any
synonym will match against that term.
A facet on that field therefore aggregates over every indexed term rather
than just one name.
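
To make the extra facet terms concrete, here is a toy Python model of the analysis chain from the quoted settings (whitespace tokenizer, lowercase, then the synonym rule). It is only a sketch, not Elasticsearch code: it assumes the "keyword" tokenizer on the filter keeps each entry of the rule as one token, so only the bare token "ibm" can match after the whitespace split. The function names are made up for illustration.

```python
# Toy model of the analysis chain in the quoted settings; not Elasticsearch code.
SYNONYM_RULE = "ibm suisse, ibm corporation, ibm business, ibm => ibm corp ltd"


def parse_rule(rule):
    """Parse one 'a, b => c' rule, keeping every entry as a single token.

    Assumption: this mirrors what the keyword tokenizer on the synonym
    filter does when it reads the rule file.
    """
    left, right = rule.split("=>")
    target = right.strip()
    return {name.strip(): target for name in left.split(",")}


def analyze(value, synonyms):
    """Lowercase and split on whitespace, then replace tokens that match a rule."""
    tokens = value.lower().split()
    return [synonyms.get(token, token) for token in tokens]


synonyms = parse_rule(SYNONYM_RULE)
for value in ["ibm", "ibm corporation", "ibm suisse", "ibm business"]:
    # Each printed list is the set of terms a facet on this field would count.
    print(value, "->", analyze(value, synonyms))
```

In this model the two-word entries of the rule never match, because the field value has already been split into single words, and the leftover words ("suisse", "corporation") stay behind as terms of their own.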

A better approach would be to store the unique name in a separate field and
run the facet on that field.
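
A minimal Python sketch of that idea, assuming the name is normalized on the client before indexing. The field name "company_canonical" and the lookup table are hypothetical; the variants are taken from the synonym.txt line in the original mail, and the Counter only stands in for a terms facet.

```python
from collections import Counter

# Hypothetical lookup table built from the variants in the thread's synonym.txt.
CANONICAL = {
    "ibm": "ibm corp ltd",
    "ibm suisse": "ibm corp ltd",
    "ibm corporation": "ibm corp ltd",
    "ibm business": "ibm corp ltd",
}


def with_canonical_name(doc):
    """Return a copy of doc with a 'company_canonical' field (hypothetical name)."""
    raw = doc["company"].strip().lower()
    # Unknown names fall back to their own lowercased form.
    return {**doc, "company_canonical": CANONICAL.get(raw, raw)}


docs = [{"company": name}
        for name in ["ibm", "ibm corporation", "ibm suisse", "ibm business"]]
enriched = [with_canonical_name(d) for d in docs]

# Stand-in for a terms facet on the canonical field: one bucket per whole name.
facet = Counter(d["company_canonical"] for d in enriched)
print(facet)
```

For the facet to see the whole name as one term, the extra field would have to be indexed without analysis (for a "string" field of that era, "index": "not_analyzed" in the mapping); the original "company" field can keep its analyzer for searching.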

Thanks
   Vineeth


On Mon, Jul 21, 2014 at 11:21 PM,  wrote:

> Hi All,
>
> I have a requirement to find distinct company names. I was using the
> "keyword" tokenizer for that field, and with a terms facet I was able to
> get distinct company names. However, the terms facet treated company names
> like "ibm suisse", "ibm corporation" and "ibm" as different companies.
> The online documentation suggested using a synonym filter to solve this.
> My settings are:
>
> curl -XPUT 'http://localhost:9200/dataindex/' -d '{
>   "settings": {
>     "index": {
>       "analysis": {
>         "analyzer": {
>           "customAnalyzer": {
>             "type": "custom",
>             "tokenizer": "whitespace",
>             "filter": ["lowercase", "synonym"]
>           }
>         },
>         "filter": {
>           "synonym": {
>             "type": "synonym",
>             "tokenizer": "keyword",
>             "synonyms_path": "analysis/synonym.txt"
>           }
>         }
>       }
>     }
>   }
> }'
>
> My mapping is:
>
> curl -XPUT 'http://localhost:9200/dataindex/tweet/_mapping' -d '
> {
>   "tweet": {
>     "properties": {
>       "company": {
>         "type": "string",
>         "analyzer": "customAnalyzer"
>       }
>     }
>   }
> }'
>
> In the synonym.txt file I have: ibm suisse, ibm corporation, ibm
> business, ibm => ibm corp ltd
>
> Indexed data:
> curl -XPUT 'http://localhost:9200/dataindex/tweet/1' -d '{
> "company" : "ibm"
> }'
> curl -XPUT 'http://localhost:9200/dataindex/tweet/2' -d '{
> "company" : "ibm corporation"
> }'
> curl -XPUT 'http://localhost:9200/dataindex/tweet/3' -d '{
> "company" : "ibm suisse"
> }'
> curl -XPUT 'http://localhost:9200/dataindex/tweet/4' -d '{
> "company" : "ibm business"
> }'
>
> If I run a terms facet:
> {
>   "facets": {
>     "loc_facet": {
>       "terms": {
>         "field": "company"
>       }
>     }
>   }
> }
> I get 3 terms, i.e. {term: ibm corp ltd, count: 3} {term: suisse, count: 1}
> {term: corporation, count: 1}.
> I want the facet result to return only one term: ibm corp ltd with
> count=3. This way I will get distinct company names and also map synonym
> names to a single company name.
> Please correct me if I am using the wrong tokenizer or my approach is not
> correct.
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/1ba32926-7015-4b8a-89ae-bf43a2561b71%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>



Synonym filter results in term facet

2014-07-21 Thread ravi063
Hi All,

I have a requirement to find distinct company names. I was using the
"keyword" tokenizer for that field, and with a terms facet I was able to
get distinct company names. However, the terms facet treated company names
like "ibm suisse", "ibm corporation" and "ibm" as different companies.
The online documentation suggested using a synonym filter to solve this. My
settings are:

curl -XPUT 'http://localhost:9200/dataindex/' -d '{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "customAnalyzer": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": ["lowercase", "synonym"]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "tokenizer": "keyword",
            "synonyms_path": "analysis/synonym.txt"
          }
        }
      }
    }
  }
}'

My mapping is:

curl -XPUT 'http://localhost:9200/dataindex/tweet/_mapping' -d '
{
  "tweet": {
    "properties": {
      "company": {
        "type": "string",
        "analyzer": "customAnalyzer"
      }
    }
  }
}'

In the synonym.txt file I have: ibm suisse, ibm corporation, ibm business,
ibm => ibm corp ltd

Indexed data:
curl -XPUT 'http://localhost:9200/dataindex/tweet/1' -d '{
"company" : "ibm"
}'
curl -XPUT 'http://localhost:9200/dataindex/tweet/2' -d '{
"company" : "ibm corporation"
}'
curl -XPUT 'http://localhost:9200/dataindex/tweet/3' -d '{
"company" : "ibm suisse"
}'
curl -XPUT 'http://localhost:9200/dataindex/tweet/4' -d '{
"company" : "ibm business"
}'

If I run a terms facet:
{
  "facets": {
    "loc_facet": {
      "terms": {
        "field": "company"
      }
    }
  }
}
I get 3 terms, i.e. {term: ibm corp ltd, count: 3} {term: suisse, count: 1}
{term: corporation, count: 1}.
I want the facet result to return only one term: ibm corp ltd with count=3.
This way I will get distinct company names and also map synonym names to a
single company name.
Please correct me if I am using the wrong tokenizer or my approach is not
correct.
