Re: Using a char_filter in combination with a lowercase filter

Ivan Brusic Mon, 18 Aug 2014 21:37:49 -0700

Char filters are applied to the raw text before it is tokenized, so they
always run before the "normal" (token) filters, which is why they are a
separate class of filter. The lowercase filter is a token filter, so it
cannot run before your char_mapper. With Lucene, the order is:


char filters -> tokenizer -> filters
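
As a rough illustration of that order, here is a small Python sketch. It is
not Elasticsearch or Lucene code: char_filter, tokenizer, lowercase_filter
and analyze are made-up stand-ins for the three stages, and the tokenizer is
a simple regex, not the real standard tokenizer. It only shows why a
case-sensitive "y => ij" mapping misses an uppercase Y that is lowercased
later in the chain.

```python
import re

# Hypothetical stand-ins for the three analysis stages (not Elasticsearch APIs).

def char_filter(text, mappings):
    # Stage 1: rewrite raw characters before tokenizing. Matching is
    # case-sensitive, as with the "mapping" char filter in the quoted config.
    for src, dst in mappings.items():
        text = text.replace(src, dst)
    return text

def tokenizer(text):
    # Stage 2: split the already-mapped text into tokens (assumption: a
    # simple word regex in place of the standard tokenizer).
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    # Stage 3: token filters run last, so lowercasing happens after mapping.
    return [t.lower() for t in tokens]

def analyze(text, mappings):
    return lowercase_filter(tokenizer(char_filter(text, mappings)))

only_lower = {"y": "ij"}             # the original mapping
with_upper = {"y": "ij", "Y": "ij"}  # the workaround from the quoted message

print(analyze("Yerseke", only_lower))  # ['yerseke']  (Y was never mapped)
print(analyze("yerseke", only_lower))  # ['ijerseke'] (does not match the above)
print(analyze("Yerseke", with_upper))  # ['ijerseke']
print(analyze("yerseke", with_upper))  # ['ijerseke']
```

With only the lowercase mapping, the indexed "Yerseke" and the query
"yerseke" end up as different terms, which matches the results you describe.
Mapping both cases in the char filter makes them agree.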

Have you looked into the ICU analyzer?
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-icu-plugin.html

I have no idea how well it works with Dutch.

Cheers,

Ivan


On Mon, Aug 18, 2014 at 2:14 AM, Matthias Hogerheijde <
matthias.hogerhei...@goabout.com> wrote:

> Hi,
>
> We're using Elasticsearch with an analyzer that maps the `y` character to
> `ij` (a *char_filter* named "char_mapper"), since in Dutch these two are
> "somewhat" interchangeable. We're also using a *lowercase filter*.
>
> This is the configuration:
>
> {
>   "analysis": {
>     "analyzer": {
>       "index": {
>         "type": "custom",
>         "tokenizer": "standard",
>         "filter": [
>           "lowercase",
>           "synonym_twoway",
>           "standard",
>           "asciifolding"
>         ],
>         "char_filter": [
>           "char_mapper"
>         ]
>       },
>       "index_prefix": {
>         "type": "custom",
>         "tokenizer": "standard",
>         "filter": [
>           "lowercase",
>           "synonym_twoway",
>           "standard",
>           "asciifolding",
>           "prefixes"
>         ],
>         "char_filter": [
>           "char_mapper"
>         ]
>       },
>       "search": {
>         "alias": [
>           "default"
>         ],
>         "type": "custom",
>         "tokenizer": "standard",
>         "filter": [
>           "lowercase",
>           "synonym",
>           "synonym_twoway",
>           "standard",
>           "asciifolding"
>         ],
>         "char_filter": [
>           "char_mapper"
>         ]
>       },
>       "postal_code": {
>         "tokenizer": "keyword",
>         "filter": [
>           "lowercase"
>         ]
>       }
>     },
>     "tokenizer": {
>       "standard": {
>         "stopwords": [
>
>
>         ]
>       }
>     },
>     "filter": {
>       "synonym": {
>         "type": "synonym",
>         "synonyms": [
>           "st => sint",
>           "jp => jan pieterszoon",
>           "mh => maarten harpertszoon"
>         ]
>       },
>       "synonym_twoway": {
>         "type": "synonym",
>         "synonyms": [
>           "den haag, s gravenhage",
>           "den bosch, s hertogenbosch"
>         ]
>       },
>       "prefixes": {
>         "type": "edgeNGram",
>         "side": "front",
>         "min_gram": 1,
>         "max_gram": 30
>       }
>     },
>     "char_filter": {
>       "char_mapper": {
>         "type": "mapping",
>         "mappings": [
>           "y => ij"
>         ]
>       }
>     }
>   }
> }
>
> When indexing cities, we're using this mapping:
>
> {
>   "properties": {
>     "city": {
>       "type": "multi_field",
>       "fields": {
>         "city": {
>           "type": "string"
>         },
>         "prefix": {
>           "type": "string",
>           "boost": 0.5,
>           "index_analyzer": "index_prefix"
>         }
>       }
>     },
>     "province_code": {
>       "type": "string"
>     },
>     "unique_name": {
>       "type": "boolean"
>     },
>     "point": {
>       "type": "geo_point"
>     },
>     "search_terms": {
>       "type": "multi_field",
>       "fields": {
>         "search_terms": {
>           "type": "string"
>         },
>         "prefix": {
>           "boost": 0.5,
>           "index_analyzer": "index_prefix",
>           "type": "string"
>         }
>       }
>     }
>   },
>   "search_analyzer": "search",
>   "index_analyzer": "index"
> }
>
> When we index all the (Dutch) cities from our data source, there are
> cities starting with both `IJ` and `Y` (for example, these city names
> exist: *IJssel*, *IJsselstein*, *Yerseke* and *Ysselsteyn*). It seems
> that these characters are not lowercased before the char_mapping is
> applied.
>
> Querying the index results in:
>
> /top/city/_search?q=ijsselstein -> works, returns the document for
> IJsselstein
> /top/city/_search?q=Ijsselstein -> works, returns the document for
> IJsselstein
> /top/city/_search?q=yerseke -> *doesn't* work, returns nothing
> /top/city/_search?q=Yerseke -> *does* work, returns the document for
> Yerseke
> /top/city/_search?q=YsselsteYn -> *doesn't* work, returns nothing
> /top/city/_search?q=Ysselsteyn -> *does* work, returns the document for
> Ysselsteyn
>
> Changing the case of any other letter doesn't affect the results.
>
> I've worked around this issue by adding the mapping "Y => ij", i.e.:
>
> "char_filter": {
>   "char_mapper": {
>     "type": "mapping",
>     "mappings": [
>       "y => ij",
>       "Y => ij"
>     ]
>   }
> }
>
> This solves the problem, but I'd rather have the lowercase filter
> applied before the mapping, or be able to make the order explicit. Is
> there any stance on this issue? Or is this intended behaviour?
>
> Regards,
> Matthias Hogerheijde
>
>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/c60de452-2a3f-42f7-a677-956f81ecec17%40googlegroups.com
> <https://groups.google.com/d/msgid/elasticsearch/c60de452-2a3f-42f7-a677-956f81ecec17%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
