Char filters are applied to the raw text before it is tokenized, and therefore before the "normal" (token) filters such as lowercase are run, which is why they are a separate class of filter. With Lucene, the order is:
char filters -> tokenizer -> filters

Have you looked into the ICU analyzer?
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-icu-plugin.html
I have no idea how well it works with Dutch.

Cheers,

Ivan

On Mon, Aug 18, 2014 at 2:14 AM, Matthias Hogerheijde <matthias.hogerhei...@goabout.com> wrote:

> Hi,
>
> We're using Elasticsearch with an Analyzer to map the `y` character to
> `ij` (*char_filter* named "char_mapper"), since in Dutch these two are
> "somewhat" interchangeable. We're also using a *lowercase filter*.
>
> This is the configuration:
>
> {
>   "analysis": {
>     "analyzer": {
>       "index": {
>         "type": "custom",
>         "tokenizer": "standard",
>         "filter": [
>           "lowercase",
>           "synonym_twoway",
>           "standard",
>           "asciifolding"
>         ],
>         "char_filter": [
>           "char_mapper"
>         ]
>       },
>       "index_prefix": {
>         "type": "custom",
>         "tokenizer": "standard",
>         "filter": [
>           "lowercase",
>           "synonym_twoway",
>           "standard",
>           "asciifolding",
>           "prefixes"
>         ],
>         "char_filter": [
>           "char_mapper"
>         ]
>       },
>       "search": {
>         "alias": [
>           "default"
>         ],
>         "type": "custom",
>         "tokenizer": "standard",
>         "filter": [
>           "lowercase",
>           "synonym",
>           "synonym_twoway",
>           "standard",
>           "asciifolding"
>         ],
>         "char_filter": [
>           "char_mapper"
>         ]
>       },
>       "postal_code": {
>         "tokenizer": "keyword",
>         "filter": [
>           "lowercase"
>         ]
>       }
>     },
>     "tokenizer": {
>       "standard": {
>         "stopwords": []
>       }
>     },
>     "filter": {
>       "synonym": {
>         "type": "synonym",
>         "synonyms": [
>           "st => sint",
>           "jp => jan pieterszoon",
>           "mh => maarten harpertszoon"
>         ]
>       },
>       "synonym_twoway": {
>         "type": "synonym",
>         "synonyms": [
>           "den haag, s gravenhage",
>           "den bosch, s hertogenbosch"
>         ]
>       },
>       "prefixes": {
>         "type": "edgeNGram",
>         "side": "front",
>         "min_gram": 1,
>         "max_gram": 30
>       }
>     },
>     "char_filter": {
>       "char_mapper": {
>         "type": "mapping",
>         "mappings": [
>           "y => ij"
>         ]
>       }
>     }
>   }
> }
>
> When indexing cities, we're using this mapping:
>
> {
>   "properties": {
>     "city": {
>       "type": "multi_field",
>       "fields": {
>         "city": {
>           "type": "string"
>         },
>         "prefix": {
>           "type": "string",
>           "boost": 0.5,
>           "index_analyzer": "index_prefix"
>         }
>       }
>     },
>     "province_code": {
>       "type": "string"
>     },
>     "unique_name": {
>       "type": "boolean"
>     },
>     "point": {
>       "type": "geo_point"
>     },
>     "search_terms": {
>       "type": "multi_field",
>       "fields": {
>         "search_terms": {
>           "type": "string"
>         },
>         "prefix": {
>           "boost": 0.5,
>           "index_analyzer": "index_prefix",
>           "type": "string"
>         }
>       }
>     }
>   },
>   "search_analyzer": "search",
>   "index_analyzer": "index"
> }
>
> When we index all the (Dutch) cities from our data source, there are
> cities starting with both `IJ` and `Y`. (For example, these city names
> exist: *IJssel*, *IJsselstein*, *Yerseke* and *Ysselsteyn*.) It seems
> that these characters are not lowercased before the char_mapping is
> applied.
>
> Querying the index results in:
>
> /top/city/_search?q=ijsselstein -> works, returns the document for IJsselstein
> /top/city/_search?q=Ijsselstein -> works, returns the document for IJsselstein
> /top/city/_search?q=yerseke -> *doesn't* work, returns nothing
> /top/city/_search?q=Yerseke -> *does* work, returns the document for Yerseke
> /top/city/_search?q=YsselsteYn -> *doesn't* work, returns nothing
> /top/city/_search?q=Ysselsteyn -> *does* work, returns the document for Ysselsteyn
>
> Changing the case of any other letter doesn't affect the results.
>
> I've worked around this issue by adding the mapping "Y => ij", i.e.:
>
> "char_filter": {
>   "char_mapper": {
>     "type": "mapping",
>     "mappings": [
>       "y => ij",
>       "Y => ij"
>     ]
>   }
> }
>
> This solves the problem, but I'd rather have the lowercase filter
> applied before the mapping, or be able to make the order explicit. Is
> there any stance on this issue? Or is this intended behaviour?
>
> Regards,
> Matthias Hogerheijde
>
> --
> You received this message because you are subscribed to the Google Groups "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/c60de452-2a3f-42f7-a677-956f81ecec17%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
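
To make the ordering described at the top of this message concrete, here is a rough Python sketch of the chain (char filters, then tokenizer, then token filters). It is only a toy model: the function names are made up for illustration, the tokenizer just splits on whitespace, and the mapping step uses plain string replacement rather than whatever Lucene does internally. It does seem to reproduce the query results reported in the quoted message, assuming the same chain runs at index time and at search time.

```python
# Toy model of the analysis chain: char filters -> tokenizer -> token filters.
# All names here are hypothetical; this is not Lucene or Elasticsearch code.

def char_filter(text, mappings):
    """Apply case-sensitive character mappings to the raw, untokenized text.

    Simplification: plain sequential str.replace, which is good enough
    for the one- and two-entry mappings discussed in this thread.
    """
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text


def tokenize(text):
    """Stand-in for the standard tokenizer: split on whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: lowercase every token (runs after the char filter)."""
    return [token.lower() for token in tokens]


def analyze(text, mappings):
    """Run the three stages in the order char filter, tokenizer, token filter."""
    return lowercase_filter(tokenize(char_filter(text, mappings)))


ORIGINAL = {"y": "ij"}                # mapping from the original configuration
WORKAROUND = {"y": "ij", "Y": "ij"}   # mapping with the added "Y => ij"

if __name__ == "__main__":
    for mappings in (ORIGINAL, WORKAROUND):
        print("mappings:", mappings)
        for indexed, query in [
            ("Yerseke", "yerseke"),
            ("Yerseke", "Yerseke"),
            ("Ysselsteyn", "YsselsteYn"),
            ("Ysselsteyn", "Ysselsteyn"),
        ]:
            index_terms = analyze(indexed, mappings)
            query_terms = analyze(query, mappings)
            verdict = "match" if index_terms == query_terms else "no match"
            print(f"  indexed {indexed!r} as {index_terms}, "
                  f"query {query!r} as {query_terms}: {verdict}")
```

With the original mapping, the uppercase `Y` in "Yerseke" passes through the char filter untouched and is only lowercased afterwards, so the indexed term keeps its `y`, while the query "yerseke" is rewritten to start with `ij`. Adding "Y => ij" makes both sides go through the same rewrite, which is why the workaround behaves consistently.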