Using a char_filter in combination with a lowercase filter

Matthias Hogerheijde Mon, 18 Aug 2014 02:14:42 -0700

Hi,

We're using Elasticsearch with a custom analyzer that maps the `y` character 
to `ij` (a *char_filter* named "char_mapper"), since in Dutch these two are 
"somewhat" interchangeable. We're also using a *lowercase filter*.


This is the configuration:

{
  "analysis": {
    "analyzer": {
      "index": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "synonym_twoway",
          "standard",
          "asciifolding"
        ],
        "char_filter": [
          "char_mapper"
        ]
      },
      "index_prefix": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "synonym_twoway",
          "standard",
          "asciifolding",
          "prefixes"
        ],
        "char_filter": [
          "char_mapper"
        ]
      },
      "search": {
        "alias": [
          "default"
        ],
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "synonym",
          "synonym_twoway",
          "standard",
          "asciifolding"
        ],
        "char_filter": [
          "char_mapper"
        ]
      },
      "postal_code": {
        "tokenizer": "keyword",
        "filter": [
          "lowercase"
        ]
      }
    },
    "tokenizer": {
      "standard": {
        "stopwords": []
      }
    },
    "filter": {
      "synonym": {
        "type": "synonym",
        "synonyms": [
          "st => sint",
          "jp => jan pieterszoon",
          "mh => maarten harpertszoon"
        ]
      },
      "synonym_twoway": {
        "type": "synonym",
        "synonyms": [
          "den haag, s gravenhage",
          "den bosch, s hertogenbosch"
        ]
      },
      "prefixes": {
        "type": "edgeNGram",
        "side": "front",
        "min_gram": 1,
        "max_gram": 30
      }
    },
    "char_filter": {
      "char_mapper": {
        "type": "mapping",
        "mappings": [
          "y => ij"
        ]
      }
    }
  }
}

When indexing cities, we're using this mapping:

{
  "properties": {
    "city": {
      "type": "multi_field",
      "fields": {
        "city": {
          "type": "string"
        },
        "prefix": {
          "type": "string",
          "boost": 0.5,
          "index_analyzer": "index_prefix"
        }
      }
    },
    "province_code": {
      "type": "string"
    },
    "unique_name": {
      "type": "boolean"
    },
    "point": {
      "type": "geo_point"
    },
    "search_terms": {
      "type": "multi_field",
      "fields": {
        "search_terms": {
          "type": "string"
        },
        "prefix": {
          "boost": 0.5,
          "index_analyzer": "index_prefix",
          "type": "string"
        }
      }
    }
  },
  "search_analyzer": "search",
  "index_analyzer": "index"
}

When we index all the (Dutch) cities from our data source, there are cities 
starting with both `IJ` and `Y` (for example, these city names exist: 
*IJssel*, *IJsselstein*, *Yerseke* and *Ysselsteyn*). It seems that these 
characters are not lowercased before the character mapping is applied.

Querying the index results in:

/top/city/_search?q=ijsselstein -> works, returns the document for 
IJsselstein
/top/city/_search?q=Ijsselstein -> works, returns the document for 
IJsselstein
/top/city/_search?q=yerseke -> *doesn't* work, returns nothing
/top/city/_search?q=Yerseke -> *does* work, returns the document for Yerseke
/top/city/_search?q=YsselsteYn -> *doesn't* work, returns nothing
/top/city/_search?q=Ysselsteyn -> *does* work, returns the document for 
Ysselsteyn

Changing the case of any other letter doesn't affect the results.
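
These results are consistent with the character filter seeing the raw text 
before the tokenizer and the lowercase token filter run. A minimal Python 
sketch of that ordering follows. It only imitates the analysis chain and is 
not Elasticsearch code: the names `char_mapper` and `analyze` are 
hypothetical, whitespace splitting stands in for the standard tokenizer, and 
the synonym and asciifolding filters are left out.

```python
def char_mapper(text, mappings):
    # Character filters work on the raw input, before tokenization.
    # Simplification: plain sequential replacement of each mapping.
    for src, dst in mappings.items():
        text = text.replace(src, dst)
    return text


def analyze(text, mappings):
    filtered = char_mapper(text, mappings)
    tokens = filtered.split()            # stand-in for the standard tokenizer
    return [t.lower() for t in tokens]   # stand-in for the lowercase filter


ORIGINAL = {"y": "ij"}

# Index time: the capital Y is not mapped, it is only lowercased afterwards.
print(analyze("Yerseke", ORIGINAL))      # ['yerseke']
# Query time: the lowercase y is mapped, so the term no longer matches.
print(analyze("yerseke", ORIGINAL))      # ['ijerseke']
```

Under this assumption, `q=Yerseke` produces the same term that was indexed, 
while `q=yerseke` produces a different one, which matches what we observe.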

I've worked around this issue by adding the mapping "Y => ij", i.e.:

"char_filter": {
  "char_mapper": {
    "type": "mapping",
    "mappings": [
      "y => ij",
      "Y => ij"
    ]
  }
}
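
To illustrate why the extra mapping helps, here is a small Python sketch 
under the same assumption that the character mapping runs before 
lowercasing. It is a rough imitation of the analysis chain, not Elasticsearch 
code: `analyze` is a hypothetical helper, whitespace splitting stands in for 
the standard tokenizer, and the other token filters are omitted.

```python
def analyze(text, mappings):
    # 1. Character mapping on the raw text (simplified sequential replace).
    for src, dst in mappings.items():
        text = text.replace(src, dst)
    # 2. Tokenize (whitespace split as a stand-in), then 3. lowercase.
    return [token.lower() for token in text.split()]


WORKAROUND = {"y": "ij", "Y": "ij"}

# With both cases mapped, every spelling variant yields the same term.
variants = ["Yerseke", "yerseke", "YERSEKE"]
terms = [analyze(v, WORKAROUND) for v in variants]
print(terms)
print(all(t == terms[0] for t in terms))
```

With both `y` and `Y` mapped, the indexed term and the query term agree 
regardless of the case used in the query.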

This solves the problem, but I'd rather have the lowercase filter applied 
before the mapping, or be able to make the order explicit. Is there any 
stance on this issue, or is this intended behaviour?

Regards,
Matthias Hogerheijde


