Re: Multi-lingual Search & Accent Marks

Walter Underwood Sat, 31 Aug 2019 12:47:50 -0700

> On Aug 31, 2019, at 12:00 PM, Toke Eskildsen <t...@kb.dk> wrote:
> 
> Whenever we do this normalisation, we index two versions in our index: A very 
> lightly normalised (lowercased) field and a heavily normalised field: If a 
> record has a title "Köket" (kitchen in Swedish), we store title_orig:köket 
> and title_norm:køket. […] Going with what we do, my answer would be: Yes, do 
> preserve and also remove :-)



Right after I posted, I realized that I wanted to offer “include all” as an 
option. The variants can even go in the same field, indexed like synonyms at 
the same token position.

Also, don’t worry too much about creating junk terms in the index with nonsense 
transliterations. Terms are cheap in search indexes (up to a point). So it 
really is OK to have all of these indexed at the same position, even if the 
last one is garbage. This still has the schön/schon problem, but at least there 
is a match.

coöperation
cooperation
cooeperation (typewriter umlaut version)
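
For anyone who wants to see the "include all" idea concretely, here is a rough 
Python sketch, not a real analyzer. It assumes whitespace tokenization, and the 
TYPEWRITER table and the function names are made up for illustration. Each 
token position carries the original form, the accent-stripped form, and the 
typewriter form, with duplicates dropped.

```python
import unicodedata

# Hypothetical "typewriter" transliterations (assumption: German-style only).
TYPEWRITER = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def strip_accents(token):
    """Drop combining marks after canonical decomposition (ö -> o)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def typewriter(token):
    """Replace umlauted letters with their two-letter spelling (ö -> oe)."""
    return "".join(TYPEWRITER.get(ch, ch) for ch in token)


def variants(token):
    """Return every form to index at one position, original first."""
    token = unicodedata.normalize("NFC", token.lower())
    forms = []
    for form in (token, strip_accents(token), typewriter(token)):
        if form not in forms:  # unaccented tokens yield a single form
            forms.append(form)
    return forms


def analyze(text):
    """Pair each whitespace-separated token position with its variants."""
    return [(pos, variants(tok)) for pos, tok in enumerate(text.split())]


if __name__ == "__main__":
    for pos, forms in analyze("Coöperation wunder"):
        print(pos, forms)
```

In a real engine the same effect comes from emitting the extra terms at the 
same token position as the original, which is how synonyms are usually indexed.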

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
