: Here's the Unicode letter after "th":
: https://www.fileformat.info/info/unicode/char/0435/index.htm
: To my surprise, I couldn't find it in the ASCII folding filter:
: Does anybody remember whether the omission of Cyrillic characters was
: intentional? (There are quite a few of them that are nearly identical in
: appearance to Latin letters.)

From the javadocs, i'm going to guess it's because the filter focuses
on "Latin_characters_in_Unicode" ... and your "CYRILLIC SMALL LETTER IE"
isn't described as being a "(adjective) LATIN noun (WITH noun)" like all
of the other characters that are considered to have a direct mapping to
the "ASCII" / latin characters.
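
To illustrate the naming point, here is a quick sketch using Python's standard unicodedata module. It only inspects Unicode character names; it is not how the folding filter itself decides what to map, and the helper name describes_latin_letter is made up for this example.

```python
import unicodedata


def describes_latin_letter(ch: str) -> bool:
    """Return True if the character's Unicode name contains the word LATIN.

    Hypothetical helper for illustration only; the filter does not
    necessarily work this way.
    """
    return "LATIN" in unicodedata.name(ch, "").split()


# U+00E9 (e with acute), U+0435 (the Cyrillic letter in question), plain ASCII e
for ch in ["\u00e9", "\u0435", "e"]:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  latin={describes_latin_letter(ch)}")
```

The accented "e" is named as a LATIN letter "WITH ACUTE", while U+0435 is named purely as a CYRILLIC letter, even though the two glyphs look nearly identical on screen.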

If you look back at when it was added...


...the original focus was on deprecating "ISOLatin1AccentFilter" and 
replacing it with "a more comprehensive version of this code that included 
not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin 
Extended A unicode blocks."  (The originally proposed name was 
'ISOLatinAccentFilter') ... subsequent discussion focused on adding more 
Latin blocks.

There was a related issue at the time which initially aimed to add a
more general "UnicodeNormalizationFilter" that ultimately resulted in
adding the "ICU" analysis classes...


...which IIUC may better handle "CYRILLIC SMALL LETTER IE" (but i haven't
tested that).

