: Here's the unicode letter after "th":
: https://www.fileformat.info/info/unicode/char/0435/index.htm
: 
: To my surprise, I couldn't find it in the ASCII folding filter:
: 
: https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java
: 
: Does anybody remember whether the omission of Cyrillic characters was
: intentional? (There are quite a few of them that are nearly identical in
: appearance to Latin letters.)

From the javadocs, i'm going to guess it's because the filter focuses 
on "Latin_characters_in_Unicode" ... and your "CYRILLIC SMALL LETTER IE" 
isn't described as being a "(adjective) LATIN noun (WITH noun)" like all 
of the other characters that are considered to have a direct mapping to 
the "ASCII" / latin characters.
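
To illustrate the naming/script distinction, here is a minimal sketch using only the JDK (no Lucene). It is not how ASCIIFoldingFilter is implemented; the class name CodePointInfo and its helper methods are hypothetical, made up for this example. It only shows that Unicode names U+0435 as Cyrillic and assigns it to the Cyrillic script, and that plain normalization has nothing to decompose it into, unlike an accented Latin letter:

```java
import java.text.Normalizer;

// Hypothetical helper class, for illustration only.
class CodePointInfo {
    // Official Unicode character name, e.g. "CYRILLIC SMALL LETTER IE".
    static String name(int codePoint) {
        return Character.getName(codePoint);
    }

    // True if Unicode assigns the code point to the Latin script.
    static boolean isLatin(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.LATIN;
    }

    // Compatibility decomposition (NFKD), then removal of combining marks.
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD).replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // Latin "e", Latin "e with acute", Cyrillic "ie"
        int[] samples = {0x0065, 0x00E9, 0x0435};
        for (int cp : samples) {
            String s = new String(Character.toChars(cp));
            System.out.printf("U+%04X %s latin=%b stripped=U+%04X%n",
                cp, name(cp), isLatin(cp), stripMarks(s).codePointAt(0));
        }
    }
}
```

The accented Latin letter reduces to a plain "e" once its combining mark is stripped, while the Cyrillic letter comes back unchanged: looking like a Latin letter is a separate property from being an accented form of one.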

If you look back at when it was added...

https://issues.apache.org/jira/browse/LUCENE-1390

...the original focus was on deprecating "ISOLatin1AccentFilter" and 
replacing it with "a more comprehensive version of this code that included 
not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin 
Extended A unicode blocks."  (The originally proposed name was 
'ISOLatinAccentFilter') ... subsequent discussion focused on adding more 
Latin blocks.

There was a related issue at the time which initially aimed to add a 
more general "UnicodeNormalizationFilter" that ultimately resulted in 
adding the "ICU" analysis classes...

https://issues.apache.org/jira/browse/LUCENE-1343

...which IIUC may better handle "CYRILLIC SMALL LETTER IE" (but i haven't 
tested that)



-Hoss
http://www.lucidworks.com/
