Hi Dawid,

When I contributed to this class, I thought it was about the “looks like” relation (between source and target chars), so it would make sense to me to add Cyrillic.[1]

However, if you look at the other comments in that issue[1], you can see that conflicting language-specific issues can arise, mostly(?) about “sounds like” or existing language-specific ASCII-substitution relations, rather than simply “looks like”. So IIRC, I excluded language-specific code blocks to avoid controversy like the above; only phonetic blocks and Latin-specific blocks were included[2].

Steve

[1] https://issues.apache.org/jira/browse/LUCENE-1390?focusedCommentId=12635607&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-12635607
[2] https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html

> On Nov 10, 2023, at 12:19 PM, Dawid Weiss <dawid.we...@gmail.com> wrote:
>
> I just stumbled upon this stop word appearing in one of our indexes:
>
> thе
>
> Look closely. Can you see it? I doubt - I couldn't either. This is the hex
> dump of that:
>
> 74 68 d0 b5
>
> which means
>
> thе and the
>
> are two different things.
>
> Here's the unicode letter after "th":
> https://www.fileformat.info/info/unicode/char/0435/index.htm
>
> To my surprise, I couldn't find it in the ascii folding filter:
>
> https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java
>
> Anybody remembers whether the omission of Cyrillic characters was intentional
> (there is quite a few of them that are nearly identical in appearance to
> Latin letters).
>
> Dawid
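
The hex dump quoted above can be reproduced with a few lines of script. This is only a minimal sketch, assuming Python 3.8+ and the standard library; the helper names `utf8_hex` and `non_ascii_report` are made up for illustration and have nothing to do with Lucene or ASCIIFoldingFilter. It shows that the two visually identical tokens differ in their UTF-8 bytes, and it names the non-ASCII code point responsible.

```python
import unicodedata

LATIN = "the"            # plain ASCII stop word
LOOKALIKE = "th\u0435"   # same appearance, but ends in U+0435 instead of "e"


def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return text.encode("utf-8").hex(" ")


def non_ascii_report(text: str):
    """List (index, code point, Unicode name) for every non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 0x7F
    ]


if __name__ == "__main__":
    print(utf8_hex(LATIN))              # 74 68 65
    print(utf8_hex(LOOKALIKE))          # 74 68 d0 b5
    print(LATIN == LOOKALIKE)           # False
    print(non_ascii_report(LOOKALIKE))  # [(2, 'U+0435', 'CYRILLIC SMALL LETTER IE')]
```

A check like `non_ascii_report` only flags the character; deciding whether to fold it to a Latin letter is exactly the "looks like" versus "sounds like" question discussed in the reply.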