Hi Dawid,

When I contributed to this class, I thought it was about the “looks like” 
relation (between source and target chars), so it would make sense to me to add 
Cyrillic.[1]

However, if you look at the other comments in that issue[1], you can see that 
there are conflicting language-specific issues that can arise, mostly(?) about 
“sounds like” or existing-language-specific-ascii-substitution relations, 
rather than simply “looks like”.

So IIRC, I excluded language-specific code blocks to avoid controversy like ^; 
only phonetic blocks and Latin-specific blocks were included[2].
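
For illustration only, here is a minimal sketch of how the lookalike from the 
quoted message could be spotted outside the folding filter. It uses only the 
JDK; the class and method names (MixedScriptCheck, utf8Hex, scripts, 
isMixedScript) are hypothetical and not part of Lucene, and flagging 
mixed-script tokens is just one possible approach, not something 
ASCIIFoldingFilter does.

```java
import java.nio.charset.StandardCharsets;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene: inspects a token's bytes and scripts.
public class MixedScriptCheck {

  /** Returns the UTF-8 bytes of s as space-separated lowercase hex. */
  static String utf8Hex(String s) {
    StringBuilder sb = new StringBuilder();
    for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
      if (sb.length() > 0) sb.append(' ');
      sb.append(String.format("%02x", b & 0xff));
    }
    return sb.toString();
  }

  /** Collects the Unicode scripts of all letters in the token. */
  static Set<Character.UnicodeScript> scripts(String token) {
    Set<Character.UnicodeScript> result = EnumSet.noneOf(Character.UnicodeScript.class);
    token.codePoints()
        .filter(Character::isLetter)
        .forEach(cp -> result.add(Character.UnicodeScript.of(cp)));
    return result;
  }

  /** True if the token's letters come from more than one script. */
  static boolean isMixedScript(String token) {
    return scripts(token).size() > 1;
  }

  public static void main(String[] args) {
    String lookalike = "th\u0435"; // ends in U+0435 CYRILLIC SMALL LETTER IE
    System.out.println(utf8Hex(lookalike));       // 74 68 d0 b5
    System.out.println(utf8Hex("the"));           // 74 68 65
    System.out.println(isMixedScript(lookalike)); // true
    System.out.println(isMixedScript("the"));     // false
  }
}
```

A check like this only reports that a token mixes Latin and Cyrillic letters; 
deciding what to fold it to is exactly the language-specific question above.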

Steve

[1] 
https://issues.apache.org/jira/browse/LUCENE-1390?focusedCommentId=12635607&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-12635607
[2] 
https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html

> On Nov 10, 2023, at 12:19 PM, Dawid Weiss <dawid.we...@gmail.com> wrote:
> 
> 
> I just stumbled upon this stop word appearing in one of our indexes:
> 
> thе
> 
> Look closely. Can you see it? I doubt it; I couldn't either. This is the hex 
> dump of that:
> 
> 74 68 d0 b5
> 
> which means 
> 
> thе and the 
> 
> are two different things.
> 
> Here's the unicode letter after "th":
> https://www.fileformat.info/info/unicode/char/0435/index.htm
> 
> To my surprise, I couldn't find it in the ascii folding filter:
> 
> https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java
> 
> Does anybody remember whether the omission of Cyrillic characters was 
> intentional? (There are quite a few of them that are nearly identical in 
> appearance to Latin letters.)
> 
> Dawid
