Hi Steve, Chris,

Ok, makes sense. Thanks for the pointers. I agree that whether character-level normalization filters are justified is highly context-dependent (for example, they are unsuitable when the input mixes languages).

Dawid

On Fri, Nov 10, 2023 at 6:58 PM Chris Hostetter <hossman_luc...@fucit.org> wrote:
>
> : Here's the unicode letter after "th":
> : https://www.fileformat.info/info/unicode/char/0435/index.htm
> :
> : To my surprise, I couldn't find it in the ascii folding filter:
> :
> : > https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java
> :
> : Does anybody remember whether the omission of Cyrillic characters was
> : intentional? (There are quite a few of them that are nearly identical in
> : appearance to Latin letters.)
>
> From the javadocs, i'm going to guess it's because the filter focuses
> on "Latin_characters_in_Unicode" ... and your "CYRILLIC SMALL LETTER IE"
> isn't described as being a "(adjective) LATIN noun (WITH noun)" like all
> of the other characters that are considered to have a direct mapping to
> the "ASCII" / latin characters.
>
> If you look back at when it was added...
>
> https://issues.apache.org/jira/browse/LUCENE-1390
>
> ...the original focus was on deprecating "ISOLatin1AccentFilter" and
> replacing it with "a more comprehensive version of this code that included
> not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin
> Extended A unicode blocks." (The originally proposed name was
> 'ISOLatinAccentFilter') ... subsequent discussion focused on adding more
> Latin blocks.
>
> There was a related issue at the time which initially aimed to add a
> more general "UnicodeNormalizationFilter" that ultimately resulted in
> adding the "ICU" analysis classes...
>
> https://issues.apache.org/jira/browse/LUCENE-1343
>
> ...which IIUC may better handle "CYRILLIC SMALL LETTER IE" (but i haven't
> tested that)
>
> -Hoss
> http://www.lucidworks.com/
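
For anyone who wants to check the naming observation above, here is a minimal sketch using Python's standard unicodedata module. It is not how ASCIIFoldingFilter works (that filter uses its own hard-coded mapping), and the helper name describe is made up for this illustration. It only prints each character's Unicode name and what is left after NFKD decomposition with combining marks stripped.

```python
import unicodedata

# Characters to inspect: the Cyrillic letter from the thread, plus a plain
# and an accented Latin letter for comparison.
samples = ["\u0435", "e", "\u00e9"]


def describe(ch):
    """Return (code point, Unicode name, NFKD form without combining marks).

    Hypothetical helper for this illustration only; it is not part of Lucene.
    """
    decomposed = unicodedata.normalize("NFKD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return f"U+{ord(ch):04X}", unicodedata.name(ch), stripped


for ch in samples:
    code, name, stripped = describe(ch)
    kind = "ASCII" if stripped.isascii() else "non-ASCII"
    print(code, name, "->", repr(stripped), kind)
```

U+00E9 is named LATIN SMALL LETTER E WITH ACUTE and decomposes to a plain "e", whereas U+0435 is named CYRILLIC SMALL LETTER IE and has no decomposition, so it comes back unchanged. Mapping it to Latin "e" would need an explicit lookalike table, which is a different kind of operation from accent folding.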