Hi Steve, Chris,

OK, makes sense. Thanks for the pointers. I agree the justification for the
use of character-level normalization filters is highly context-dependent
(for example, they are unsuitable when mixed languages are present in the input).
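
For the archives, here is a minimal JDK-only sketch of the naming and script
distinction Chris describes below. It does not use Lucene and does not
reproduce what ASCIIFoldingFilter does internally; the class and method names
(FoldingCandidates, isLatin, decompose) are made up for illustration. It only
shows how the JDK classifies the code points in question.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Lucene: inspects how the JDK classifies
// a few code points that look alike but belong to different scripts.
final class FoldingCandidates {

    /** True if the code point belongs to the Latin script. */
    static boolean isLatin(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.LATIN;
    }

    /** Canonical decomposition (NFD) of a single code point. */
    static String decompose(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // LATIN SMALL LETTER E, LATIN SMALL LETTER E WITH ACUTE,
        // CYRILLIC SMALL LETTER IE
        int[] samples = {0x0065, 0x00E9, 0x0435};
        for (int cp : samples) {
            System.out.printf("U+%04X %s latin=%b nfdLength=%d%n",
                    cp, Character.getName(cp), isLatin(cp), decompose(cp).length());
        }
    }
}
```

U+00E9 is a Latin-script character whose canonical decomposition starts with a
plain "e", whereas U+0435 is in the Cyrillic script and has no decomposition
at all, so a Latin-focused folding has nothing to derive a mapping from.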

Dawid

On Fri, Nov 10, 2023 at 6:58 PM Chris Hostetter <hossman_luc...@fucit.org>
wrote:

>
> : Here's the unicode letter after "th":
> : https://www.fileformat.info/info/unicode/char/0435/index.htm
> :
> : To my surprise, I couldn't find it in the ascii folding filter:
> :
> : https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java
> :
> : Does anybody remember whether the omission of Cyrillic characters was
> : intentional? (There are quite a few of them that are nearly identical in
> : appearance to Latin letters.)
>
> From the javadocs, i'm going to guess it's because the filter focuses
> on "Latin_characters_in_Unicode" ... and your "CYRILLIC SMALL LETTER IE"
> isn't described as being a "(adjective) LATIN noun (WITH noun)" like all
> of the other characters that are considered to have a direct mapping to
> the "ASCII" / latin characters.
>
> If you look back at when it was added...
>
> https://issues.apache.org/jira/browse/LUCENE-1390
>
> ...the original focus was on deprecating "ISOLatin1AccentFilter" and
> replacing it with "a more comprehensive version of this code that included
> not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin
> Extended A unicode blocks."  (The originally proposed name was
> 'ISOLatinAccentFilter') ... subsequent discussion focused on adding more
> Latin blocks.
>
> There was a related issue at the time which initially aimed to add a
> more general "UnicodeNormalizationFilter" that ultimately resulted in
> adding the "ICU" analysis classes...
>
> https://issues.apache.org/jira/browse/LUCENE-1343
>
> ...which IIUC may better handle "CYRILLIC SMALL LETTER IE" (but i haven't
> tested that)
>
>
>
> -Hoss
> http://www.lucidworks.com/