Re: Ascii folding

Uwe Schindler Sat, 11 Nov 2023 05:02:50 -0800

Hi Dawid,

the ASCII folding filter is meant to remove accents. You would like to have searching for visually similar characters. These are two different things.

Actually Robert also has some config options. What I generally use for western European searches, where some documents may contain names of people (author names, titles in Cyrillic or other languages), is to convert the tokens using ICU transliteration (use one of the ICU folding filters with the config below):

Transliterator.getInstance("Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFKC; CaseFold", Transliterator.FORWARD);

This converts everything to Latin characters in a language-neutral way and then removes all accents by the trick "decompose, remove non-spacing marks, compose again", and finally case-folds the result.
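
To illustrate only the accent-removal part of that chain, here is a minimal sketch using the JDK's java.text.Normalizer instead of ICU. It is an assumption-laden stand-in, not the ICU configuration above: it does not perform the "Any-Latin" step (Cyrillic stays Cyrillic), it approximates case folding with toLowerCase(Locale.ROOT), and the class and method names are made up for the example.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, for illustration only; not part of Lucene or ICU.
public class AccentFoldSketch {

    // Decompose, remove non-spacing marks, compose again, then lower-case.
    static String fold(String input) {
        // NFD splits e.g. U+00E9 into 'e' followed by U+0301 (combining acute).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{Mn} matches the Unicode general category "Nonspacing Mark".
        String stripped = decomposed.replaceAll("\\p{Mn}+", "");
        // NFKC recomposes whatever is left and applies compatibility mappings.
        String recomposed = Normalizer.normalize(stripped, Normalizer.Form.NFKC);
        // Assumption: lower-casing stands in for ICU's full case folding.
        return recomposed.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "Crème Brûlée", written with escapes to avoid source-encoding issues.
        System.out.println(fold("Cr\u00e8me Br\u00fbl\u00e9e"));
        // "Dvořák"
        System.out.println(fold("Dvo\u0159\u00e1k"));
        // CYRILLIC SMALL LETTER IE is left untouched: no transliteration here.
        System.out.println(fold("\u0435").equals("\u0435"));
    }
}
```

Note that a Cyrillic letter only becomes a Latin one once a transliteration step such as "Any-Latin" runs first, which this JDK-only sketch deliberately leaves out.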


Uwe

Am 10.11.2023 um 19:03 schrieb Dawid Weiss:


Hi Steve, Chris,

Ok, makes sense. Thanks for the pointers. I agree the justification for the use of character-level normalization filters is highly context-dependent (for example, unsuitable when mixed languages are present on input).


Dawid

On Fri, Nov 10, 2023 at 6:58 PM Chris Hostetter <hossman_luc...@fucit.org> wrote:



    : Here's the unicode letter after "th":
    : https://www.fileformat.info/info/unicode/char/0435/index.htm
    :
    : To my surprise, I couldn't find it in the ascii folding filter:
    :
    : https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java
    :
    : Does anybody remember whether the omission of Cyrillic characters was
    : intentional (there are quite a few of them that are nearly identical in
    : appearance to Latin letters)?

    From the javadocs, i'm going to guess it's because the filter focuses
    on "Latin_characters_in_Unicode" ... and your "CYRILLIC SMALL LETTER IE"
    isn't described as being a "(adjective) LATIN noun (WITH noun)" like all
    of the other characters that are considered to have a direct mapping to
    the "ASCII" / latin characters.

    If you look back at when it was added...

    https://issues.apache.org/jira/browse/LUCENE-1390

    ...the original focus was on deprecating "ISOLatin1AccentFilter" and
    replacing it with "a more comprehensive version of this code that included
    not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin
    Extended A unicode blocks."  (The originally proposed name was
    'ISOLatinAccentFilter') ... subsequent discussion focused on adding more
    Latin blocks.

    There was a related issue at the time which initially aimed to add a
    more general "UnicodeNormalizationFilter" that ultimately resulted in
    adding the "ICU" analysis classes...

    https://issues.apache.org/jira/browse/LUCENE-1343

    ...which IIUC may better handle "CYRILLIC SMALL LETTER IE" (but i
    haven't tested that)



    -Hoss
    http://www.lucidworks.com/


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:u...@thetaphi.de
