Hi,

In answer to your question about character substitutions: the ICU library 
does this with ICU transforms. It can also convert all Cyrillic text to Latin 
during both indexing and search, which greatly helps people find what they 
are looking for.

A great example of such a transform is in Elasticsearch's documentation. I 
regularly use it when the language of the text is unknown and the text can 
only be tokenized: 
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-transform.html

The example there first transliterates any text to Latin characters, then 
decomposes umlauts and accented characters, strips the accents after the 
decomposition, and composes the remaining characters again. After that you 
have tokens that are mostly Latin, without any accents.
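
To illustrate only the decompose / strip / recompose part of that chain, here 
is a minimal sketch in plain Java. It uses java.text.Normalizer from the JDK 
rather than ICU, so it leaves out the transliteration-to-Latin step entirely 
(that part needs ICU). The class and method names are made up for this 
example and are not part of Lucene or ICU.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not a Lucene or ICU class.
final class AccentFolding {
    // Unicode general category Mn = nonspacing marks (combining accents, dots, ...).
    private static final Pattern NONSPACING_MARKS = Pattern.compile("\\p{Mn}");

    static String foldAccents(String input) {
        // 1. Decompose: a precomposed letter becomes base letter + combining mark(s).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // 2. Strip the combining marks that the decomposition exposed.
        String stripped = NONSPACING_MARKS.matcher(decomposed).replaceAll("");
        // 3. Recompose whatever is left.
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(foldAccents("M\u00fcller"));   // prints Muller
        System.out.println(foldAccents("caf\u00e9"));     // prints cafe
        System.out.println(foldAccents("\u0130stanbul")); // prints Istanbul
    }
}
```

The last call covers the "İ" to "I" case from the original question: U+0130 
decomposes into "I" plus a combining dot above, and the dot is then removed. 
Letters without a canonical decomposition (for example "ø" or "ß") pass 
through unchanged.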

You can also use this in Solr or plain Lucene (ICUTransformFilter).

Uwe

On May 20, 2021 1:35:45 PM UTC, Michael Wechner 
<michael.wech...@wyona.com> wrote:
>Hi Mete
>
>You might also want to try the java-u...@lucene.apache.org mailing list
>
>https://lucene.apache.org/core/discussion.html#java-user-list-java-userluceneapacheorg
>
>Regarding languages other than English, you might find more information at
>
>https://cwiki.apache.org/confluence/display/lucene/LuceneFAQ#LuceneFAQ-CanIuseLucenetoindextextinChinese,Japanese,Korean,andothermulti-bytecharactersets?
>
>although I just realized that the following link does not work anymore
>
>https://lucene.apache.org/core/lucene-sandbox/
>
>Are these analyzers now inside
>
>https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/java/org/apache/lucene/analysis
>https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/ar
>
>?
>
>Thanks
>
>Michael
>
>
>On 20.05.21 at 14:48, Mete Kural wrote:
>> Hello Lucene Community,
>>
>> I hope this finds you all well. I want to ask you if this would be
>> the right medium to discuss some matters surrounding text search in
>> relation to variant Unicode codings of words in Arabic and Arabic
>> scripted languages. This is not a great example but the said matters
>> are similar to matters around Latin scripted searches where the letter
>> “İ” needs to be substituted with “I” in searches and so forth. Would
>> this mailing list be the best medium to discuss such matters? If not,
>> would you mind recommending me a medium for discussion on this?
>>
>> Kind regards,
>> Mete Kural
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>
>

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de
