I recommend normalizing all characters with a compatibility transformation, 
whether they are Arabic or not. 

We use this charFilter as the first step in every query and indexing analysis 
chain.

        <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
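
For context, here is a minimal sketch of where that charFilter could sit in a
field type. The field type name text_icu_normalized, the choice of tokenizer,
and the lowercase filter are assumptions made for illustration, not taken from
the configuration described here. The only point is that the charFilter comes
before the tokenizer in both the index and the query analyzer, so both sides
see the same normalized text.

```xml
<!-- Hypothetical field type; the name and the tokenizer/filter choices are illustrative only -->
<fieldType name="text_icu_normalized" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Normalize characters before tokenization -->
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same normalization at query time so query terms match indexed terms -->
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```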

You’ll also need to include the ICU library, which really ought to be included
by default. Actually, compatibility normalization ought to be done by default,
too. That transform was designed specifically for string matching and search.

We have this in every solrconfig.xml.

  <!-- extras for ICU-based Unicode normalization -->
  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib/" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 20, 2021, at 9:38 AM, Mete Kural <meteku...@icloud.com.INVALID> wrote:
> 
> Hello Michael,
> 
> Thank you very much for this information.
> 
> I will also try java-u...@lucene.apache.org.
> 
> By the way, is the Arabic analyzer referenced here
> (https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/ar)
> just for the Arabic language, or for all languages written in the Arabic script?
> 
> Thank you,
> Mete
> 
> 
>> On May 20, 2021, at 4:35 PM, Michael Wechner <michael.wech...@wyona.com> 
>> wrote:
>> 
>> Hi Mete
>> 
>> You might also want to try the java-u...@lucene.apache.org mailing list
>> 
>> https://lucene.apache.org/core/discussion.html#java-user-list-java-userluceneapacheorg
>> 
>> Regarding languages other than English, you might find more information at
>> 
>> https://cwiki.apache.org/confluence/display/lucene/LuceneFAQ#LuceneFAQ-CanIuseLucenetoindextextinChinese,Japanese,Korean,andothermulti-bytecharactersets?
>> 
>> although I just realized that the following link no longer works
>> 
>> https://lucene.apache.org/core/lucene-sandbox/
>> 
>> Are these analyzers now inside
>> 
>> https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/java/org/apache/lucene/analysis
>> https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/ar
>> 
>> ?
>> 
>> Thanks
>> 
>> Michael
>> 
>> 
>> Am 20.05.21 um 14:48 schrieb Mete Kural:
>>> Hello Lucene Community,
>>> 
>>> I hope this finds you all well. Is this the right place to discuss text
>>> search issues related to variant Unicode encodings of words in Arabic and
>>> other Arabic-script languages? It is not a great example, but the issues
>>> are similar to those in Latin-script search, where the letter “İ” needs to
>>> be substituted with “I” in searches, and so forth. If this mailing list is
>>> not the best place for such a discussion, would you mind recommending one?
>>> 
>>> Kind regards,
>>> Mete Kural
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>> 
>> 
>> 
>> 
> 
