Re: Text search in Arabic

Walter Underwood Thu, 20 May 2021 08:43:51 -0700

I recommend normalizing all text with a Unicode compatibility normalization 
(NFKC), whether it is Arabic or not.


We use this charFilter as the first step in every query and indexing analysis 
chain.

        <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
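
For context, here is a rough sketch of where that line sits in a schema. The 
field type name and the tokenizer are assumptions for illustration, not our 
actual schema. As far as I know, the factory defaults to nfkc_cf (NFKC plus 
case folding) when no attributes are given.

```xml
<!-- Hypothetical field type: the name and the tokenizer choice are
     illustrative assumptions, not a production schema. -->
<fieldType name="text_normalized" class="solr.TextField" positionIncrementGap="100">
  <!-- A single <analyzer> element applies to both indexing and querying. -->
  <analyzer>
    <!-- Char filters run on the raw text before the tokenizer sees it. -->
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Putting the charFilter ahead of the tokenizer means indexed text and query 
text are normalized the same way before any tokens are produced.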

You’ll also need to load the ICU library, which really should be included by 
default. In fact, compatibility normalization should be applied by default, 
too. That transform was designed specifically for string matching and search.

We have this in every solrconfig.xml.

  <!-- extras for ICU-based Unicode normalization -->
  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib/" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />

wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/  (my blog)

> On May 20, 2021, at 9:38 AM, Mete Kural <[email protected]> wrote:
> 
> Hello Michael,
> 
> Thank you very much for this information.
> 
> I will try at  [email protected] 
> <mailto:[email protected]> also.
> 
> By the way, is the Arabic analyzer referenced here 
> (https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/ar)
>  just for the Arabic language or all languages written with the Arabic script?
> 
> Thank you,
> Mete
> 
> 
>> On May 20, 2021, at 4:35 PM, Michael Wechner <[email protected]> 
>> wrote:
>> 
>> Hi Mete
>> 
>> You might also want to try the [email protected] mailing list
>> 
>> https://lucene.apache.org/core/discussion.html#java-user-list-java-userluceneapacheorg
>> 
>> Re languages other than english you might find more information at
>> 
>> https://cwiki.apache.org/confluence/display/lucene/LuceneFAQ#LuceneFAQ-CanIuseLucenetoindextextinChinese,Japanese,Korean,andothermulti-bytecharactersets?
>> 
>> whereas I just realize that the following link does not work anymore
>> 
>> https://lucene.apache.org/core/lucene-sandbox/
>> 
>> Are these analyzers now inside
>> 
>> https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/java/org/apache/lucene/analysis
>> https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/ar
>> 
>> ?
>> 
>> Thanks
>> 
>> Michael
>> 
>> 
>> Am 20.05.21 um 14:48 schrieb Mete Kural:
>>> Hello Lucene Community,
>>> 
>>> I hope this finds you all well. I want to ask you if this would be the 
>>> right medium to discuss some matters surrounding text search in relation to 
>>> variant Unicode codings of words in Arabic and Arabic scripted languages. 
>>> This is not a great example but the said matters are similar to matters 
>>> around Latin scripted searches where the letter “İ” needs to be substituted 
>>> with “I” in searches and so forth. Would this mailing list be the best 
>>> medium to discuss such matters? If not, would you mind recommending me a 
>>> medium for discussion on this?
>>> 
>>> Kind regards,
>>> Mete Kural
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>> 
>
