Re: Text search in Arabic

2021-05-20 Thread Walter Underwood
I recommend normalizing all characters with a compatibility transformation, whether they are Arabic or not. We use this charFilter as the first step in every query and indexing analysis chain. You’ll also need to include the ICU library, which should be included by default.
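As a rough illustration of what a compatibility transformation does, here is a minimal sketch using only the JDK's `java.text.Normalizer` with the NFKC form. It is a stand-in, not the charFilter referred to above (which is not shown in this excerpt), and it does not apply case folding. The class and method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical demo class; not part of Lucene or ICU.
final class CompatNormalizeDemo {

    // Apply Unicode compatibility normalization (NFKC) using only the JDK.
    static String nfkc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // Render a string as U+XXXX code points so visually identical
    // but differently encoded text can be told apart.
    static String codePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM (a presentation form)
        String ligature = "\uFEFB";
        // The same letters encoded as plain LAM followed by ALEF
        String plain = "\u0644\u0627";

        // prints "U+FEFB -> U+0644 U+0627"
        System.out.println(codePoints(ligature) + " -> " + codePoints(nfkc(ligature)));
        // prints "true": both encodings match after normalization
        System.out.println(nfkc(ligature).equals(plain));
        // Non-Arabic input is affected too: fullwidth A plus the "fi" ligature
        // prints "Afi"
        System.out.println(nfkc("\uFF21\uFB01"));
    }
}
```

Applying the same normalization at both index time and query time is what makes the two encodings match; normalizing on only one side would not help.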

Re: Text search in Arabic

2021-05-20 Thread Uwe Schindler
This is only for the Arabic language. If you don't know the language and just want to assist people searching with different scripts (searching with Latin letters for Arabic text), see my other answer. Uwe On May 20, 2021 2:38:26 PM UTC, Mete Kural wrote: > Hello Michael, > Thank you very much

Re: Text search in Arabic

2021-05-20 Thread Uwe Schindler
Hi, In answer to your question about character substitutions: the ICU library does this with ICU Transformers. It can also convert all Cyrillic text to Latin during indexing and search. This greatly helps people to find stuff. A great example of a transformer is here as part of

Re: Text search in Arabic

2021-05-20 Thread Mete Kural
Hello Michael, Thank you very much for this information. I will try at java-u...@lucene.apache.org also. By the way, is the Arabic analyzer referenced here (https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/ar) just for the Arabic language

Re: Text search in Arabic

2021-05-20 Thread Michael Wechner
Hi Mete, you might also want to try the java-u...@lucene.apache.org mailing list: https://lucene.apache.org/core/discussion.html#java-user-list-java-userluceneapacheorg Regarding languages other than English, you might find more information at

Re: Lucene 9.0 snapshot names

2021-05-20 Thread Uwe Schindler
The default suffix in this system property is "SNAPSHOT", and the timestamp then comes from Maven's internal logic; this cannot be changed. By overriding the suffix explicitly (as mentioned before and as done by Jenkins), you convert it to an official "release" in Maven's sense and it is no longer a snapshot.

Re: Lucene 9.0 snapshot names

2021-05-20 Thread Uwe Schindler
Jenkins does this already: https://ci-builds.apache.org/job/Lucene/job/Lucene-Artifacts-main/242/ It uses the build number! The system property "version suffix" is responsible and is set by Jenkins. See the command line: [Lucene-Artifacts-main] $

Text search in Arabic

2021-05-20 Thread Mete Kural
Hello Lucene Community, I hope this finds you all well. I want to ask you if this would be the right medium to discuss some matters surrounding text search in relation to variant Unicode codings of words in Arabic and Arabic scripted languages. This is not a great example but the said matters

Re: Lucene 9.0 snapshot names

2021-05-20 Thread Michael Sokolov
In principle it makes sense, but is there any chance the build artifact could vary for the same SHA? We hope not, I think, but stranger things have happened. Probably an edge case not worth worrying about though, and relying on the build server's clock doesn't seem great, so +1 from me, although I

Lucene 9.0 snapshot names

2021-05-20 Thread Alan Woodward
Hi all, I’m preparing a local Lucene 9.0 snapshot build and I notice that the jar files generated by `./gradlew mavenToLocalFolder` are called something like `lucene-suggest-9.0.0-20210520.111833-1-javadoc.jar` - in other words, they include a timestamp. For my setup I’d like to replace