Re: Multiple Languages with Lucene (Arabic & English)

Grant Ingersoll Tue, 24 Jul 2007 05:11:20 -0700


On Jul 24, 2007, at 3:21 AM, Elie Choueiri wrote:

Hi
I'm new to searching and am trying to use Lucene to search English & Arabic
documents.  I've got a bunch of questions (hopefully you'll find some
interesting!) and am hoping someone's gone through some of them and has some
answers for me!
First, do I have to worry about the Arabic Analyzer overwriting the index
files of the English analyzer? (Or vice versa?)

i.e. When I index documents a second time, will data be overwritten?

That depends on whether you tell Lucene to create a new index or not. See the IndexWriter API for your options.

I could just store the index files for different languages in a different
location, but it's good to know and I'd rather not if I don't have to :)
Also, on the same note, if I'm indexing documents that contain both Arabic
and English, will the index files created by the English (or Arabic)
analyzer contain garbage or become corrupted because of the language
difference?

I don't know if it will be corrupted, but it probably won't be all that useful, either. You may find the PerFieldAnalyzerWrapper to be helpful.
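
To make the per-field idea a bit more concrete, here is a rough plain-Java sketch of the kind of dispatch a per-field wrapper performs: pick an analyzer by field name and fall back to a default for everything else. This is not Lucene's class or its API. The names PerFieldDispatch, addAnalyzer, simpleEnglish and simpleArabic are hypothetical, and the two "analyzers" are deliberately trivial stand-ins (split and lowercase, plus stripping Arabic diacritics and tatweel on the Arabic side).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Toy illustration only: NOT Lucene's PerFieldAnalyzerWrapper.
class PerFieldDispatch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldDispatch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Register a field-specific analyzer (hypothetical helper).
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    // Use the analyzer registered for this field, else the default one.
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    // Stand-in "English" analyzer: split on non-letters/digits, lowercase.
    static List<String> simpleEnglish(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Stand-in "Arabic" analyzer: keep combining marks inside tokens while
    // splitting, then strip tatweel (U+0640) and the short-vowel marks
    // (U+064B..U+0652). Latin tokens pass through, only lowercased.
    static List<String> simpleArabic(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{M}\\p{N}]+")) {
            String cleaned = t.replaceAll("[\\u0640\\u064B-\\u0652]", "");
            if (!cleaned.isEmpty()) out.add(cleaned.toLowerCase(Locale.ROOT));
        }
        return out;
    }
}
```

With something like this, a hypothetical "body_ar" field would go through simpleArabic while every other field uses simpleEnglish. The same analyzer has to be applied to a field at query time as at index time, otherwise the terms will not line up.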

It is possible to index (using an English/Latin/Standard analyzer) a file
that contains both English and Arabic words, and expect the searches in
English using the same analyzer to be valid, right?

I should think so. I don't recall running across this case too much, but I do remember the reverse: Arabic files w/ some English, where the Arabic analyzer usually just skipped over the English, leaving it intact, so searching those English terms in the Arabic index worked just fine.

In an Arabic document with a single English word (the name of a corporation,
for example) will the English word even be indexed and located by a search?
I could test something like this with a small subset of documents, but I
doubt the actual usefulness of a test with such a tiny (relatively
speaking!) amount of data. I know we can tell Lucene to store the full copy
of the document, but does that affect the index itself?
Finally, and here's the tricky one, are searches that contain both English
and Arabic words valid? My limited understanding of the way search engines
work tells me the search analyzes the context of words as well as
statistical data to decide the relevance of hits; is this still valid for
multi-lingual searches?

They are valid; I'm just not sure how useful, but that is for your app to decide. I guess if your users know both Arabic and English, it probably isn't a big deal. Lucene just tries to match up what is in the query w/ what is in the index, so if you have validly analyzed tokens in both the query and the index then Lucene should find them.
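
As a rough illustration of that term-matching point, here is a toy inverted index in plain Java. It is a sketch under loud assumptions and not how Lucene is implemented: ToyIndex, analyze, add and search are hypothetical names, the analysis is a bare split-and-lowercase, the query is treated as a simple OR of its terms, and there is no scoring at all. It only shows that once the query and the documents go through the same analysis, Arabic and English terms are looked up the same way.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: term -> ids of the documents containing it.
class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Same analysis for documents and queries: split on anything that is
    // not a letter, combining mark, or digit, then lowercase.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{M}\\p{N}]+")) {
            if (!t.isEmpty()) terms.add(t.toLowerCase(Locale.ROOT));
        }
        return terms;
    }

    // Index a document under the given id.
    void add(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // OR query: ids of documents that contain at least one query term.
    Set<Integer> search(String query) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : analyze(query)) {
            hits.addAll(postings.getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }
}
```

A query mixing an English word and an Arabic word simply returns the union of the documents holding either term. How useful the ranking of such a result is remains a question for the application.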


HTH,
Grant

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ


