Thanks Erick,
Here I will describe my research on this problem. It might be helpful for someone :)
I will divide the problem of documents in multiple languages into several subproblems:
*1. Determining the language of the text documents.*
*1.1. Determining the language of a document when the whole text is in one and the same language.* On the Lucene forum I found the following links to sites that provide tools for this task:
http://odur.let.rug.nl/~vannoord/TextCat/
http://frank.spieleck.de/ngram/
I have made some tests with the first one: the results for English and German were 100% correct guesses, but for Russian 0% (I used the encoding windows-1251, which is claimed to be supported for Russian text recognition). Link to this demo: http://odur.let.rug.nl/~vannoord/TextCat/Demo/
Other similar sites are listed at: http://odur.let.rug.nl/~vannoord/TextCat/competitors.html
Note that it is important to choose the proper Analyzer when creating the index, because searching with a Searcher that uses a different analyzer may return incorrect results. Example: if we index a German document using Lucene's GermanAnalyzer and then search for some German word in that document using StandardAnalyzer, it is possible that the word is not found. A small sketch of this pitfall follows below.
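To make the analyzer pitfall concrete, here is a minimal sketch against the Lucene 2.x API (GermanAnalyzer lives in the contrib analyzers JAR; the index path, field name and sample sentence are my own examples, not from the discussion above):

import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class AnalyzerMismatchDemo {
    public static void main(String[] args) throws Exception {
        // Index one German sentence with GermanAnalyzer, which stems terms.
        IndexWriter writer = new IndexWriter("/tmp/demo-index", new GermanAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("body", "Die Häuser sind groß",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher("/tmp/demo-index");

        // Same analyzer at query time: the query term is stemmed the same
        // way as the indexed term, so the document is found.
        Query german = new QueryParser("body", new GermanAnalyzer()).parse("Häuser");
        System.out.println("GermanAnalyzer hits:   " + searcher.search(german).length());

        // StandardAnalyzer does no German stemming, so the query term does
        // not match the stemmed term in the index: typically zero hits.
        Query standard = new QueryParser("body", new StandardAnalyzer()).parse("Häuser");
        System.out.println("StandardAnalyzer hits: " + searcher.search(standard).length());

        searcher.close();
    }
}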

*1.2. Determining the languages in a document when one part of the text is in one language and another part is in a different language.*
I did not find any tools for this, neither free nor commercial.

*2. How to keep the terms for documents when each document is in a different language.* There was a discussion about this in this forum, and the approach I like best is the one suggested in this mail: http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200211.mbox/[EMAIL PROTECTED]
It suggests indexing all the docs in one index, no matter which analyzer we use. I have done some tests with indexing this way (see the attached source files and sample text files; the names of the sample text files are important, as they are hard-coded in the sources to avoid using a language recognizer, which I did not have enough time to integrate). A sketch of the per-document analyzer idea follows below.
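Here is a minimal sketch of that single-index approach against the Lucene 2.x API. The chooseAnalyzer helper is hypothetical; in my tests its role was played by the hard-coded file names, and in a real system it would call a language recognizer such as TextCat:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.ru.RussianAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class OneIndexManyLanguages {
    // Hypothetical helper: in real code this would wrap a language
    // recognizer (e.g. TextCat); here it just switches on a label.
    static Analyzer chooseAnalyzer(String language) {
        if ("de".equals(language)) return new GermanAnalyzer();
        if ("ru".equals(language)) return new RussianAnalyzer();
        return new StandardAnalyzer(); // fallback for English and others
    }

    static void addDoc(IndexWriter writer, String language, String text) throws Exception {
        Document doc = new Document();
        // Store the language so the matching analyzer can be picked at search time too.
        doc.add(new Field("lang", language, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("body", text, Field.Store.YES, Field.Index.TOKENIZED));
        // The addDocument(Document, Analyzer) overload lets each document
        // be analyzed with its own language-specific analyzer.
        writer.addDocument(doc, chooseAnalyzer(language));
    }

    public static void main(String[] args) throws Exception {
        // One index for all languages; the writer's default analyzer is
        // only a fallback, since we pass a per-document analyzer above.
        IndexWriter writer = new IndexWriter("/tmp/multi-index", new StandardAnalyzer(), true);
        addDoc(writer, "en", "The quick brown fox");
        addDoc(writer, "de", "Die Häuser sind groß");
        writer.close();
    }
}

At search time, the stored "lang" field can then be used to pick the same analyzer for the query parser, which avoids exactly the mismatch described in section 1.1.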

*3. The encoding of the documents.*
There is another important thing: the encoding of plain text documents written in different languages. When indexing, it is important to know the encoding of each plain text document; otherwise the results are incorrect. For example, if some document is encoded in ISO-8859-1 and, when creating the index, we wrongly decide it is in UTF-16, then the search results are wrong. So maybe we also have to write some class(es) that determine the right encoding of a document based on its BOM or, if that is missing, on the text contained in it. Just some fun: MS Notepad has a bug in this sense. When a file created by Notepad contains exactly one of these texts: "this app can break" OR "tuka ima golem bug" (without a line separator), then the same Notepad cannot read it back correctly (unlike WordPad and other programs) :). The second text is Bulgarian and means "here is a big bug".
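Coming back to the BOM idea, here is a minimal sketch of BOM-based charset detection (the class name and the ISO-8859-1 fallback are my own choices, UTF-32 BOMs are omitted for brevity, and detecting the encoding from the text itself, as Notepad tries to do, is a much harder statistical problem):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

public class BomSniffer {
    // Returns the charset indicated by the file's byte order mark, or the
    // given default when no known BOM is found. Bytes that are not part of
    // a recognized BOM are pushed back so the caller can read the stream
    // from the real start of the text.
    static String detectCharset(PushbackInputStream in, String defaultCharset) throws IOException {
        byte[] bom = new byte[4];
        int n = in.read(bom, 0, 4);
        String charset = defaultCharset;
        int bomLength = 0;

        if (n >= 3 && bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB && bom[2] == (byte) 0xBF) {
            charset = "UTF-8"; bomLength = 3;
        } else if (n >= 2 && bom[0] == (byte) 0xFE && bom[1] == (byte) 0xFF) {
            charset = "UTF-16BE"; bomLength = 2;
        } else if (n >= 2 && bom[0] == (byte) 0xFF && bom[1] == (byte) 0xFE) {
            charset = "UTF-16LE"; bomLength = 2;
        }

        if (n > 0) in.unread(bom, bomLength, n - bomLength);
        return charset;
    }

    public static void main(String[] args) throws IOException {
        PushbackInputStream in = new PushbackInputStream(new FileInputStream(args[0]), 4);
        System.out.println(detectCharset(in, "ISO-8859-1"));
        in.close();
    }
}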

Best Regards,
Ivan Vasilev


Erick Erickson wrote:
I know this has been discussed several times, but sure don't remember the
answers. Search the mail archive for "multiple languages" and you'll find
some good suggestions. But as I remember, it's not a trivial issue.

But I don't see why the "three different documents" approach wouldn't work.
You could also index the same text in three different fields in a single
document, using different language analyzers for each (See
PerFieldAnalyzerWrapper).....

Erick
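For reference, Erick's three-fields suggestion with PerFieldAnalyzerWrapper might look roughly like this sketch (Lucene 2.x contrib analyzers; the field names and the choice of French as the third language are my own examples):

import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class ThreeFieldsOneDocument {
    public static void main(String[] args) throws Exception {
        // One analyzer per field: the same text goes into three fields,
        // each analyzed for a different language.
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        wrapper.addAnalyzer("body_de", new GermanAnalyzer());
        wrapper.addAnalyzer("body_fr", new FrenchAnalyzer());

        IndexWriter writer = new IndexWriter("/tmp/perfield-index", wrapper, true);

        String text = "..."; // the full text of the document
        Document doc = new Document();
        doc.add(new Field("body_en", text, Field.Store.NO, Field.Index.TOKENIZED));
        doc.add(new Field("body_de", text, Field.Store.NO, Field.Index.TOKENIZED));
        doc.add(new Field("body_fr", text, Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();
    }
}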

On 2/22/07, Ivan Vasilev <[EMAIL PROTECTED]> wrote:

Hi All,

Our application, which uses Lucene for indexing, will be used to index
documents, each of which contains parts written in different languages.
For example, some document could contain English, Chinese and Brazilian
Portuguese text. So how do we index such a document? Is there some best
practice for doing this?

What comes to my mind is to index 3 different Lucene Documents for the
real document and keep in a database the meta info that these 3
Documents are related to our real doc. For example, for myDoc.doc we
will have in the index myDocEn.doc, myDocCn.doc and myDocBr.doc, and
when searching, if the searched word is found in myDocCn.doc, we will
show the user myDoc.doc. A disadvantage here is that in this case the
occurrences of the searched item will have to be recalculated. It is
important for queries like "Red NEAR/10 fox". So if someone knows a better
practice than this, please let me know.

Thanks in advance,
Ivan

