Thanks Erick,
Here I will describe my research on this problem. It might be helpful for someone :)
I will divide the problem of documents in multiple languages into several subproblems:
*1. Determining the language of the text documents.*
*1.1. Determining the language of a document when the whole text is in one and the same language.* On the Lucene forum I found the following links to sites that provide tools for this task:
http://odur.let.rug.nl/~vannoord/TextCat/
http://frank.spieleck.de/ngram/
I have made some tests with the first one: the results for English and German were 100% correct guesses, but for Russian 0% (I used the encoding windows-1251, which is claimed to be supported for Russian text recognition). Link to this demo: http://odur.let.rug.nl/~vannoord/TextCat/Demo/
Other similar sites are listed at: http://odur.let.rug.nl/~vannoord/TextCat/competitors.html
Note that it is important to choose the proper Analyzer when creating the index, because searching with a Searcher that uses a different analyzer may return incorrect results. Example: if we index a German document using Lucene's GermanAnalyzer and then search for some German word in that document using StandardAnalyzer, it is possible that the word is not found. A small sketch of this pitfall follows below.
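To make the analyzer pitfall concrete, here is a minimal sketch against the Lucene 2.x API (GermanAnalyzer lives in the contrib analyzers JAR; the index path, field name and sample sentence are my own examples, not from the discussion above):

import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class AnalyzerMismatchDemo {
    public static void main(String[] args) throws Exception {
        // Index one German sentence with GermanAnalyzer, which stems terms.
        IndexWriter writer = new IndexWriter("/tmp/demo-index", new GermanAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("body", "Die Häuser sind groß",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher("/tmp/demo-index");

        // Same analyzer at query time: the query term is stemmed the same
        // way as the indexed term, so the document is found.
        Query german = new QueryParser("body", new GermanAnalyzer()).parse("Häuser");
        System.out.println("GermanAnalyzer hits:   " + searcher.search(german).length());

        // StandardAnalyzer does no German stemming, so the query term does
        // not match the stemmed term in the index: typically zero hits.
        Query standard = new QueryParser("body", new StandardAnalyzer()).parse("Häuser");
        System.out.println("StandardAnalyzer hits: " + searcher.search(standard).length());

        searcher.close();
    }
}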

*1.2. Determining the languages in a document when one part of the text is in one language and another part is in a different language.*
I did not find any tools for this, neither free nor commercial.

*2. How to keep the terms for documents when each document is in a different language.* There was a discussion about this in this forum, and the approach I like best is the one suggested in this mail: http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200211.mbox/[EMAIL PROTECTED]
It suggests indexing all the docs in one index, no matter which analyzer we use. I have done some tests with indexing this way (see the attached source files and sample text files; the names of the sample text files are important, as they are hard-coded in the sources to avoid using a language recognizer, which I did not have enough time to integrate). A sketch of the per-document analyzer idea follows below.
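Here is a minimal sketch of that single-index approach against the Lucene 2.x API. The chooseAnalyzer helper is hypothetical; in my tests its role was played by the hard-coded file names, and in a real system it would call a language recognizer such as TextCat:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.ru.RussianAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class OneIndexManyLanguages {
    // Hypothetical helper: in real code this would wrap a language
    // recognizer (e.g. TextCat); here it just switches on a label.
    static Analyzer chooseAnalyzer(String language) {
        if ("de".equals(language)) return new GermanAnalyzer();
        if ("ru".equals(language)) return new RussianAnalyzer();
        return new StandardAnalyzer(); // fallback for English and others
    }

    static void addDoc(IndexWriter writer, String language, String text) throws Exception {
        Document doc = new Document();
        // Store the language so the matching analyzer can be picked at search time too.
        doc.add(new Field("lang", language, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("body", text, Field.Store.YES, Field.Index.TOKENIZED));
        // The addDocument(Document, Analyzer) overload lets each document
        // be analyzed with its own language-specific analyzer.
        writer.addDocument(doc, chooseAnalyzer(language));
    }

    public static void main(String[] args) throws Exception {
        // One index for all languages; the writer's default analyzer is
        // only a fallback, since we pass a per-document analyzer above.
        IndexWriter writer = new IndexWriter("/tmp/multi-index", new StandardAnalyzer(), true);
        addDoc(writer, "en", "The quick brown fox");
        addDoc(writer, "de", "Die Häuser sind groß");
        writer.close();
    }
}

At search time, the stored "lang" field can then be used to pick the same analyzer for the query parser, which avoids exactly the mismatch described in section 1.1.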

*3. The encoding of the documents.*
There is another important thing: the encoding of plain text documents written in different languages. When indexing, it is important to know the encoding of each plain text document; otherwise the results are incorrect. For example, if some document is encoded in ISO-8859-1 and, when creating the index, we wrongly decide it is in UTF-16, then the search results are wrong. So maybe we also have to write some class(es) that determine the right encoding of a document based on its BOM or, if that is missing, on the text contained in it. Just some fun: MS Notepad has a bug in this sense. When a file created by Notepad contains exactly one of these texts: "this app can break" OR "tuka ima golem bug" (without a line separator), then the same Notepad cannot read it back correctly (unlike WordPad and other programs) :). The second text is Bulgarian and means "here is a big bug".
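Coming back to the BOM idea, here is a minimal sketch of BOM-based charset detection (the class name and the ISO-8859-1 fallback are my own choices, UTF-32 BOMs are omitted for brevity, and detecting the encoding from the text itself, as Notepad tries to do, is a much harder statistical problem):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

public class BomSniffer {
    // Returns the charset indicated by the file's byte order mark, or the
    // given default when no known BOM is found. Bytes that are not part of
    // a recognized BOM are pushed back so the caller can read the stream
    // from the real start of the text.
    static String detectCharset(PushbackInputStream in, String defaultCharset) throws IOException {
        byte[] bom = new byte[4];
        int n = in.read(bom, 0, 4);
        String charset = defaultCharset;
        int bomLength = 0;

        if (n >= 3 && bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB && bom[2] == (byte) 0xBF) {
            charset = "UTF-8"; bomLength = 3;
        } else if (n >= 2 && bom[0] == (byte) 0xFE && bom[1] == (byte) 0xFF) {
            charset = "UTF-16BE"; bomLength = 2;
        } else if (n >= 2 && bom[0] == (byte) 0xFF && bom[1] == (byte) 0xFE) {
            charset = "UTF-16LE"; bomLength = 2;
        }

        if (n > 0) in.unread(bom, bomLength, n - bomLength);
        return charset;
    }

    public static void main(String[] args) throws IOException {
        PushbackInputStream in = new PushbackInputStream(new FileInputStream(args[0]), 4);
        System.out.println(detectCharset(in, "ISO-8859-1"));
        in.close();
    }
}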

Best Regards,
Ivan Vasilev


Erick Erickson wrote:
I know this has been discussed several times, but sure don't remember the
answers. Search the mail archive for "multiple languages" and you'll find
some good suggestions. But as I remember, it's not a trivial issue.

But I don't see why the "three different documents" approach wouldn't work.
You could also index the same text in three different fields in a single
document, using different language analyzers for each (See
PerFieldAnalyzerWrapper).....

Erick
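For reference, Erick's three-fields suggestion with PerFieldAnalyzerWrapper might look roughly like this sketch (Lucene 2.x contrib analyzers; the field names and the choice of French as the third language are my own examples):

import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class ThreeFieldsOneDocument {
    public static void main(String[] args) throws Exception {
        // One analyzer per field: the same text goes into three fields,
        // each analyzed for a different language.
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        wrapper.addAnalyzer("body_de", new GermanAnalyzer());
        wrapper.addAnalyzer("body_fr", new FrenchAnalyzer());

        IndexWriter writer = new IndexWriter("/tmp/perfield-index", wrapper, true);

        String text = "..."; // the full text of the document
        Document doc = new Document();
        doc.add(new Field("body_en", text, Field.Store.NO, Field.Index.TOKENIZED));
        doc.add(new Field("body_de", text, Field.Store.NO, Field.Index.TOKENIZED));
        doc.add(new Field("body_fr", text, Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();
    }
}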

On 2/22/07, Ivan Vasilev <[EMAIL PROTECTED]> wrote:

Hi All,

Our application, which uses Lucene for indexing, will be used to index
documents, each of which contains parts written in different languages.
For example, some document could contain English, Chinese and Brazilian
Portuguese text. So how do we index such a document? Is there some best
practice for doing this?

What comes to my mind is to index 3 different Lucene Documents for the
real document and keep in a database the meta info that these 3
Documents are related to our real doc. For example, for myDoc.doc we
will have in the index myDocEn.doc, myDocCn.doc and myDocBr.doc, and
when searching, if the searched word is found in myDocCn.doc, we will
show the user myDoc.doc. A disadvantage here is that in this case the
occurrences of the searched item will have to be recalculated. It is
important for queries like "Red NEAR/10 fox". So if someone knows a better
practice than this, please let me know.

Thanks in advance,
Ivan

