Re: Language detection library

karl wettin Thu, 03 May 2007 22:56:08 -0700


On 4 May 2007 at 02:20, Chris Lu wrote:

I suppose if a document is indexed as English or French,
then when users search for the document,
we need to parse the query as English or French too?


If you do some language specific token analysis such as stemming, yes.

Detecting the language of such small texts is sort of tricky, though. You might want to introduce more dimensions in the classifier: user location, user locale, etc. Perhaps you want to store stemmed data in language-specific fields. It might also be a good idea to place an initial query, reclassify to one of the top n scoring languages, and then replace the query.

The easiest way out is to simply ask the user what language they want to search in. That also seems to be the most common approach.
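
To make the "more dimensions" idea concrete, here is a minimal sketch in Python. It is not the LUCENE-826 code and not a Lucene API. The function names, the 0.5 prior weight, and the `body_<lang>` field naming are all assumptions made for illustration. It blends per-language scores from some text classifier with a prior derived from the user's locale, then picks the language-specific field to search.

```python
from typing import Dict, Optional


def pick_query_language(
    classifier_scores: Dict[str, float],
    locale_prior: Optional[Dict[str, float]] = None,
    prior_weight: float = 0.5,
) -> str:
    """Blend classifier scores with a locale prior and return the best language.

    classifier_scores: hypothetical per-language scores for the query text.
    locale_prior: hypothetical per-language weights derived from user locale.
    prior_weight: assumed mixing factor; 0 ignores the prior, 1 ignores the text.
    """
    locale_prior = locale_prior or {}
    blended = {
        lang: (1.0 - prior_weight) * score
        + prior_weight * locale_prior.get(lang, 0.0)
        for lang, score in classifier_scores.items()
    }
    # Highest blended score wins.
    return max(blended, key=blended.get)


def stemmed_field(base_field: str, lang: str) -> str:
    """Name of the language-specific field holding stemmed text (assumed scheme)."""
    return f"{base_field}_{lang}"


# A short query is nearly a coin toss for the classifier alone...
scores = {"en": 0.48, "fr": 0.52}
# ...but the user's locale strongly suggests English.
prior = {"en": 0.9, "fr": 0.1}

lang = pick_query_language(scores, prior)
print(stemmed_field("body", lang))
```

With these made-up numbers the text alone would favour French, while the locale prior tips the decision to English, so the query would be stemmed as English and run against the `body_en` field.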


--
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:

http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes



On 5/3/07, karl wettin <[EMAIL PROTECTED]> wrote:


On 3 May 2007 at 22:06, Mordo, Aviran (EXP N-NANNATEK) wrote:

> Does anyone know of a good language detection library that can detect what
> language a document (text) is in?

I posted this some time back:

https://issues.apache.org/jira/browse/LUCENE-826

It is a bit proof-of-concept-ish, but it does the job well if you ask
me. It uses Weka (GPL) and requires at least 150 characters of text to be trusted.



--
karl

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


