On 4 May 2007, at 02:20, Chris Lu wrote:

I suppose if a document is indexed as English or French,
then when users search for that document,
we also need to parse the query as English or French?

If you do language-specific token analysis such as stemming, yes.
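
A minimal sketch of why the query has to go through the same analysis as the document. It is not Lucene code: the suffix lists, `analyze`, `build_index` and `search` are hypothetical stand-ins for a real stemmer and index, made up for illustration only.

```python
from collections import defaultdict

# Toy suffix lists, purely illustrative. A real system would use a proper stemmer.
SUFFIXES = {
    "en": ("ing", "ed", "s"),
    "fr": ("ement", "er", "es", "s"),
}


def analyze(text, lang):
    """Lowercase, split on whitespace, strip the first matching suffix for lang."""
    tokens = []
    for word in text.lower().split():
        for suffix in SUFFIXES.get(lang, ()):
            # Only strip when a stem of at least three characters remains.
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        tokens.append(word)
    return tokens


def build_index(docs):
    """docs: iterable of (doc_id, lang, text). Returns term -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, lang, text in docs:
        for token in analyze(text, lang):
            index[token].add(doc_id)
    return index


def search(index, query, lang):
    """Return ids of documents that contain every analyzed query token."""
    tokens = analyze(query, lang)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


docs = [
    (1, "en", "searching documents"),
    (2, "fr", "chercher des documents"),
]
index = build_index(docs)

english_hits = search(index, "searched", "en")  # query stemmed like the document
french_hits = search(index, "searched", "fr")   # wrong analysis, term not found
```

With the English rules both "searching" and "searched" reduce to the same term, so the document is found. Analyzed as French, the query term is left untouched and matches nothing.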

Detecting the language of such short texts is tricky, though. You might want to introduce more dimensions in the classifier: user location, user locale, etc. Perhaps you want to store stemmed data in language-specific fields. It might also be a good idea to place an initial query, re-classify to one of the top n scoring languages, and then replace the query.
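
A rough sketch of two of these ideas, under stated assumptions. `rank_languages` blends classifier scores with a prior taken from something like the user locale, and `reclassify_from_hits` is one possible reading of the re-classification step: look at the languages of the top n hits of an initial query and pick the most frequent one. Both function names, the `prior_weight` parameter and all the numbers are hypothetical.

```python
from collections import Counter


def rank_languages(classifier_scores, locale_prior, prior_weight=0.5):
    """Blend per-language classifier scores with a prior, best language first.

    Both arguments map language code -> score in [0, 1]. A language missing
    from one of them counts as 0.0 there.
    """
    languages = set(classifier_scores) | set(locale_prior)
    blended = {
        lang: (1 - prior_weight) * classifier_scores.get(lang, 0.0)
        + prior_weight * locale_prior.get(lang, 0.0)
        for lang in languages
    }
    return sorted(blended.items(), key=lambda item: item[1], reverse=True)


def reclassify_from_hits(hit_languages, n):
    """Return the most frequent language among the top n hits, or None."""
    top = hit_languages[:n]
    if not top:
        return None
    return Counter(top).most_common(1)[0][0]


# The classifier is unsure about a short query, the locale tips the balance.
ranked = rank_languages(
    {"en": 0.40, "fr": 0.35, "sv": 0.25},
    {"fr": 0.9, "en": 0.1},
)
best_language = ranked[0][0]

# Languages of the hits of an initial query, in ranking order.
query_language = reclassify_from_hits(["fr", "fr", "en", "sv"], n=3)
```

The query would then be parsed again with the analyzer for the chosen language and run against the matching language-specific field.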

The easiest way out is to simply ask the user what language they want to search in. That also seems to be the most common approach.



--
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes


On 5/3/07, karl wettin <[EMAIL PROTECTED]> wrote:

On 3 May 2007, at 22:06, Mordo, Aviran (EXP N-NANNATEK) wrote:

> Does anyone know of a good language detection library that can detect what
> language a document (text) is in?

I posted this some time back:

https://issues.apache.org/jira/browse/LUCENE-826

It is a bit of a proof of concept, but it does the job well if you ask
me. It uses Weka (GPL) and requires at least 150 characters to be trusted.


--
karl

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


