But how about one document contains more than two different languages ??
Eric On Apr 12, 2005 12:13 AM, Andy Roberts <[EMAIL PROTECTED]> wrote: > On Monday 11 Apr 2005 14:55, Mike Baranczak wrote: > > Your example with Arabic wouldn't work reliably either - there are > > several other languages that use the Arabic script (Persian for > > example). > > Good point. Although you could try a simple approach to test for the > additional characters that exist in Persian but not in Arabic. Although, this > again is not fool-proof. A letter-model approach would be better but is > rather time consuming. > > > > > This is the sort of problem that the end user can solve much better > > than the software can. > > > > I completely agree, which is why I originally suggested prompting the user for > this info. It may be the case that for the majority of queries, English is > the usual language. And it is probably more feasible to do a test to > determine whether the query English or not (still very tricky, mind). If not, > then prompt the user to specify their input language because otherwise, > results will be poor. > > Andy Roberts > > > -MB > > > > On Apr 11, 2005, at 6:02 AM, Andy Roberts wrote: > > > Can you not provide the user with a option list to specify their input > > > language? > > > > > > Language identification can be a pretty tricky field. There are some > > > tricks > > > you can do with unicode to identify language, e.g., \u0600 - \u06FF > > > contains > > > the Arabic characters, so if you're input contains lots of chars > > > within this > > > range, you can guess that the input is Arabic, for example. > > > > > > The problem comes with differentiating between the languages that use > > > a Latin > > > alphabet. Again, there are multiple approaches, although the only one > > > I know > > > of that worked pretty well for identifying European languages was to > > > build a > > > model based on character bigrams (that is, sequences of two letters) > > > [1] > > > > > > At the end of the day, Lucene cannot help you in choosing the correct > > > language > > > as it doesn't know, and so it'll be up to you to add the necessary > > > logic to > > > tell Lucene which Analyzers to utilise. :( > > > > > > Andy > > > > > > [1] Churcher, G E; Hayes, J; Hughes, J S; Johnson, S; Souter, C. > > > Bigram and > > > trigram models for language identification and classification in: > > > Evett, L & > > > Rose,T (editors) Computational Linguistics for Speech and Handwriting > > > Recognition AISB'94 Workshop University of Leeds/AISB. 1994. > > > > > > On Monday 11 Apr 2005 01:21, Eric Chow wrote: > > >> Hello, > > >> > > >> If I don't know the language of the input terms, how can I use > > >> different analyzer to search it ? > > >> > > >> For example, the input box accepts UTF-8 search text, they can be > > >> anything, such as Chinese, Japanese, English, Russian, Deuch, etc. How > > >> can search any of them or all of them with Lucene? > > >> > > >> Any example, please? > > >> > > >> > > >> Best Regards, > > >> Eric > > >> > > >> --------------------------------------------------------------------- > > >> To unsubscribe, e-mail: [EMAIL PROTECTED] > > >> For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]