Re: Multi-analyzer ?

Eric Chow Mon, 11 Apr 2005 17:53:25 -0700

But how about one document contains more than two different languages ??


Eric

On Apr 12, 2005 12:13 AM, Andy Roberts <[EMAIL PROTECTED]> wrote:
> On Monday 11 Apr 2005 14:55, Mike Baranczak wrote:
> > Your example with Arabic wouldn't work reliably either - there are
> > several other languages that use the Arabic script (Persian for
> > example).
> 
> Good point. Although you could try a simple approach to test for the
> additional characters that exist in Persian but not in Arabic. Although, this
> again is not fool-proof. A letter-model approach would be better but is
> rather time consuming.
> 
> >
> > This is the sort of problem that the end user can solve much better
> > than the software can.
> >
> 
> I completely agree, which is why I originally suggested prompting the user for
> this info. It may be the case that for the majority of queries, English is
> the usual language. And it is probably more feasible to do a test to
> determine whether the query English or not (still very tricky, mind). If not,
> then prompt the user to specify their input language because otherwise,
> results will be poor.
> 
> Andy Roberts
> 
> > -MB
> >
> > On Apr 11, 2005, at 6:02 AM, Andy Roberts wrote:
> > > Can you not provide the user with a option list to specify their input
> > > language?
> > >
> > > Language identification can be a pretty tricky field. There are some
> > > tricks
> > > you can do with unicode to identify language, e.g., \u0600 - \u06FF
> > > contains
> > > the Arabic characters, so if you're input contains lots of chars
> > > within this
> > > range, you can guess that the input is Arabic, for example.
> > >
> > > The problem comes with differentiating between the languages that use
> > > a Latin
> > > alphabet. Again, there are multiple approaches, although the only one
> > > I know
> > > of that worked pretty well for identifying European languages was to
> > > build a
> > > model based on character bigrams (that is, sequences of two letters)
> > > [1]
> > >
> > > At the end of the day, Lucene cannot help you in choosing the correct
> > > language
> > > as it doesn't know, and so it'll be up to you to add the necessary
> > > logic to
> > > tell Lucene which Analyzers to utilise. :(
> > >
> > > Andy
> > >
> > > [1] Churcher, G E; Hayes, J; Hughes, J S; Johnson, S; Souter, C.
> > > Bigram and
> > > trigram models for language identification and classification in:
> > > Evett, L &
> > > Rose,T (editors) Computational Linguistics for Speech and Handwriting
> > > Recognition AISB'94 Workshop University of Leeds/AISB. 1994.
> > >
> > > On Monday 11 Apr 2005 01:21, Eric Chow wrote:
> > >> Hello,
> > >>
> > >> If I don't know the language of the input terms, how can I use
> > >> different analyzer to search it ?
> > >>
> > >> For example, the input box accepts UTF-8 search text, they can be
> > >> anything, such as Chinese, Japanese, English, Russian, Deuch, etc. How
> > >> can search any of them or all of them with Lucene?
> > >>
> > >> Any example, please?
> > >>
> > >>
> > >> Best Regards,
> > >> Eric
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> > >> For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Multi-analyzer ?

Reply via email to