Re: Language Detection for Analysis?

Shai Erera Thu, 06 Aug 2009 13:06:25 -0700

Robert - can you elaborate on what you mean by "just treat it at the script
level"?


On Thu, Aug 6, 2009 at 10:55 PM, Robert Muir <[email protected]> wrote:

> Bradford, there is an Arabic analyzer in trunk. For Farsi there is
> currently a patch available:
> http://issues.apache.org/jira/browse/LUCENE-1628
>
> One option is not to detect languages at all.
> Detection can be hard for short queries, because the languages you
> mentioned borrow from each other,
> and you do not want to apply things like stemming to the wrong language.
>
> Instead, you could use ArabicTokenizer + ArabicNormalizationFilter +
> PersianNormalizationFilter and just treat it at the script level.
>
> On Thu, Aug 6, 2009 at 3:46 PM, Bradford Stephens <[email protected]> wrote:
> > Hey there,
> >
> > We're trying to add foreign language support into our new search
> > engine -- languages like Arabic, Farsi, and Urdu (that don't work with
> > standard analyzers). But our data source doesn't tell us which
> > languages we're actually collecting -- we just get blocks of text. Has
> > anyone here worked on language detection so we can figure out what
> > analyzers to use? Are there commercial solutions?
> >
> > Much appreciated!
> >
> > --
> > http://www.roadtofailure.com -- The Fringes of Scalability, Social
> > Media, and Computer Science
> >
>
>
>
> --
> Robert Muir
> [email protected]
>
>
>
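
For readers following the thread, here is one way to picture the
"script level" suggestion quoted above: instead of guessing whether a
text is Arabic, Farsi, or Urdu, check only whether it is written in
Arabic script, then fold character variants that differ between those
languages. This is a standalone Python sketch, not Lucene code. The
function names and the folding table are assumptions made for
illustration, only loosely modeled on what the Arabic and Persian
normalization filters are described as doing, and it does no stemming.

```python
import unicodedata

# Illustrative folding table (an assumption, NOT Lucene's exact rules):
# map Persian/Urdu letter variants and common Arabic spelling variants
# onto a single form so that queries and documents match each other.
FOLD = {
    "\u0622": "\u0627",  # alef with madda above -> alef
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0649": "\u064A",  # alef maksura -> yeh
    "\u06CC": "\u064A",  # Farsi yeh -> Arabic yeh
    "\u06D2": "\u064A",  # yeh barree -> Arabic yeh
    "\u06A9": "\u0643",  # keheh -> kaf
    "\u0629": "\u0647",  # teh marbuta -> heh
}
TATWEEL = "\u0640"  # elongation character, carries no meaning


def is_arabic_script(text):
    """Return True if more than half of the letters are Arabic-script."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    arabic = sum(1 for c in letters if "ARABIC" in unicodedata.name(c, ""))
    return arabic * 2 > len(letters)


def normalize_arabic_script(text):
    """Drop tatweel and combining marks, then fold letter variants."""
    out = []
    for c in text:
        # Category "Mn" covers the short-vowel diacritics (harakat).
        if c == TATWEEL or unicodedata.category(c) == "Mn":
            continue
        out.append(FOLD.get(c, c))
    return "".join(out)


def analyze(text):
    """Whitespace-tokenize, normalizing only Arabic-script input."""
    if is_arabic_script(text):
        text = normalize_arabic_script(text)
    return text.split()
```

With this folding, the Persian spelling of "book" (written with keheh)
and the Arabic spelling (written with kaf) produce the same token, so no
language decision is needed before indexing or querying. Anything
language-specific, such as stemming, is deliberately left out.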
