Robert - can you elaborate on what you mean by "just treat it at the script level"?
On Thu, Aug 6, 2009 at 10:55 PM, Robert Muir <rcm...@gmail.com> wrote: > Bradford, there is an arabic analyzer in trunk. for farsi there is > currently a patch available: > http://issues.apache.org/jira/browse/LUCENE-1628 > > one option is not to detect languages at all. > it could be hard for short queries due to the languages you mentioned > borrowing from each other. > but you do not want to apply things like stemming to the wrong language. > > instead, you could use ArabicTokenizer + ArabicNormalizationFilter + > PersianNormalizationFilter and just treat it at the script level. > > On Thu, Aug 6, 2009 at 3:46 PM, Bradford > Stephens<bradfordsteph...@gmail.com> wrote: > > Hey there, > > > > We're trying to add foreign language support into our new search > > engine -- languages like Arabic, Farsi, and Urdu (that don't work with > > standard analyzers). But our data source doesn't tell us which > > languages we're actually collecting -- we just get blocks of text. Has > > anyone here worked on language detection so we can figure out what > > analyzers to use? Are there commercial solutions? > > > > Much appreciated! > > > > -- > > http://www.roadtofailure.com -- The Fringes of Scalability, Social > > Media, and Computer Science > > > > > > -- > Robert Muir > rcm...@gmail.com > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >