Are those 'blocks of text' already (Unicode) Java strings? I doubt
that is the case, but if so, you can use Character.UnicodeBlock to
identify the script of the text and, from that, narrow down the
language.
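
To make the Character.UnicodeBlock idea concrete, here is a minimal
sketch, assuming the text is already a decoded Java String. The class
name ScriptGuess and the helper dominantBlock are hypothetical names
made up for this illustration; only Character.UnicodeBlock.of and the
other java.lang / java.util calls are real JDK APIs. Note that this
only tells you the script: Arabic, Farsi, and Urdu all fall mostly in
the ARABIC block, so telling them apart would need a further step.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class for illustration only.
final class ScriptGuess {

    // Returns the Unicode block that contains the most letters in the
    // text, or null if the text contains no letters at all.
    static Character.UnicodeBlock dominantBlock(String text) {
        Map<Character.UnicodeBlock, Integer> counts = new HashMap<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation, whitespace
            }
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            if (block != null) {
                counts.merge(block, 1, Integer::sum);
            }
        }
        Character.UnicodeBlock best = null;
        int bestCount = 0;
        for (Map.Entry<Character.UnicodeBlock, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Five Arabic letters, written as escapes to avoid source-encoding issues.
        System.out.println(dominantBlock("\u0645\u0631\u062D\u0628\u0627"));
        System.out.println(dominantBlock("hello"));
    }
}
```

The block with the most letters could then be used to pick an analyzer
family, falling back to the standard analyzer when the result is
BASIC_LATIN or null.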

Or are they just text files with unknown character encodings? In that
case, ICU has a 'charset detector' that you can use. It suggests a
charset (with confidence values) from a byte stream. I don't know how
it performs in terms of accuracy and speed. See
http://userguide.icu-project.org/conversion/detection for details.

Hope it helps.

- Cheolgoo Kang



On Fri, Aug 7, 2009 at 4:46 AM, Bradford Stephens
<bradfordsteph...@gmail.com> wrote:
> Hey there,
>
> We're trying to add foreign language support into our new search
> engine -- languages like Arabic, Farsi, and Urdu (that don't work with
> standard analyzers). But our data source doesn't tell us which
> languages we're actually collecting -- we just get blocks of text. Has
> anyone here worked on language detection so we can figure out what
> analyzers to use? Are there commercial solutions?
>
> Much appreciated!
>
> --
> http://www.roadtofailure.com -- The Fringes of Scalability, Social
> Media, and Computer Science
>
