Are those 'blocks of text' already (Unicode) Java strings? I don't think that is the case, but if it is, you can use Character.UnicodeBlock to identify the script of the text and, from that, narrow down the language.
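
A minimal sketch of that idea, assuming the text is already decoded into a Java String. The class name ScriptGuess and the method dominantBlock are made up for illustration; only Character.UnicodeBlock and the other java.lang / java.util calls are real API. It counts the letters per Unicode block and returns the most frequent block. Keep in mind that this tells you the script rather than the language: Arabic, Farsi, and Urdu are all written in Arabic script, so they would mostly land in the same block and you would still need something else to tell them apart.

```java
import java.lang.Character.UnicodeBlock;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of any library.
final class ScriptGuess {

    /**
     * Returns the Unicode block that holds the most letters in the text,
     * or null if the text contains no letters at all.
     */
    static UnicodeBlock dominantBlock(String text) {
        Map<UnicodeBlock, Integer> counts = new HashMap<>();
        UnicodeBlock best = null;
        int bestCount = 0;

        // Walk by code point so characters outside the BMP are handled too.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);

            // Skip digits, punctuation, and whitespace; they say little about the script.
            if (!Character.isLetter(cp)) {
                continue;
            }
            UnicodeBlock block = UnicodeBlock.of(cp);
            if (block == null) {
                continue; // code point is not assigned to any block
            }
            int n = counts.merge(block, 1, Integer::sum);
            if (n > bestCount) {
                bestCount = n;
                best = block;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(dominantBlock("Hello, world"));
        // Arabic-script sample, written with escapes to avoid source-encoding issues.
        System.out.println(dominantBlock("\u0633\u0644\u0627\u0645 \u062F\u0646\u06CC\u0627"));
    }
}
```

You could then map the returned block to an analyzer, for example picking an Arabic-script analyzer when the result is UnicodeBlock.ARABIC and falling back to the standard one otherwise.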

Or are they just text files with an unknown character encoding? In that case, ICU has a 'charset detector' that you can use. It 'suggests' a charset (with a confidence value for each candidate) from a byte stream. I don't know how well it performs in terms of accuracy and speed. See http://userguide.icu-project.org/conversion/detection.

Hope it helps.

- Cheolgoo Kang

On Fri, Aug 7, 2009 at 4:46 AM, Bradford Stephens<bradfordsteph...@gmail.com> wrote:
> Hey there,
>
> We're trying to add foreign language support into our new search
> engine -- languages like Arabic, Farsi, and Urdu (that don't work with
> standard analyzers). But our data source doesn't tell us which
> languages we're actually collecting -- we just get blocks of text. Has
> anyone here worked on language detection so we can figure out what
> analyzers to use? Are there commercial solutions?
>
> Much appreciated!
>
> --
> http://www.roadtofailure.com -- The Fringes of Scalability, Social
> Media, and Computer Science
>