+1...

On 8/24/10 8:00 AM, "Jukka Zitting" <[email protected]> wrote:

Hi,

On Sat, Aug 21, 2010 at 5:55 PM, Jan Høydahl / Cominvent <[email protected]> wrote:
> Detected as english. The same is true for the other test language files.
> It does not detect language for UTF-8 encoded files.

The tika-app jar doesn't do language detection by default. The
language metadata you're seeing is a result of the encoding-based
language estimate that we get from the ICU4J code we're using.
Apparently that data set categorizes ISO-8859-1 as an English-specific
character encoding.

We already dropped encoding-based language estimates from the HTML
parser, and I think we should do the same also for plain text
documents.

BR,

Jukka Zitting

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
