Re: Fails to detect language for UTF-8 file, but it works for ISO-latin

Mattmann, Chris A (388J) Tue, 24 Aug 2010 08:03:50 -0700

+1...

On 8/24/10 8:00 AM, "Jukka Zitting" <[email protected]> wrote:

Hi,

On Sat, Aug 21, 2010 at 5:55 PM, Jan Høydahl / Cominvent
<[email protected]> wrote:
> Detected as english. The same is true for the other test language files.
> It does not detect language for UTF-8 encoded files.

The tika-app jar doesn't do language detection by default. The
language metadata you're seeing is a result of the encoding-based
language estimate that we get from the ICU4J code we're using.
Apparently that data set categorizes ISO-8859-1 as an English-specific
character encoding.
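
To illustrate why an encoding-based estimate can mislead, here is a minimal Python sketch. It is not Tika or ICU4J code: the table `CHARSET_LANGUAGE_HINT` and the function `language_from_charset` are hypothetical names, and the mapping is an assumption that only mirrors the behaviour described above (ISO-8859-1 treated as English).

```python
# Hypothetical charset-to-language table. This is NOT ICU4J's actual data;
# it only mimics the behaviour described in this thread.
CHARSET_LANGUAGE_HINT = {
    "ISO-8859-1": "en",  # assumption: Latin-1 is categorized as English
    "UTF-8": None,       # assumption: UTF-8 carries no language hint
}


def language_from_charset(charset):
    """Return a language guess based only on the charset name, or None."""
    return CHARSET_LANGUAGE_HINT.get(charset)


# A Norwegian sentence, encoded two different ways.
norwegian = "Dette er en norsk tekst med æ, ø og å."
latin1_bytes = norwegian.encode("iso-8859-1")
utf8_bytes = norwegian.encode("utf-8")

# The byte sequences differ, but both decode to the same text.
same_text = latin1_bytes.decode("iso-8859-1") == utf8_bytes.decode("utf-8")

# The charset-based guess never looks at the content, so the answer
# depends on the encoding rather than on the language of the text.
guess_latin1 = language_from_charset("ISO-8859-1")
guess_utf8 = language_from_charset("UTF-8")

print(same_text, guess_latin1, guess_utf8)
```

The guess changes with the encoding even though the text is identical, which is why a content-based detector is the more reliable source for language metadata.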

We already dropped encoding-based language estimates from the HTML
parser, and I think we should do the same for plain text documents.

BR,

Jukka Zitting

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
