On Aug 29, 2012, at 9:24am, Jukka Zitting wrote:

> Hi,
> 
> On Wed, Aug 29, 2012 at 6:02 PM, chraj007 <chraj.k...@gmail.com> wrote:
>> http://lucene.472066.n3.nabble.com/file/n4004078/test.html test.html
> 
> Looks like that file has an incorrect http-equiv declaration:
> 
>    <META http-equiv="Content-Type" content="text/html; charset=utf-16">
> 
> The encoding of the file is not UTF-16.
> 
> Can you file a TIKA issue about this? Tika should be able to
> automatically detect the correct encoding and use it if the declared
> one is obviously incorrect.

See https://issues.apache.org/jira/browse/TIKA-539 for an existing issue that 
discusses the challenges of what information to trust with charset detection.

At the time of that issue, I was in favor of a heuristic that treated the 
charset from the server response and the meta tag as truth (if they agreed), 
and otherwise fell back to statistical analysis.
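
Roughly, that heuristic looks like the sketch below. This is only an
illustration, not what Tika does. The class and method names are made up, the
statistical detector is a placeholder passed in by the caller, and I'm
assuming that "agreed" means both declarations are present and name the same
charset.

```java
import java.nio.charset.Charset;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper, for illustration only.
final class CharsetHeuristic {

    /** Parses a declared charset name; empty if missing, malformed, or unsupported. */
    static Optional<Charset> parse(String name) {
        if (name == null || name.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(name.trim()));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return Optional.empty();
        }
    }

    /**
     * Trusts the declared charset only when the HTTP header and the meta tag
     * both name the same charset; otherwise defers to the statistical detector.
     */
    static Charset choose(String headerCharset,
                          String metaCharset,
                          byte[] content,
                          Function<byte[], Charset> statisticalDetector) {
        Optional<Charset> header = parse(headerCharset);
        Optional<Charset> meta = parse(metaCharset);
        if (header.isPresent() && header.equals(meta)) {
            return header.get();
        }
        return statisticalDetector.apply(content);
    }
}
```

For the test.html case above there is only the (wrong) meta tag and nothing
to agree with it, so this would go straight to the statistical fallback.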

But maybe statistical analysis is now fast/accurate enough, and we should only 
use the meta tag as a hint for ICU.
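
To make the "hint" idea concrete, here is a minimal sketch that does not call
ICU at all. It assumes some statistical detector has already produced
candidates with a 0-100 confidence, and it only nudges the score of the
candidate that matches the declared charset. The Candidate type, the bonus
value and the method names are all invented for the example.

```java
import java.util.List;

// Hypothetical illustration; not ICU or Tika code.
final class HintedDetection {

    /** A charset guess with a 0-100 confidence from some statistical detector. */
    static final class Candidate {
        final String charset;
        final int confidence;

        Candidate(String charset, int confidence) {
            this.charset = charset;
            this.confidence = confidence;
        }
    }

    // Arbitrary bonus for the declared charset; would need tuning in practice.
    static final int HINT_BONUS = 10;

    /** Returns the best-scoring charset name, or null if there are no candidates. */
    static String pick(List<Candidate> candidates, String declared) {
        String best = null;
        int bestScore = -1;
        for (Candidate c : candidates) {
            int score = c.confidence;
            // The declaration only boosts a candidate the statistics already found.
            if (declared != null && c.charset.equalsIgnoreCase(declared.trim())) {
                score = Math.min(100, score + HINT_BONUS);
            }
            if (score > bestScore) {
                bestScore = score;
                best = c.charset;
            }
        }
        return best;
    }
}
```

With this approach a declaration that matches none of the statistical
candidates (like the utf-16 one here) is simply ignored, while a plausible
declaration can still break a near tie.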

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr



