On Aug 29, 2012, at 9:24am, Jukka Zitting wrote:

> Hi,
>
> On Wed, Aug 29, 2012 at 6:02 PM, chraj007 <[email protected]> wrote:
>> http://lucene.472066.n3.nabble.com/file/n4004078/test.html test.html
>
> Looks like that file has an incorrect http-equiv declaration:
>
> <META http-equiv="Content-Type" content="text/html; charset=utf-16">
>
> The encoding of the file is not UTF-16.
>
> Can you file a TIKA issue about this? Tika should be able to
> automatically detect the correct encoding and use it if the declared
> one is obviously incorrect.

See https://issues.apache.org/jira/browse/TIKA-539 for an existing issue that discusses the challenges of deciding which information to trust for charset detection.

At the time of that issue, I was in favor of a heuristic that treated the server response and meta tags as truth (if they agreed), and otherwise fell back to statistical analysis.

But maybe statistical analysis is now fast and accurate enough that we should only use the meta tag as a hint for ICU.

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
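
For anyone following along, here is a rough sketch of the "trust the declarations only if they agree" heuristic mentioned above. It is written in Python for brevity and is not Tika's code. The names `choose_charset`, `meta_charset`, `normalize` and `guess_statistically` are made up for illustration, and `guess_statistically` is only a crude stand-in for a real statistical detector such as ICU's. Treating a missing HTTP charset as "no agreement" is also an assumption on my part.

```python
import codecs
import re

# Loose pattern for charset=... inside a <meta> tag; enough for a sketch only.
META_CHARSET = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE)


def normalize(name):
    """Return Python's canonical codec name, or None if missing or unknown."""
    if not name:
        return None
    try:
        return codecs.lookup(name.strip()).name
    except LookupError:
        return None


def meta_charset(data):
    """Extract the charset declared in the first 4 KB of the page, if any."""
    match = META_CHARSET.search(data[:4096])
    return match.group(1).decode("ascii") if match else None


def guess_statistically(data):
    """Hypothetical stand-in for a statistical detector (NOT ICU)."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "cp1252"


def choose_charset(data, http_charset=None, detect=guess_statistically):
    """Trust the declared charset only when HTTP header and meta tag agree."""
    declared_http = normalize(http_charset)
    declared_meta = normalize(meta_charset(data))
    if declared_http and declared_http == declared_meta:
        return declared_http
    # Declarations are missing, unknown, or in conflict: use the detector.
    return detect(data)


page = b'<META http-equiv="Content-Type" content="text/html; charset=utf-16"><p>hello</p>'
print(choose_charset(page))                             # no HTTP charset given
print(choose_charset(page, http_charset="ISO-8859-1"))  # declarations conflict
print(choose_charset(page, http_charset="UTF-16"))      # declarations agree
```

Note that the last call trusts the declared charset even though the bytes are plainly not UTF-16, which is the weakness of this approach when both declarations are wrong in the same way.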
