[ https://issues.apache.org/jira/browse/TIKA-431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107272#comment-13107272 ]
Ken Krugler commented on TIKA-431:
----------------------------------

For analysis, I used Tika charset detection and compared to <meta> http-equiv charset. See my email to the Tika list with subject "HUG talk on Public Terabyte Dataset project".

As I mentioned in that post, it's possible my analysis had errors, but the results weren't great for the % of time that ICU4J matched the meta tag charset. When I looked at mismatches manually, they mostly seemed to be issues with ICU, versus a bad meta tag charset.

From the page at http://philip.html5.org/data/charsets.html#sniffing-bytes, I don't see stats on comparing various declared encodings (e.g. what percentage of the time did response header == meta == detected), which would be useful.

I've got some crawl data which, if I had time, I could run through a similar analysis, but this time dump out all of the cases where ICU (with and without hints) differs from both.

> Tika currently misuses the HTTP Content-Encoding header, and does not seem to use the charset part of the Content-Type header properly.
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-431
>                 URL: https://issues.apache.org/jira/browse/TIKA-431
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>            Reporter: Erik Hetzner
>            Assignee: Ken Krugler
>         Attachments: TIKA-431.patch
>
>
> Tika currently misuses the HTTP Content-Encoding header, and does not seem to use the charset part of the Content-Type header properly.
> Content-Encoding is not for the charset. It is for values like gzip, deflate, compress, or identity.
> Charset is passed in with the Content-Type. For instance: text/html; charset=iso-8859-1
> Tika should, in my opinion, do the following:
> 1. Stop using Content-Encoding, unless it wants me to be able to pass in gzipped content in an input stream.
> 2. Parse and understand charset=... declarations if passed in the Metadata object.
> 3. Return charset=... declarations in the Metadata object if a charset is detected.
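
To make points 2 and 3 of the issue description concrete, here is a minimal sketch of reading the charset parameter out of a Content-Type value and writing one back. It uses only the JDK and is not Tika's actual implementation or the attached patch; the class name ContentTypeCharset and the methods parseCharset and withCharset are hypothetical, and the regular expression is a simplification that does not cover every legal form of the header.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of Tika).
final class ContentTypeCharset {
    // Matches a charset parameter such as "; charset=iso-8859-1" or "; charset=\"UTF-8\"".
    private static final Pattern CHARSET_PARAM = Pattern.compile(
            ";\\s*charset\\s*=\\s*\"?([^\\s;\"]+)\"?", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, or empty if none is declared
    // or the declared name is illegal or unsupported by this JVM.
    static Optional<Charset> parseCharset(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return Optional.empty();
        }
    }

    // Builds a Content-Type value that carries a detected charset.
    static String withCharset(String mediaType, Charset charset) {
        return mediaType + "; charset=" + charset.name();
    }

    public static void main(String[] args) {
        // A Content-Type value carries the charset as a parameter.
        System.out.println(parseCharset("text/html; charset=iso-8859-1"));
        // A Content-Encoding value such as "gzip" names a compression, not a charset.
        System.out.println(parseCharset("gzip"));
        System.out.println(withCharset("text/html", StandardCharsets.UTF_8));
    }
}
```

A caller would treat a charset parsed this way as one input among several (alongside the meta tag and byte-level detection), since, as the comment above notes, the declared and detected values do not always agree.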