[ 
https://issues.apache.org/jira/browse/TIKA-431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107272#comment-13107272
 ] 

Ken Krugler commented on TIKA-431:
----------------------------------

For the analysis, I ran Tika's charset detection and compared the result to 
the charset declared in the <meta> http-equiv tag. See my email to the Tika 
list with subject "HUG talk on Public Terabyte Dataset project".

As I mentioned in that post, it's possible my analysis had errors, but the 
results weren't great for the percentage of the time that ICU4J matched the 
meta tag charset. When I looked at the mismatches manually, they mostly seemed 
to be issues with ICU rather than a bad meta tag charset.

From the page at http://philip.html5.org/data/charsets.html#sniffing-bytes, I 
don't see stats comparing the various declared encodings (e.g. what percentage 
of the time did response header == meta == detected), which would be useful.

I've got some crawl data that, if I had time, I could run through a similar 
analysis, this time dumping out all of the cases where the ICU result (with 
and without hints) differs from both the response header and the meta tag.
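
As a rough illustration of that kind of comparison, here is a minimal Python 
sketch. It is not Tika or ICU4J code: the record layout (url, Content-Type 
header, body bytes, detected charset) and the names normalize, header_charset, 
meta_charset and tally are all made up for this example, and the detected 
charset is assumed to come from whatever detector is being evaluated.

```python
import codecs
import re
from collections import Counter
from email.message import Message

# Hypothetical pattern: finds charset=... inside a <meta> tag (both the
# http-equiv form and the shorter <meta charset=...> form).
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def normalize(name):
    """Map charset aliases (latin1, ISO-8859-1, ...) to one canonical name."""
    if not name:
        return None
    try:
        return codecs.lookup(name.strip()).name
    except LookupError:
        # Unknown to Python's codec registry: fall back to lowercase.
        return name.strip().lower()


def header_charset(content_type):
    """Charset parameter of an HTTP Content-Type header value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return normalize(msg.get_content_charset())


def meta_charset(body):
    """Charset declared in a <meta> tag near the start of the body, or None."""
    match = META_CHARSET.search(body[:4096])
    return normalize(match.group(1).decode("ascii")) if match else None


def tally(records):
    """Count agreement and collect cases where detection differs from both."""
    counts = Counter()
    differs = []
    for url, content_type, body, detected in records:
        h = header_charset(content_type)
        m = meta_charset(body)
        d = normalize(detected)
        if h is None or m is None:
            counts["missing_declaration"] += 1
        elif h == m == d:
            counts["all_agree"] += 1
        elif d != h and d != m:
            counts["detected_differs_from_both"] += 1
            differs.append((url, h, m, d))
        else:
            counts["partial_agreement"] += 1
    return counts, differs
```

Running the detector twice (with and without hints) and calling tally once per 
run would give the two sets of numbers to compare.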

> Tika currently misuses the HTTP Content-Encoding header, and does not seem to 
> use the charset part of the Content-Type header properly.
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-431
>                 URL: https://issues.apache.org/jira/browse/TIKA-431
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>            Reporter: Erik Hetzner
>            Assignee: Ken Krugler
>         Attachments: TIKA-431.patch
>
>
> Tika currently misuses the HTTP Content-Encoding header, and does not seem to 
> use the charset part of the Content-Type header properly.
> Content-Encoding is not for the charset. It is for values like gzip, deflate, 
> compress, or identity.
> Charset is passed in with the Content-Type. For instance: text/html; 
> charset=iso-8859-1
> Tika should, in my opinion, do the following:
> 1. Stop using Content-Encoding, unless it wants me to be able to pass in 
> gzipped content in an input stream.
> 2. Parse and understand charset=... declarations if passed in the Metadata 
> object
> 3. Return charset=... declarations in the Metadata object if a charset is 
> detected.
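
The distinction drawn in the issue description can be sketched in a few lines 
of Python. This is only an illustration of the HTTP header semantics, not 
Tika's API: split_content_type, with_charset and is_content_coding are 
hypothetical helper names, and the set of content codings is just the list of 
values given in the description.

```python
from email.message import Message

# Values named in the issue description as belonging to Content-Encoding.
CONTENT_CODINGS = {"gzip", "deflate", "compress", "identity"}


def split_content_type(value):
    """Split a Content-Type value into (media type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_type(), msg.get_content_charset()


def with_charset(media_type, charset):
    """Build a Content-Type value that carries a detected charset."""
    return f"{media_type}; charset={charset}"


def is_content_coding(value):
    """True if the value is a content coding rather than a charset."""
    return value.strip().lower() in CONTENT_CODINGS
```

For example, split_content_type("text/html; charset=iso-8859-1") yields the 
media type and the charset separately, and is_content_coding("iso-8859-1") is 
False, which is why a charset does not belong in Content-Encoding.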

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
