[ 
https://issues.apache.org/jira/browse/TIKA-431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107273#comment-13107273
 ] 

Ken Krugler commented on TIKA-431:
----------------------------------

Re "if there is any ambiguity, then its clearly wrong already": if the 
response header and the meta tag disagree, then clearly one of them is wrong, 
but in my experience meta tags are much more accurate than server response 
headers.
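
To make that preference concrete, here is a minimal sketch of one way a caller 
might pick between the two sources, assuming both charset names have already 
been extracted as strings. The class and method names (CharsetPreference, 
lookup, choose) are hypothetical and are not part of Tika; only 
java.nio.charset.Charset is a real API here.

```java
import java.nio.charset.Charset;
import java.util.Optional;

// Hypothetical helper for illustration only; not Tika code.
final class CharsetPreference {
    private CharsetPreference() {}

    // Resolve a charset name, returning empty if it is missing,
    // syntactically illegal, or not supported by this JVM.
    static Optional<Charset> lookup(String name) {
        if (name == null || name.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(name.trim()));
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and
            // UnsupportedCharsetException, which extend it.
            return Optional.empty();
        }
    }

    // Prefer the meta tag's charset; fall back to the response header's
    // charset only when the meta tag gives nothing usable.
    static Optional<Charset> choose(String metaCharset, String headerCharset) {
        Optional<Charset> meta = lookup(metaCharset);
        return meta.isPresent() ? meta : lookup(headerCharset);
    }

    public static void main(String[] args) {
        System.out.println(choose("utf-8", "iso-8859-1"));
        System.out.println(choose(null, "iso-8859-1"));
    }
}
```

The ordering in choose() encodes the assumption that the meta tag is the more 
reliable source; swapping the two arguments gives the opposite policy.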

> Tika currently misuses the HTTP Content-Encoding header, and does not seem to 
> use the charset part of the Content-Type header properly.
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-431
>                 URL: https://issues.apache.org/jira/browse/TIKA-431
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>            Reporter: Erik Hetzner
>            Assignee: Ken Krugler
>         Attachments: TIKA-431.patch
>
>
> Tika currently misuses the HTTP Content-Encoding header, and does not seem to 
> use the charset part of the Content-Type header properly.
> Content-Encoding is not for the charset. It is for values like gzip, deflate, 
> compress, or identity.
> Charset is passed in with the Content-Type. For instance:
> text/html; charset=iso-8859-1
> Tika should, in my opinion, do the following:
> 1. Stop using Content-Encoding, unless it wants me to be able to pass in 
> gzipped content in an input stream.
> 2. Parse and understand charset=... declarations if passed in the Metadata 
> object
> 3. Return charset=... declarations in the Metadata object if a charset is 
> detected.
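
Points 2 and 3 above can be sketched roughly as follows. This is an 
illustrative helper under stated assumptions, not Tika's implementation: the 
class ContentTypeCharset and its methods are hypothetical, and the parsing is 
deliberately naive (it splits on ';' and does not handle a ';' inside a quoted 
parameter value).

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of Tika's API.
final class ContentTypeCharset {
    private ContentTypeCharset() {}

    // Extract the charset parameter from a Content-Type value such as
    // "text/html; charset=iso-8859-1". Returns empty if there is none.
    static Optional<String> parse(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; the remaining parts are parameters.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            // Parameter names are case-insensitive.
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            // Strip one pair of surrounding double quotes, if present.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            if (!value.isEmpty()) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    // Build a Content-Type value that reports a detected charset.
    static String withCharset(String mediaType, String charset) {
        return mediaType + "; charset=" + charset;
    }

    public static void main(String[] args) {
        System.out.println(parse("text/html; charset=iso-8859-1").orElse("(none)"));
        System.out.println(withCharset("text/html", "UTF-8"));
    }
}
```

Note that Content-Encoding plays no part here: values such as gzip or deflate 
describe how the body was compressed, not which charset it uses.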

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
