[jira] [Commented] (TIKA-2671) HtmlEncodingDetector doesnt take provided metadata into account

Ken Krugler (JIRA) Mon, 18 Jun 2018 21:28:27 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16516644#comment-16516644
 ]


Ken Krugler commented on TIKA-2671:
-----------------------------------

Hi [~gbouchar] - I'm curious how much testing you did, and with what web 
browsers. Asking because I know (for sure) that browsers used to not trust the 
response headers due to commonly misconfigured web servers. But I just scanned 
the current Firefox source, and didn't find anything that looked like they were 
trying to be extra clever.

In any case, this isn't a discussion about trying to outsmart browsers, but 
rather about what the appropriate heuristic for detection is. I'm fine with 
following what the spec recommends, but note that there are additional issues, 
such as conforming to section 12.2.3, which says "A leading Byte Order Mark 
(BOM) causes the character encoding argument to be ignored and will itself be 
skipped."
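
To make the BOM point concrete, here is a minimal Java sketch of that 
precedence: a leading BOM wins, and only when there is none is the charset 
supplied by the transport layer used. The names BomFirstSketch, sniffBom and 
choose are made up for illustration and are not Tika APIs. The sketch covers 
only the UTF-8 and UTF-16 BOMs and does not skip the BOM bytes afterwards.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Hypothetical helper, not part of Tika: a BOM overrides the charset argument. */
final class BomFirstSketch {

    /** Returns the encoding indicated by a leading BOM, or empty if there is none. */
    static Optional<Charset> sniffBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    }

    /**
     * Picks the BOM encoding when a BOM is present; otherwise falls back to the
     * transport-layer charset (which may be null, giving an empty result).
     */
    static Optional<Charset> choose(byte[] head, Charset transportCharset) {
        Optional<Charset> fromBom = sniffBom(head);
        return fromBom.isPresent() ? fromBom : Optional.ofNullable(transportCharset);
    }
}
```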

> HtmlEncodingDetector doesnt take provided metadata into account
> ---------------------------------------------------------------
>
>                 Key: TIKA-2671
>                 URL: https://issues.apache.org/jira/browse/TIKA-2671
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> org.apache.tika.parser.html.HtmlEncodingDetector ignores the document's 
> metadata. So when using it to detect the charset of an HTML document that 
> came with a conflicting charset specified at the transport layer level, the 
> encoding specified inside the file is used instead.
> This behavior does not conform to what is [specified by the W3C for 
> determining the character encoding of HTML 
> pages|https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding].
>  This causes bugs similar to NUTCH-2599.
>  
> If HtmlEncodingDetector is not meant to take into account meta-information 
> about the document, then maybe another detector should be provided, that 
> would be a CompositeDetector including, in that order:
>  * a new, simple, MetadataEncodingDetector, that would simply return the 
> encoding
>  * the existing HtmlEncodingDetector
>  * a generic detector, like UniversalEncodingDetector
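
The detector order proposed in the issue description above could be sketched 
roughly as follows. This is plain Java with no Tika dependency: CharsetGuesser 
and DetectorChainSketch are hypothetical names, the regex-based meta scan is 
only a crude stand-in for HtmlEncodingDetector, and the fixed UTF-8 fallback 
is a stand-in for a statistical detector such as UniversalEncodingDetector.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical interface for this sketch; it is not Tika's EncodingDetector. */
interface CharsetGuesser {
    Optional<Charset> guess(byte[] head, String contentType);
}

final class DetectorChainSketch {
    // charset parameter of a Content-Type header value, e.g. "text/html; charset=UTF-8"
    private static final Pattern HEADER_CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);
    // crude scan for a charset declared in a <meta> tag
    private static final Pattern META_CHARSET =
            Pattern.compile("<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Resolves the first captured charset label, ignoring unknown or malformed labels. */
    private static Optional<Charset> firstMatch(Pattern pattern, String text) {
        if (text == null) {
            return Optional.empty();
        }
        Matcher m = pattern.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }

    // 1. Simply return the encoding given in the provided metadata, if any.
    static final CharsetGuesser FROM_METADATA =
            (head, contentType) -> firstMatch(HEADER_CHARSET, contentType);

    // 2. Look inside the document (stand-in for the existing HTML detector).
    static final CharsetGuesser FROM_HTML_META =
            (head, contentType) ->
                    firstMatch(META_CHARSET, new String(head, StandardCharsets.ISO_8859_1));

    // 3. Generic last resort (assumption: a fixed default instead of real detection).
    static final CharsetGuesser FALLBACK =
            (head, contentType) -> Optional.of(StandardCharsets.UTF_8);

    static final List<CharsetGuesser> PROPOSED_ORDER =
            Arrays.asList(FROM_METADATA, FROM_HTML_META, FALLBACK);

    /** Returns the first charset produced by the chain, trying guessers in list order. */
    static Charset detect(List<CharsetGuesser> chain, byte[] head, String contentType) {
        for (CharsetGuesser guesser : chain) {
            Optional<Charset> guess = guesser.guess(head, contentType);
            if (guess.isPresent()) {
                return guess.get();
            }
        }
        throw new IllegalStateException("no guesser produced a charset");
    }
}
```

With this ordering, a charset given in the Content-Type value wins over a 
conflicting one declared in a meta tag, which is the behavior the issue asks 
for. It does not include the BOM check, which would have to come first.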



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
