[ https://issues.apache.org/jira/browse/TIKA-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16514355#comment-16514355 ]

Ken Krugler commented on TIKA-2671:
-----------------------------------

Unfortunately there's no great solution here. Ideally we'd have a content-based 
detector that would be properly influenced by the HTTP response header & HTML 
meta tags, but that's not what we have today. The HTML Standard's approach fails 
when web servers lie (return incorrect HTTP response headers), which 
unfortunately is (or was) very common... maybe the situation has improved since 
I last had to deal with this extensively, 5-7 years ago.

Though what I think [~gbouchar] is referring to here is step #6 in the HTML 
Standard, "Otherwise, if the user agent has information on the likely encoding 
for this page, ... then return that encoding, with the 
[confidence|https://html.spec.whatwg.org/multipage/parsing.html#concept-encoding-confidence] 
_tentative_."
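
For concreteness, a minimal sketch of what a "step #6" detector could look like 
against Tika's EncodingDetector interface is below. The class name is made up, 
and the assumption that the hint arrives as the charset parameter of the 
Content-Type metadata is mine; note too that Tika's API has no equivalent of the 
spec's tentative/certain confidence, so a hint returned this way ends up being 
trusted outright.

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

/**
 * Hypothetical detector for step #6: if the caller has already told us the
 * likely encoding via the Metadata object, return it; otherwise return null
 * so later detectors in a composite get a chance.
 */
public class MetadataHintEncodingDetector implements EncodingDetector {
    @Override
    public Charset detect(InputStream input, Metadata metadata) throws IOException {
        String contentType = metadata.get(Metadata.CONTENT_TYPE);
        if (contentType == null) {
            return null;                      // no hint provided
        }
        MediaType type = MediaType.parse(contentType);
        if (type == null) {
            return null;                      // unparseable Content-Type
        }
        String charset = type.getParameters().get("charset");
        try {
            return charset == null ? null : Charset.forName(charset);
        } catch (IllegalArgumentException e) {
            return null;                      // unknown or malformed charset name
        }
    }
}
{code}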

So based on the HTML standard, the ordering for charset determination is...
 # HTTP response headers (step #3)
 # HTML meta element attributes (step #4)
 # Tika Metadata hint (step #6)
 # Analyze bytes (step #7)

I don't think we should do the extra work to support nested contexts (iframes), 
which is the HTML Standard's step #5.
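
For what it's worth, one way to express that ordering with Tika's detector 
plumbing would be a CompositeEncodingDetector along the lines below. 
HtmlEncodingDetector and UniversalEncodingDetector already exist (steps #4 and 
#7); the other two names are hypothetical detectors that would pull the 
transport-level charset and the tentative hint out of the Metadata object. Since 
both of those necessarily reach Tika through Metadata keys, keeping them 
distinct is really the open question in this issue.

{code:java}
import java.util.Arrays;

import org.apache.tika.detect.CompositeEncodingDetector;
import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.parser.html.HtmlEncodingDetector;
import org.apache.tika.parser.txt.UniversalEncodingDetector;

// Sketch only: HttpHeaderEncodingDetector and MetadataHintEncodingDetector are
// hypothetical; the composite returns the first non-null answer, so list order
// is the precedence order.
EncodingDetector detector = new CompositeEncodingDetector(Arrays.asList(
        new HttpHeaderEncodingDetector(),    // step #3: HTTP response header charset
        new HtmlEncodingDetector(),          // step #4: <meta> charset scan (existing)
        new MetadataHintEncodingDetector(),  // step #6: tentative caller-supplied hint
        new UniversalEncodingDetector()));   // step #7: analyze the bytes (existing)
{code}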

> HtmlEncodingDetector doesnt take provided metadata into account
> ---------------------------------------------------------------
>
>                 Key: TIKA-2671
>                 URL: https://issues.apache.org/jira/browse/TIKA-2671
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> org.apache.tika.parser.html.HtmlEncodingDetector ignores the document's 
> metadata. So when using it to detect the charset of an HTML document that 
> came with a conflicting charset specified at the transport layer, the 
> encoding specified inside the file is used instead.
> This behavior does not conform to the algorithm [specified by the HTML 
> Standard for determining the character encoding of HTML 
> pages|https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding].
> This causes bugs similar to NUTCH-2599.
>  
> If HtmlEncodingDetector is not meant to take meta-information about the 
> document into account, then maybe another detector should be provided: a 
> CompositeDetector including, in that order:
>  * a new, simple MetadataEncodingDetector that would simply return the 
> encoding specified in the provided metadata
>  * the existing HtmlEncodingDetector
>  * a generic detector, like UniversalEncodingDetector


