[ https://issues.apache.org/jira/browse/NUTCH-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gerard Bouchar updated NUTCH-2599:
----------------------------------
    Description:
Here is an example page that is displayed correctly in web browsers, but is decoded with the wrong charset in Nutch:

[https://gerardbouchar.github.io/html-encoding-example/index.html]

This page's contents are encoded in UTF-8 and it is served with HTTP headers indicating that it is in UTF-8, but it contains a bogus HTML meta tag indicating that it is encoded in ISO-8859-1.

This is a tricky case, but there is a [W3C specification about how to handle it|https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding]. It clearly states that the HTTP header (transport layer information) should have precedence over the HTML meta tag (obtained in [byte stream prescanning|https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding]). Browsers do respect the spec, but the Tika parser doesn't.

Looking at the source code, it looks like the charset information is not even extracted from the HTTP headers.

{code:java}
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8

<!doctype html>
<html>
<head>
<meta charset="iso-8859-1">
</head>
<body>
<a href="/">français</a>
</body>
</html>
{code}

  was:
Here is an example page that is displayed correctly in web browsers, but is decoded with the wrong charset in Nutch:

[https://gerardbouchar.github.io/html-encoding-example/index.html]

This page's contents are encoded in UTF-8 and it is served with HTTP headers indicating that it is in UTF-8, but it contains a bogus HTML meta tag indicating that it is encoded in ISO-8859-1.

This is a tricky case, but there is a [W3C specification about how to handle it|https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding].
It clearly states that the HTTP header (transport layer information) should have precedence over the HTML meta tag (obtained in [byte stream prescanning|https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding]). Browsers do respect the spec, but the Tika parser doesn't.

{code:java}
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8

<!doctype html>
<html>
<head>
<meta charset="iso-8859-1">
</head>
<body>
<a href="/">français</a>
</body>
</html>
{code}


> charset detection issue with parse-tika
> ---------------------------------------
>
>                 Key: NUTCH-2599
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2599
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>         Environment: {code:java}
> plugin.includes: protocol-http|parse-tika{code}
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> Here is an example page that is displayed correctly in web browsers, but is
> decoded with the wrong charset in Nutch:
> [https://gerardbouchar.github.io/html-encoding-example/index.html]
>
> This page's contents are encoded in UTF-8 and it is served with HTTP headers
> indicating that it is in UTF-8, but it contains a bogus HTML meta tag
> indicating that it is encoded in ISO-8859-1.
>
> This is a tricky case, but there is a [W3C specification about how to handle
> it|https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding].
> It clearly states that the HTTP header (transport layer information) should
> have precedence over the HTML meta tag (obtained in [byte stream
> prescanning|https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding]).
> Browsers do respect the spec, but the Tika parser doesn't.
>
> Looking at the source code, it looks like the charset information is not even
> extracted from the HTTP headers.
>
> {code:java}
> HTTP/1.1 200 OK
> Content-Type: text/html; charset=utf-8
>
> <!doctype html>
> <html>
> <head>
> <meta charset="iso-8859-1">
> </head>
> <body>
> <a href="/">français</a>
> </body>
> </html>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
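
To make the expected precedence concrete, here is a minimal sketch of the ordering described in the issue: use the charset from the HTTP Content-Type header when there is one, and only fall back to the meta tag otherwise. It is not Nutch or Tika code. The class name CharsetPrecedence, its methods, the 1024-byte scan window and the UTF-8 fallback are all assumptions made for illustration, and the regex is only a rough stand-in for the prescan algorithm in the spec. BOM sniffing, which the spec places ahead of the transport layer, is left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: hypothetical helper, not part of Nutch or Tika. */
class CharsetPrecedence {

    // charset parameter of a Content-Type value, e.g. "text/html; charset=utf-8"
    private static final Pattern HEADER_CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Rough approximation of the prescan: first <meta ... charset=...> in the head
    private static final Pattern META_CHARSET =
        Pattern.compile("<meta[^>]*?charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)",
                        Pattern.CASE_INSENSITIVE);

    // Resolve a charset label, treating unknown or malformed labels as absent
    static Optional<Charset> lookup(String label) {
        try {
            return Optional.of(Charset.forName(label));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }

    // Transport layer information: the Content-Type header, if any
    static Optional<Charset> fromHeader(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        Matcher m = HEADER_CHARSET.matcher(contentType);
        return m.find() ? lookup(m.group(1)) : Optional.empty();
    }

    // Meta tag declaration, searched in the first 1024 bytes (assumed window)
    static Optional<Charset> fromMeta(byte[] body) {
        int len = Math.min(body.length, 1024);
        // ISO-8859-1 maps every byte to a char, so ASCII markup survives intact
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? lookup(m.group(1)) : Optional.empty();
    }

    // The header wins; the meta tag is consulted only when the header is silent
    static Charset detect(String contentType, byte[] body) {
        Optional<Charset> header = fromHeader(contentType);
        if (header.isPresent()) {
            return header.get();
        }
        return fromMeta(body).orElse(StandardCharsets.UTF_8); // assumed fallback
    }

    public static void main(String[] args) {
        byte[] body = ("<!doctype html><html><head><meta charset=\"iso-8859-1\"></head>"
            + "<body><a href=\"/\">fran\u00e7ais</a></body></html>")
            .getBytes(StandardCharsets.UTF_8);
        // With the header from the example page, the bogus meta tag is ignored
        System.out.println(detect("text/html; charset=utf-8", body).name());
        // Without a header, the meta tag is the only declaration left
        System.out.println(detect(null, body).name());
    }
}
```

With the example page from the description, detect("text/html; charset=utf-8", body) yields UTF-8, so the link text decodes correctly. Dropping the header makes the same bytes fall back to the meta tag's ISO-8859-1, which matches the wrong decoding reported here.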