[ https://issues.apache.org/jira/browse/NUTCH-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938867#comment-13938867 ]
lufeng commented on NUTCH-1733:
-------------------------------

+1, all tests pass.

> parse-html to support HTML5 charset definitions
> -----------------------------------------------
>
>                 Key: NUTCH-1733
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1733
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.8, 2.2.1
>            Reporter: Sebastian Nagel
>             Fix For: 2.3, 1.9
>
>         Attachments: NUTCH-1733-trunk.patch, charset_bom_html5.html, charset_html5.html
>
>
> HTML 5 allows the character encoding of a page to be specified via
> * {{<meta charset="...">}}
> * Unicode Byte Order Mark (BOM)
> These are allowed in addition to the previous HTTP/http-equiv Content-Type, see
> [[1|http://www.w3.org/TR/2011/WD-html5-diff-20110405/#character-encoding]].
> Parse-html ignores both meta charset and BOM and falls back to the default
> encoding (cp1252). Parse-tika sets the encoding appropriately.
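
For illustration only, the two HTML5 mechanisms named in the issue could be detected roughly as follows. This is a minimal sketch and not the code in NUTCH-1733-trunk.patch: the class name CharsetSniffer, the method sniff, the 1024-byte scan limit, and the regular expression are all assumptions made for this example, and the regex is a simplification of the prescan algorithm in the HTML specification (it also matches the older http-equiv form, since that contains "charset=" as well).

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: guesses a page's charset from a BOM
// or a <meta ... charset=...> declaration, else returns the caller's fallback.
public class CharsetSniffer {

    // Assumed limit on how many leading bytes are searched for a meta tag.
    private static final int SCAN_LIMIT = 1024;

    // Simplified pattern: a meta tag that contains charset=<name> somewhere
    // before its closing '>'. Covers <meta charset="..."> and http-equiv.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\s[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_:.\\-]+)",
            Pattern.CASE_INSENSITIVE);

    public static String sniff(byte[] content, String fallback) {
        // 1. A Unicode Byte Order Mark takes precedence in this sketch.
        if (content.length >= 3
                && (content[0] & 0xFF) == 0xEF
                && (content[1] & 0xFF) == 0xBB
                && (content[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (content.length >= 2) {
            int b0 = content[0] & 0xFF;
            int b1 = content[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return "UTF-16BE";
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return "UTF-16LE";
            }
        }

        // 2. Otherwise look for a charset declaration in the document head.
        //    ISO-8859-1 maps every byte to one char, so decoding cannot fail.
        int len = Math.min(content.length, SCAN_LIMIT);
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            return m.group(1);
        }

        // 3. Nothing found: use the caller's default (cp1252 in the issue).
        return fallback;
    }

    public static void main(String[] args) {
        byte[] html = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(html, "windows-1252"));
    }
}
```

The fallback is passed in by the caller rather than hard-coded, so the default encoding mentioned in the issue (cp1252, named windows-1252 in Java) remains a configuration choice.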