[ https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann updated TIKA-539:
-----------------------------------

- push to 1.3

> Encoding detection is too biased by encoding in meta tag
> --------------------------------------------------------
>
>                 Key: TIKA-539
>                 URL: https://issues.apache.org/jira/browse/TIKA-539
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, parser
>    Affects Versions: 0.8, 0.9, 0.10
>            Reporter: Reinhard Schwab
>            Assignee: Ken Krugler
>             Fix For: 1.3
>
>         Attachments: TIKA-539.patch, TIKA-539_2.patch
>
>
> If the encoding declared in the meta tag is wrong, that encoding is still
> detected, even if the correct encoding was already set in the metadata
> beforehand (which can come from the HTTP response header).
>
> Test code to reproduce:
>
>     static String content = "<html><head>\n"
>             + "<meta http-equiv=\"content-type\" content=\"application/xhtml+xml; charset=iso-8859-1\" />"
>             + "</head><body>Über den Wolken\n</body></html>";
>
>     /**
>      * @param args
>      * @throws IOException
>      * @throws TikaException
>      * @throws SAXException
>      */
>     public static void main(String[] args) throws IOException, SAXException,
>             TikaException {
>         Metadata metadata = new Metadata();
>         metadata.set(Metadata.CONTENT_TYPE, "text/html");
>         metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
>         System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>         InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
>         AutoDetectParser parser = new AutoDetectParser();
>         BodyContentHandler h = new BodyContentHandler(10000);
>         parser.parse(in, h, metadata, new ParseContext());
>         System.out.print(h.toString());
>         System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>     }
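
To illustrate why the wrong meta charset matters in the report above, here is a minimal sketch that uses only the Java standard library and no Tika code. It takes the same body text, encodes it as UTF-8, and decodes it once as UTF-8 and once as ISO-8859-1, the charset named in the meta tag. The class name `CharsetMismatchDemo` and the helper `decodeAs` are hypothetical names chosen for this illustration; they are not part of Tika or of the attached patches.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Encode the text as UTF-8, then decode those bytes with the assumed
    // charset, as a parser would if it trusted that charset label.
    public static String decodeAs(String text, Charset assumed) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, assumed);
    }

    public static void main(String[] args) {
        // "\u00DC" is the letter U with umlaut, written as an escape so the
        // example does not depend on the source file encoding.
        String original = "\u00DCber den Wolken";

        String decodedAsUtf8 = decodeAs(original, StandardCharsets.UTF_8);
        String decodedAsLatin1 = decodeAs(original, StandardCharsets.ISO_8859_1);

        // The UTF-8 round trip preserves the text.
        System.out.println(decodedAsUtf8.equals(original));
        // Decoding as ISO-8859-1 does not: the two UTF-8 bytes of the first
        // letter become two separate characters.
        System.out.println(decodedAsLatin1.equals(original));
        System.out.println(decodedAsLatin1.length() - original.length());
    }
}
```

If a parser trusts the meta tag over the UTF-8 value already present in the metadata, the extracted body text would presumably be garbled in the same way as the ISO-8859-1 result here.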