[ https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ken Krugler reassigned TIKA-539:
--------------------------------

    Assignee: Ken Krugler

> Encoding detection is too biased by encoding in meta tag
> --------------------------------------------------------
>
>                 Key: TIKA-539
>                 URL: https://issues.apache.org/jira/browse/TIKA-539
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Reinhard Schwab
>            Assignee: Ken Krugler
>             Fix For: 0.8
>
>         Attachments: TIKA-539.patch
>
>
> If the encoding in the meta tag is wrong, that encoding is still detected,
> even if the right encoding was set in the metadata beforehand (which can come
> from the HTTP response header).
> Test code to reproduce:
>     static String content = "<html><head>\n"
>             + "<meta http-equiv=\"content-type\" content=\"application/xhtml+xml; charset=iso-8859-1\" />"
>             + "</head><body>Über den Wolken\n</body></html>";
>     /**
>      * @param args
>      * @throws IOException
>      * @throws TikaException
>      * @throws SAXException
>      */
>     public static void main(String[] args)
>             throws IOException, SAXException, TikaException {
>         Metadata metadata = new Metadata();
>         metadata.set(Metadata.CONTENT_TYPE, "text/html");
>         metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
>         System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>         InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
>         AutoDetectParser parser = new AutoDetectParser();
>         BodyContentHandler h = new BodyContentHandler(10000);
>         parser.parse(in, h, metadata, new ParseContext());
>         System.out.print(h.toString());
>         System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>     }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
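
The symptom in the report above (the meta tag's iso-8859-1 winning over the UTF-8 set in the metadata) can be illustrated without Tika. What follows is a minimal sketch using only the JDK, not Tika's detection logic or the attached patch; the class name CharsetMismatchDemo and the helper roundTrip are hypothetical names introduced here for illustration. It shows what happens to the test string when UTF-8 bytes are decoded as ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tika.
public class CharsetMismatchDemo {

    // Encode the text with one charset, then decode the bytes with another.
    public static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        // "Über den Wolken", written with an escape so the source file
        // encoding does not matter.
        String text = "\u00DCber den Wolken";

        // Decoding with the charset the bytes were written in is lossless.
        String ok = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        // Decoding the same bytes as ISO-8859-1 turns the two UTF-8 bytes of
        // U+00DC (0xC3 0x9C) into two separate characters, U+00C3 and U+009C.
        String garbled = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(ok.equals(text));      // true
        System.out.println(garbled.equals(text)); // false
        System.out.println(text.length() + " vs " + garbled.length()); // 15 vs 16
    }
}
```

If the parser trusts the charset declared in the meta tag over the bytes actually delivered, the extracted body text would be expected to show this kind of corruption.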