[ https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925288#action_12925288 ]
Reinhard Schwab commented on TIKA-539:
--------------------------------------

Hi Ken,

In other words: it trusts the server iff (if and only if) the data provided by the server is not contradictory, that is, if the metadata and the meta tags do not contradict each other.

The rule logic is:

  no encoding in metadata && encoding in meta tags --> return encoding in meta tags
  no encoding in meta tags && encoding in metadata --> return encoding in metadata
  encoding in meta tags && encoding in metadata && encodings are equal --> return this encoding
  otherwise --> return encoding from charset detection

The rule logic now is:

  encoding in meta tags --> return this encoding
  encoding in metadata --> guide charset detector
  return encoding from charset detection

There is no comparison or matching of meta tags against metadata.

> Encoding detection is too biased by encoding in meta tag
> --------------------------------------------------------
>
>                 Key: TIKA-539
>                 URL: https://issues.apache.org/jira/browse/TIKA-539
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Reinhard Schwab
>            Assignee: Ken Krugler
>             Fix For: 0.8
>
>         Attachments: TIKA-539.patch, TIKA-539_2.patch
>
>
> If the encoding in the meta tag is wrong, that encoding is detected anyway,
> even if the correct encoding was set in the metadata beforehand (which can
> come from the HTTP response header).
> Test code to reproduce:
>
>     static String content = "<html><head>\n"
>             + "<meta http-equiv=\"content-type\" content=\"application/xhtml+xml; charset=iso-8859-1\" />"
>             + "</head><body>Über den Wolken\n</body></html>";
>
>     /**
>      * @param args
>      * @throws IOException
>      * @throws TikaException
>      * @throws SAXException
>      */
>     public static void main(String[] args) throws IOException, SAXException,
>             TikaException {
>         Metadata metadata = new Metadata();
>         metadata.set(Metadata.CONTENT_TYPE, "text/html");
>         metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
>         System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>         InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
>         AutoDetectParser parser = new AutoDetectParser();
>         BodyContentHandler h = new BodyContentHandler(10000);
>         parser.parse(in, h, metadata, new ParseContext());
>         System.out.print(h.toString());
>         System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>     }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
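
The two rule sets described in the comment can be put side by side in a small, self-contained sketch. This is only an illustration of the decision logic as worded in the comment, not the code in either attached patch or in Tika itself. The class name EncodingRules, the method names proposed and current, and the detector callback (which receives an optional hint) are all hypothetical. The case-insensitive comparison of encoding names and the choice to pass no hint in the fallback of the first rule set are assumptions as well.

```java
import java.util.function.Function;

/** Hypothetical illustration of the two rule sets; not Tika's implementation. */
class EncodingRules {

    /** Treats null or blank as "no encoding declared". */
    static boolean isSet(String encoding) {
        return encoding != null && !encoding.trim().isEmpty();
    }

    /**
     * First rule set from the comment: trust a declared encoding only if
     * metadata and meta tag do not contradict each other.
     * The detector takes an optional hint (null means "no hint").
     */
    static String proposed(String metadataEnc, String metaTagEnc,
                           Function<String, String> detector) {
        if (!isSet(metadataEnc) && isSet(metaTagEnc)) {
            return metaTagEnc;
        }
        if (!isSet(metaTagEnc) && isSet(metadataEnc)) {
            return metadataEnc;
        }
        // Assumption: encoding names are compared case-insensitively.
        if (isSet(metaTagEnc) && isSet(metadataEnc)
                && metaTagEnc.equalsIgnoreCase(metadataEnc)) {
            return metaTagEnc;
        }
        // Neither declared, or the two contradict each other.
        // Assumption: no hint is passed to the detector here.
        return detector.apply(null);
    }

    /**
     * Second rule set from the comment ("the rule logic now"): the meta tag
     * always wins; metadata only guides the charset detector.
     */
    static String current(String metadataEnc, String metaTagEnc,
                          Function<String, String> detector) {
        if (isSet(metaTagEnc)) {
            return metaTagEnc;
        }
        return detector.apply(isSet(metadataEnc) ? metadataEnc : null);
    }

    public static void main(String[] args) {
        // Stub detector standing in for real charset detection (hypothetical):
        // it follows the hint if one is given and otherwise answers "UTF-8".
        Function<String, String> detector = hint -> hint != null ? hint : "UTF-8";

        // The case from the issue: metadata says UTF-8, meta tag says iso-8859-1.
        System.out.println("proposed: " + proposed("UTF-8", "iso-8859-1", detector));
        System.out.println("current:  " + current("UTF-8", "iso-8859-1", detector));
    }
}
```

With the stub detector above, the conflicting case from the issue falls through to detection under the first rule set, while the second rule set returns the meta tag's iso-8859-1 without consulting the metadata at all.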