[ 
https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925288#action_12925288
 ] 

Reinhard Schwab commented on TIKA-539:
--------------------------------------

hi ken,

in other words:
it trusts the server iff (if and only if) the data provided by the server is not
contradictory,
i.e. if the metadata and the meta tags do not contradict each other.

the rule logic is:
 
no encoding in metadata && encoding in meta tags --> return encoding in meta tags

no encoding in meta tags && encoding in metadata --> return encoding in metadata

encoding in meta tags && encoding in metadata && encodings are equal --> return this encoding

otherwise --> return encoding from charset detection
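
A minimal sketch of the rules above, for illustration only. The class name, the method name, and the `detector` callback are hypothetical and are not taken from the Tika code base or from the attached patches. Comparing charset names case-insensitively is also an assumption.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of Tika: trusts a declared encoding only
// when metadata and meta tags do not contradict each other.
final class ConsistentEncodingRule {

    static String chooseEncoding(String metadataEncoding,
                                 String metaTagEncoding,
                                 Supplier<String> detector) {
        boolean hasMetadata = metadataEncoding != null && !metadataEncoding.isEmpty();
        boolean hasMetaTag = metaTagEncoding != null && !metaTagEncoding.isEmpty();

        // no encoding in metadata && encoding in meta tags
        if (!hasMetadata && hasMetaTag) {
            return metaTagEncoding;
        }
        // no encoding in meta tags && encoding in metadata
        if (!hasMetaTag && hasMetadata) {
            return metadataEncoding;
        }
        // both present and equal (case-insensitive comparison is an assumption)
        if (hasMetadata && hasMetaTag
                && metadataEncoding.equalsIgnoreCase(metaTagEncoding)) {
            return metadataEncoding;
        }
        // contradictory or missing declarations: fall back to charset detection
        return detector.get();
    }
}
```

With the values from the test code in the issue (metadata `UTF-8`, meta tag `iso-8859-1`), the two declarations contradict each other, so this sketch falls through to the detector.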


the rule logic now is:

encoding in meta tags --> return this encoding

encoding in metadata --> guide charset detector

otherwise --> return encoding from charset detection

there is no comparison or matching of meta tags against metadata.
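
The current behaviour described above could be sketched roughly as follows. This is an illustration under assumptions: the class name, the method name, and the `detectWithHint` callback (which takes an optional declared encoding as a hint and returns the detected one) are hypothetical and do not reflect the actual Tika implementation.

```java
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of Tika: the meta tag always wins,
// and the metadata encoding is only used as a hint for detection.
final class MetaTagFirstRule {

    static String chooseEncoding(String metadataEncoding,
                                 String metaTagEncoding,
                                 UnaryOperator<String> detectWithHint) {
        // encoding in meta tags --> return this encoding, no cross-check
        if (metaTagEncoding != null && !metaTagEncoding.isEmpty()) {
            return metaTagEncoding;
        }
        // encoding in metadata --> guide the charset detector (null = no hint)
        String hint = (metadataEncoding != null && !metadataEncoding.isEmpty())
                ? metadataEncoding
                : null;
        return detectWithHint.apply(hint);
    }
}
```

For metadata `UTF-8` and meta tag `iso-8859-1`, this sketch returns `iso-8859-1` without consulting the metadata at all, which is the bias the issue title refers to.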

> Encoding detection is too biased by encoding in meta tag
> --------------------------------------------------------
>
>                 Key: TIKA-539
>                 URL: https://issues.apache.org/jira/browse/TIKA-539
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Reinhard Schwab
>            Assignee: Ken Krugler
>             Fix For: 0.8
>
>         Attachments: TIKA-539.patch, TIKA-539_2.patch
>
>
> if the encoding in the meta tag is wrong, this encoding is still detected,
> even if the right encoding was set in the metadata beforehand (which can come
> from the http response header).
> test code to reproduce:
> static String content = "<html><head>\n"
>                       + "<meta http-equiv=\"content-type\" "
>                       + "content=\"application/xhtml+xml; charset=iso-8859-1\" />"
>                       + "</head><body>Über den Wolken\n</body></html>";
>       /**
>        * @param args
>        * @throws IOException
>        * @throws TikaException
>        * @throws SAXException
>        */
>       public static void main(String[] args) throws IOException, SAXException,
>                       TikaException {
>               Metadata metadata = new Metadata();
>               metadata.set(Metadata.CONTENT_TYPE, "text/html");
>               metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
>               System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>               InputStream in = new ByteArrayInputStream(
>                               content.getBytes("UTF-8"));
>               AutoDetectParser parser = new AutoDetectParser();
>               BodyContentHandler h = new BodyContentHandler(10000);
>               parser.parse(in, h, metadata, new ParseContext());
>               System.out.print(h.toString());
>               System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>       }

