[jira] Issue Comment Edited: (TIKA-539) Encoding detection is too biased by encoding in meta tag

Andrzej Bialecki (JIRA) Mon, 15 Nov 2010 08:11:43 -0800

    [ https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932103#action_12932103 ]


Andrzej Bialecki  edited comment on TIKA-539 at 11/15/10 11:10 AM:
-------------------------------------------------------------------

I'm not sure what mechanism Tika uses. Nutch collects a list of clues, which 
come from protocol headers, HTML meta "sniffing" (there's a chicken-and-egg 
issue here, because the meta tag has to be read before the encoding is known, 
so it's at most a best-effort attempt) and from the ICU4J CharsetDetector, and 
then tries to pick the one with the greatest level of confidence. Clues from 
ICU are given a higher priority, so they win if they are present. After that 
the next candidate is the protocol header information, and finally the 
"sniffed" HTML meta encoding. The final encoding is also checked against 
Charset.isSupported and rejected if this returns false.

This algorithm is implemented in o.a.nutch.util.EncodingDetector, with the 
"sniffing" part implemented in parse-html/.../HtmlParser plugin.
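
The priority order described above can be sketched roughly as follows. This is 
only an illustration of the description in this comment, not the actual 
o.a.nutch.util.EncodingDetector code: the class ClueSketch, the methods 
pickEncoding and isSupported, and their parameters are hypothetical names, the 
confidence scoring is left out, and falling through to the next clue when a 
charset is unsupported is an assumption.

```java
import java.nio.charset.Charset;

// Hypothetical sketch; not the Nutch implementation.
class ClueSketch {

    // Returns the first usable clue in the order ICU detection, protocol
    // header, sniffed meta tag; falls back to defaultEncoding if none is usable.
    static String pickEncoding(String icuGuess, String headerCharset,
                               String metaCharset, String defaultEncoding) {
        String[] candidates = { icuGuess, headerCharset, metaCharset };
        for (String candidate : candidates) {
            if (candidate == null || candidate.isEmpty()) {
                continue; // this source produced no clue
            }
            if (isSupported(candidate)) {
                return candidate;
            }
        }
        return defaultEncoding;
    }

    // Charset.isSupported throws on syntactically illegal names,
    // so treat those as unsupported as well.
    static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The TIKA-539 situation: no ICU clue, header says UTF-8,
        // meta tag says iso-8859-1. The header clue is tried first.
        System.out.println(
            pickEncoding(null, "UTF-8", "iso-8859-1", "windows-1252"));
    }
}
```

With this ordering a wrong meta tag can only win when neither the detector nor 
the protocol header supplies a usable clue.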

  
> Encoding detection is too biased by encoding in meta tag
> --------------------------------------------------------
>
>                 Key: TIKA-539
>                 URL: https://issues.apache.org/jira/browse/TIKA-539
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, parser
>    Affects Versions: 0.8
>            Reporter: Reinhard Schwab
>            Assignee: Ken Krugler
>             Fix For: 0.9
>
>         Attachments: TIKA-539.patch, TIKA-539_2.patch
>
>
> If the encoding in the meta tag is wrong, that encoding is still detected,
> even if the correct encoding was already set in the metadata (for example
> from the HTTP response header).
> Test code to reproduce:
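> import java.io.ByteArrayInputStream;
> import java.io.IOException;
> import java.io.InputStream;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.sax.BodyContentHandler;
> import org.xml.sax.SAXException;
>
> // Wrapper class added so the snippet compiles; the name is arbitrary.
> public class Tika539Repro {
>       // The meta tag claims iso-8859-1, but the bytes fed to the parser are UTF-8.
>       static String content = "<html><head>\n"
>                       + "<meta http-equiv=\"content-type\" "
>                       + "content=\"application/xhtml+xml; charset=iso-8859-1\" />"
>                       + "</head><body>Über den Wolken\n</body></html>";
>
>       public static void main(String[] args) throws IOException, SAXException,
>                       TikaException {
>               Metadata metadata = new Metadata();
>               metadata.set(Metadata.CONTENT_TYPE, "text/html");
>               // Correct encoding, as it would arrive from an HTTP response header.
>               metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
>               System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>               InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
>               AutoDetectParser parser = new AutoDetectParser();
>               BodyContentHandler h = new BodyContentHandler(10000);
>               parser.parse(in, h, metadata, new ParseContext());
>               System.out.print(h.toString());
>               // Per this report, the meta tag's charset shows up here instead of UTF-8.
>               System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>       }
> }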

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
