[ 
https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928832#action_12928832
 ] 

Ken Krugler edited comment on TIKA-539 at 11/5/10 5:39 PM:
-----------------------------------------------------------

I've spent a bit more time on this, exploring why the wrong charset was being 
returned for an HTML page I've got.

Given what I know about the HTML5 spec and Reinhard's proposal above, I think 
the appropriate process is:

1. Extract encoding from content-type response header and meta tags. Each is 
optional.

2. Normalize (resolve aliases, etc.) each charset name that was found.

3. If all normalized charset names are in agreement, or are subsets (e.g. one 
is us-ascii, one is ISO-8859-1) then use the "highest" (most inclusive) 
encoding.

4. If no normalized charset names are found, or there is disagreement, use the 
statistical charset detection code.

A similar approach (excluding meta tags) should be used by the TXTParser.
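
The four steps above can be sketched roughly as follows. This is only an 
illustration, not code from Tika or from the attached patches: the class 
CharsetResolver, its methods, and the small table of US-ASCII supersets are 
all hypothetical names and assumptions. An empty result stands for "fall back 
to the statistical charset detection code".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper illustrating the proposed process; not part of Tika. */
public class CharsetResolver {

    // Assumption: encodings in which bytes 0x00-0x7F mean the same as in US-ASCII.
    // A real implementation would need a more complete table.
    private static final Set<Charset> ASCII_SUPERSETS = new HashSet<>(
            Arrays.asList(StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));

    /** Step 2: resolve aliases; empty if the name is missing, illegal, or unsupported. */
    public static Optional<Charset> normalize(String name) {
        if (name == null || name.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(name.trim()));
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and UnsupportedCharsetException.
            return Optional.empty();
        }
    }

    /** True if every byte sequence valid in 'inner' decodes identically in 'outer'. */
    private static boolean isSubset(Charset inner, Charset outer) {
        return inner.equals(outer)
                || (inner.equals(StandardCharsets.US_ASCII) && ASCII_SUPERSETS.contains(outer));
    }

    /**
     * Steps 1, 3 and 4: take the (optional) charset names from the content-type
     * header and the meta tag, and return the most inclusive one if they agree.
     * An empty result means the caller should use statistical detection.
     */
    public static Optional<Charset> resolve(String headerCharset, String metaCharset) {
        List<Charset> declared = new ArrayList<>();
        normalize(headerCharset).ifPresent(declared::add);
        normalize(metaCharset).ifPresent(declared::add);

        Charset best = null;
        for (Charset candidate : declared) {
            if (best == null || isSubset(best, candidate)) {
                best = candidate;            // candidate is at least as inclusive
            } else if (!isSubset(candidate, best)) {
                return Optional.empty();     // disagreement between declarations
            }
        }
        return Optional.ofNullable(best);    // empty when nothing was declared
    }
}
```

With this sketch, resolve("us-ascii", "iso-8859-1") yields ISO-8859-1, while 
resolve("UTF-8", "iso-8859-1") (the situation in Reinhard's test code) yields 
an empty result, so the detector would decide. For the TXTParser case, the 
second argument would simply be null.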



      was (Author: kkrugler):
    I've spent a bit more time on this, exploring why the wrong charset was 
being returned for an HTML page I've got.

Given what I know about the HTML5 spec and Reinhard's proposal above, I think 
the appropriate process is:

1. Extract encoding from response headers (content-type, content-encoding) and 
meta tags. Each is optional.

2. Normalize (resolve aliases, etc.) each charset name that was found.

3. If all normalized charset names are in agreement, or are subsets (e.g. one 
is us-ascii, one is ISO-8859-1) then use the "highest" (most inclusive) 
encoding.

4. If no normalized charset names are found, or there is disagreement, use the 
statistical charset detection code.

A similar approach (excluding meta tags) should be used by the TXTParser.


  
> Encoding detection is too biased by encoding in meta tag
> --------------------------------------------------------
>
>                 Key: TIKA-539
>                 URL: https://issues.apache.org/jira/browse/TIKA-539
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Reinhard Schwab
>            Assignee: Ken Krugler
>             Fix For: 0.9
>
>         Attachments: TIKA-539.patch, TIKA-539_2.patch
>
>
> If the encoding in the meta tag is wrong, that encoding is still detected,
> even if the right encoding was set in the metadata beforehand (which can come
> from the HTTP response header).
> Test code to reproduce:
> static String content = "<html><head>\n"
>                       + "<meta http-equiv=\"content-type\" "
>                       + "content=\"application/xhtml+xml; charset=iso-8859-1\" />"
>                       + "</head><body>Über den Wolken\n</body></html>";
>       /**
>        * @param args
>        * @throws IOException
>        * @throws TikaException
>        * @throws SAXException
>        */
>       public static void main(String[] args) throws IOException, SAXException,
>                       TikaException {
>               Metadata metadata = new Metadata();
>               metadata.set(Metadata.CONTENT_TYPE, "text/html");
>               metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
>               System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>               InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
>               AutoDetectParser parser = new AutoDetectParser();
>               BodyContentHandler h = new BodyContentHandler(10000);
>               parser.parse(in, h, metadata, new ParseContext());
>               System.out.print(h.toString());
>               System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>       }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
