[ 
https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reinhard Schwab updated TIKA-539:
---------------------------------

    Attachment: TIKA-539_2.patch

Please ignore my first version of the patch.
Encoding detection in the new patch works as follows:

If an encoding is detected in the meta tags, it is returned
when no encoding is defined in the metadata,
or when the encoding defined in the metadata is the same.

If no encoding is detected in the meta tags,
the encoding defined in the metadata is returned.

If no encoding is detected in the meta tags
and none is provided by the metadata,
the charset detector is used.
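
The rules above can be sketched roughly as follows. This is only an illustration, not the code in the patch: the class EncodingChoice, the method choose, and the Supplier standing in for the charset detector are all hypothetical names. The rules do not say what happens when the meta tag and the metadata disagree; the sketch assumes the charset detector is consulted in that case.

```java
import java.util.function.Supplier;

// Hypothetical helper that illustrates the decision order only; not Tika code.
class EncodingChoice {

    /**
     * @param metaTagEncoding  charset found in the HTML meta tag, or null
     * @param metadataEncoding charset supplied in the metadata, or null
     * @param detector         stand-in for the charset detector
     */
    static String choose(String metaTagEncoding, String metadataEncoding,
                         Supplier<String> detector) {
        if (metaTagEncoding != null) {
            // The meta tag wins if the metadata is silent or agrees with it.
            if (metadataEncoding == null
                    || metaTagEncoding.equalsIgnoreCase(metadataEncoding)) {
                return metaTagEncoding;
            }
            // ASSUMPTION: the conflict case is not described in the rules;
            // this sketch falls back to the detector.
            return detector.get();
        }
        if (metadataEncoding != null) {
            // No meta tag encoding: trust the metadata.
            return metadataEncoding;
        }
        // Neither source names an encoding: ask the detector.
        return detector.get();
    }

    public static void main(String[] args) {
        // Fixed answer standing in for a real detector.
        Supplier<String> detector = () -> "windows-1252";
        System.out.println(choose("iso-8859-1", null, detector));
        System.out.println(choose(null, "UTF-8", detector));
        System.out.println(choose(null, null, detector));
        System.out.println(choose("iso-8859-1", "UTF-8", detector));
    }
}
```

With the test case from the issue (meta tag says iso-8859-1, metadata says UTF-8), the sketch takes the conflict branch, so the wrong meta tag value is no longer returned unconditionally.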

> Encoding detection is too biased by encoding in meta tag
> --------------------------------------------------------
>
>                 Key: TIKA-539
>                 URL: https://issues.apache.org/jira/browse/TIKA-539
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Reinhard Schwab
>            Assignee: Ken Krugler
>             Fix For: 0.8
>
>         Attachments: TIKA-539.patch, TIKA-539_2.patch
>
>
> If the encoding in the meta tag is wrong, that encoding is still detected,
> even if the right encoding was already set in the metadata (for example,
> from the HTTP response header).
> Test code to reproduce:
> static String content = "<html><head>\n"
>                       + "<meta http-equiv=\"content-type\" content=\"application/xhtml+xml; charset=iso-8859-1\" />"
>                       + "</head><body>Über den Wolken\n</body></html>";
>       /**
>        * @param args
>        * @throws IOException
>        * @throws TikaException
>        * @throws SAXException
>        */
>       public static void main(String[] args) throws IOException, SAXException,
>                       TikaException {
>               Metadata metadata = new Metadata();
>               metadata.set(Metadata.CONTENT_TYPE, "text/html");
>               metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
>               System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>               InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
>               AutoDetectParser parser = new AutoDetectParser();
>               BodyContentHandler h = new BodyContentHandler(10000);
>               parser.parse(in, h, metadata, new ParseContext());
>               System.out.print(h.toString());
>               System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>       }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
