Encoding detection is too biased by encoding in meta tag
--------------------------------------------------------

                 Key: TIKA-539
                 URL: https://issues.apache.org/jira/browse/TIKA-539
             Project: Tika
          Issue Type: Bug
    Affects Versions: 0.8
            Reporter: Reinhard Schwab
             Fix For: 0.8


If the encoding declared in the meta tag is wrong, that encoding is still the
one detected, even when the correct encoding has already been set in the
metadata (for example, from the HTTP response header).
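
The expected precedence could be sketched as follows. This is an illustration
in plain Java only, not Tika's actual detection code: the class
EncodingPrecedence and its methods are hypothetical names, and the rule that
an encoding already present in the metadata wins over the meta tag is the
assumption behind this report.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of Tika: it only illustrates the precedence
// this report expects (metadata first, then meta tag, then a fallback).
public class EncodingPrecedence {

    /**
     * Returns the first usable charset name, preferring the encoding already
     * set in the metadata (e.g. from an HTTP header) over the meta tag.
     */
    public static String choose(String fromMetadata, String fromMetaTag,
                                String fallback) {
        if (isUsable(fromMetadata)) {
            return fromMetadata;
        }
        if (isUsable(fromMetaTag)) {
            return fromMetaTag;
        }
        return fallback;
    }

    /** True if the name is non-null, legal, and supported by this JVM. */
    public static boolean isUsable(String name) {
        if (name == null) {
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for syntactically illegal charset names.
            return false;
        }
    }

    public static void main(String[] args) {
        // Metadata says UTF-8, meta tag says iso-8859-1: metadata should win.
        System.out.println(choose("UTF-8", "iso-8859-1", "windows-1252"));
        // No metadata encoding: the meta tag is the next best hint.
        System.out.println(choose(null, "iso-8859-1", "windows-1252"));
    }
}
```

With the values from the test code in this report, choose would return UTF-8
rather than the iso-8859-1 declared in the meta tag.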

Test code to reproduce:

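import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// The class name is arbitrary; the original report showed only the members.
public class Tika539Repro {

        // The meta tag declares iso-8859-1, but the bytes below are UTF-8.
        static String content = "<html><head>\n"
                        + "<meta http-equiv=\"content-type\" "
                        + "content=\"application/xhtml+xml; charset=iso-8859-1\" />"
                        + "</head><body>Über den Wolken\n</body></html>";

        public static void main(String[] args) throws IOException, SAXException,
                        TikaException {
                Metadata metadata = new Metadata();
                metadata.set(Metadata.CONTENT_TYPE, "text/html");
                // Correct encoding, as it could arrive from an HTTP response header.
                metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
                System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
                InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
                AutoDetectParser parser = new AutoDetectParser();
                BodyContentHandler h = new BodyContentHandler(10000);
                parser.parse(in, h, metadata, new ParseContext());
                System.out.print(h.toString());
                // Encoding recorded after parsing; per this report it is the
                // meta tag's value rather than the UTF-8 set above.
                System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
        }
}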
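
The effect of picking the wrong encoding can be shown with the JDK alone. The
following is a small sketch that does not use Tika at all; the class name
MojibakeDemo and the helper decodeUtf8BytesAs are made up for illustration. It
encodes the body text as UTF-8, as the test code above does, and then decodes
the bytes once with each candidate charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, independent of Tika.
public class MojibakeDemo {

    /** Encodes text as UTF-8 and decodes the resulting bytes with cs. */
    public static String decodeUtf8BytesAs(String text, Charset cs) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, cs);
    }

    public static void main(String[] args) {
        // \u00DC is the capital U with diaeresis from the test content.
        String body = "\u00DCber den Wolken";

        String right = decodeUtf8BytesAs(body, StandardCharsets.UTF_8);
        String wrong = decodeUtf8BytesAs(body, StandardCharsets.ISO_8859_1);

        // The UTF-8 round trip restores the original text.
        System.out.println(right.equals(body));
        // Decoding as ISO-8859-1 turns the two UTF-8 bytes of the first
        // letter into two separate characters, so the text grows by one.
        System.out.println(body.length() + " -> " + wrong.length());
    }
}
```

The first character is stored as two bytes in UTF-8, so decoding with
ISO-8859-1 yields two characters in its place and the extracted text is
garbled.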

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
