Media type detection fails for HTML documents, results in text/plain instead of text/html
-----------------------------------------------------------------------------------------
Key: TIKA-772
URL: https://issues.apache.org/jira/browse/TIKA-772
Project: Tika
Issue Type: Bug
Components: mime
Affects Versions: 0.10
Reporter: Joseph Vychtrle

Hey,

I was testing media type detection on most of the major document types. When I test HTML documents of about 5000 words that start with

<?xml version="1.0" encoding="UTF-8"?>

and consist of a root "html" element containing only "p" elements, detection always returns text/plain instead of text/html.

{code:title=Bar.java|borderStyle=solid}
@Test
public void testMediaType() throws Exception {
    List<Document> allDocs = DocumentProvider.docsAsList();
    // Documents whose detected type differs from the expected one, mapped to the detected type
    Map<Document, String> failed = new HashMap<Document, String>();

    for (Document doc : allDocs) {
        Tika tika = new Tika();
        // Detection works from the stream content only; no file name is passed along
        String type = tika.detect(TikaInputStream.get(doc.getFile()));
        if (!doc.getMediaType().toString().equals(type)) {
            failed.put(doc, type);
        }
    }

    for (Document doc : failed.keySet()) {
        log.error("expected: " + doc.getMediaTypeString()
                + "; actual: " + failed.get(doc)
                + "; path to file: " + doc.getFile().getAbsolutePath());
    }

    assertTrue(failed.isEmpty(),
            "mime type was incorrectly detected for : " + failed.size() + " documents;");
}
{code}

Am I doing anything wrong?
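The test above depends on the reporter's own Document and DocumentProvider fixtures, which are not attached to this issue. To try the scenario without them, a file of the described shape could be generated with something like the sketch below. The class name HtmlSampleGenerator, the build method, the 50-words-per-paragraph split and the filler words are all hypothetical. They only mirror the description (XML declaration, root "html" element, "p" children, about 5000 words), so the output is not guaranteed to match the original files or to trigger the same detection result.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Tika or of the reporter's test suite.
public class HtmlSampleGenerator {

    static final String XML_DECLARATION = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
    // Assumption: the report does not say how the words are split into paragraphs.
    static final int WORDS_PER_PARAGRAPH = 50;

    /** Builds an XML-declared document with a root html element and only p children. */
    static String build(int wordCount) {
        StringBuilder sb = new StringBuilder(XML_DECLARATION).append('\n');
        sb.append("<html>\n");
        int written = 0;
        while (written < wordCount) {
            int n = Math.min(WORDS_PER_PARAGRAPH, wordCount - written);
            sb.append("<p>");
            for (int i = 0; i < n; i++) {
                if (i > 0) {
                    sb.append(' ');
                }
                // Filler token; the real documents contained ordinary text.
                sb.append("word").append(written + i);
            }
            sb.append("</p>\n");
            written += n;
        }
        sb.append("</html>\n");
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write a roughly 5000-word sample to a temporary file and print its path.
        Path out = Files.createTempFile("tika-772-", ".html");
        Files.write(out, build(5000).getBytes(StandardCharsets.UTF_8));
        System.out.println("wrote " + out);
    }
}
```

The generated file can then be passed to the same tika.detect(TikaInputStream.get(file)) call used in the test to see whether the reported text/plain result appears.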