Media type detection fails for HTML documents, results in text/plain instead of text/html
-----------------------------------------------------------------------------------------
Key: TIKA-772
URL: https://issues.apache.org/jira/browse/TIKA-772
Project: Tika
Issue Type: Bug
Components: mime
Affects Versions: 0.10
Reporter: Joseph Vychtrle
Hey, I was testing media type detection on most of the major document types.
When testing HTML documents of about 5000 words that start with:
<?xml version="1.0" encoding="UTF-8"?>
and consist of a root "html" element containing only "p" elements, detection
always returns text/plain instead of text/html.
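For anyone who wants to reproduce this without my files, a document of the shape described above can be generated with something like the sketch below. The class name SampleHtmlGenerator, the 50-words-per-paragraph split, the filler words and the ".html" temp file suffix are all assumptions made for illustration; none of this is part of Tika or of my test code. Whether such a generated file triggers the same result is untested.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for building a sample input; not part of Tika.
class SampleHtmlGenerator {

    // Builds a document that starts with an XML declaration, has a root
    // "html" element and only "p" children, with wordCount words in total.
    static String buildDocument(int wordCount) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<html>\n");
        int wordsPerParagraph = 50; // arbitrary choice for this sketch
        for (int i = 0; i < wordCount; i++) {
            if (i % wordsPerParagraph == 0) {
                sb.append("<p>");
            }
            sb.append("word").append(i);
            boolean endOfParagraph =
                    (i + 1) % wordsPerParagraph == 0 || i + 1 == wordCount;
            sb.append(endOfParagraph ? "</p>\n" : " ");
        }
        sb.append("</html>\n");
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write the sample to a temp file and print its path, so the file
        // can then be handed to the detection call used in the test.
        Path file = Files.createTempFile("tika-772-", ".html");
        Files.write(file, buildDocument(5000).getBytes(StandardCharsets.UTF_8));
        System.out.println(file.toAbsolutePath());
    }
}
```

The printed path can be opened with TikaInputStream.get(...) and passed to tika.detect(...) in the same way as in the test.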
{code:title=Bar.java|borderStyle=solid}
// Document and DocumentProvider are project-specific test helpers (not part of Tika).
@Test
public void testMediaType() throws Exception {
    List<Document> allDocs = DocumentProvider.docsAsList();
    // Documents whose detected type differs from the expected one
    Map<Document, String> failed = new HashMap<Document, String>();
    Tika tika = new Tika();
    for (Document doc : allDocs) {
        String type = tika.detect(TikaInputStream.get(doc.getFile()));
        if (!doc.getMediaType().toString().equals(type)) {
            failed.put(doc, type);
        }
    }
    for (Document doc : failed.keySet()) {
        log.error("expected: " + doc.getMediaTypeString() + "; actual: "
                + failed.get(doc) + "; path to file: " + doc.getFile().getAbsolutePath());
    }
    assertTrue(failed.isEmpty(), "mime type was incorrectly detected for: "
            + failed.size() + " documents;");
}
{code}
Am I doing anything wrong?