[ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144862#comment-13144862 ]
Jukka Zitting commented on TIKA-772:
------------------------------------

The metacharacters you mention do sound suspicious. Here's what the attached it.html looks like inside; no weird metacharacters here:

{noformat}
$ od -c it.html | head
0000000   <   ?   x   m   l       v   e   r   s   i   o   n   =   "   1
0000020   .   0   "       e   n   c   o   d   i   n   g   =   "   U   T
0000040   F   -   8   "   ?   >  \n   <   h   t   m   l   >   <   p   >
0000060   P   a   r   e   r   e       d   e   l       C   o   m   i   t
0000100   a   t   o       e   c   o   n   o   m   i   c   o       e
0000120   s   o   c   i   a   l   e       e   u   r   o   p   e   o
0000140   s   u   l       t   e   m   a       I   l       r   u   o   l
0000160   o       d   e   l   l   a       s   o   c   i   e   t 303 240
0000200       c   i   v   i   l   e       n   e   l   l   e       r   e
0000220   l   a   z   i   o   n   i       U   E   -   S   e   r   b   i
{noformat}

I still get "text/html" when running the test against this file.

> media type detection fails for html documents, results in text/plain instead of text/html
> -----------------------------------------------------------------------------------------
>
>                 Key: TIKA-772
>                 URL: https://issues.apache.org/jira/browse/TIKA-772
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 0.10
>            Reporter: Joseph Vychtrle
>            Assignee: Jukka Zitting
>              Labels: detection, media-type
>         Attachments: html.zip, it.html, tika.png
>
>
> Hey, I was testing media type detection on most of the major document types,
> but when testing HTML documents of about 5000 words that start with:
> <?xml version="1.0" encoding="UTF-8"?>
> and are composed of a root "html" element and "p" elements only, it always results in
> text/plain instead of text/html ...
> {code:title=Bar.java|borderStyle=solid}
> @Test
> public void testMediaType() throws Exception {
>     List<Document> allDocs = DocumentProvider.docsAsList();
>     Map<Document, String> failed = new HashMap<Document, String>();
>     for (Document doc : allDocs) {
>         Tika tika = new Tika();
>         String type = tika.detect(TikaInputStream.get(doc.getFile()));
>         if (!doc.getMediaType().toString().equals(type))
>             failed.put(doc, type);
>     }
>
>     for (Document doc : failed.keySet()) {
>         log.error("expected: " + doc.getMediaTypeString() + "; actual: " + failed.get(doc) + "; path to file: " + doc.getFile().getAbsolutePath());
>     }
>     assertTrue(failed.isEmpty(), "mime type was incorrectly detected for : " + failed.size() + " documents;");
> }
> {code}
> Am I doing anything wrong?
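
For anyone who wants to repeat the byte-level check from the comment without od at hand, here is a minimal sketch in plain Java. The class ByteDump and all of its methods are hypothetical helpers written for this illustration; they are not part of Tika and say nothing about how Tika's detector works. The sketch only prints the first bytes of a file in roughly the style of od -c and reports whether anything (for example a UTF-8 byte order mark) sits in front of the "<?xml" declaration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper (not part of Tika): inspects the first bytes of a file. */
final class ByteDump {

    /** Renders bytes roughly like od -c: printable ASCII as-is, \n \r \t escaped, the rest as 3-digit octal. */
    static String render(byte[] data) {
        StringBuilder out = new StringBuilder();
        for (byte b : data) {
            int v = b & 0xFF;
            if (out.length() > 0) out.append(' ');
            if (v == '\n') out.append("\\n");
            else if (v == '\r') out.append("\\r");
            else if (v == '\t') out.append("\\t");
            else if (v >= 0x20 && v < 0x7F) out.append((char) v);
            else out.append(String.format("%03o", v));
        }
        return out.toString();
    }

    /** True if the data begins with a UTF-8 byte order mark (EF BB BF). */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /** True if the data begins directly with "<?xml", with no bytes in front of it. */
    static boolean startsWithXmlDeclaration(byte[] data) {
        byte[] prefix = "<?xml".getBytes(StandardCharsets.US_ASCII);
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    /** Reads at most limit bytes from the start of the file. */
    static byte[] head(Path file, int limit) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[limit];
            int total = 0;
            int n;
            while (total < limit && (n = in.read(buf, total, limit - total)) != -1) {
                total += n;
            }
            return Arrays.copyOf(buf, total);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ByteDump <file>");
            return;
        }
        byte[] first = head(Paths.get(args[0]), 160); // 160 bytes = the ten od lines shown above
        System.out.println(render(first));
        System.out.println("UTF-8 BOM present: " + hasUtf8Bom(first));
        System.out.println("starts with <?xml: " + startsWithXmlDeclaration(first));
    }
}
```

Running it against each file in html.zip and comparing the output with the it.html dump above should show quickly whether the failing files differ in their leading bytes.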