[
https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144862#comment-13144862
]
Jukka Zitting commented on TIKA-772:
------------------------------------
The metacharacters you mention do sound suspicious. Here's what the attached
it.html looks like inside; no weird metacharacters here:
{noformat}
$ od -c it.html | head
0000000   <   ?   x   m   l       v   e   r   s   i   o   n   =   "   1
0000020   .   0   "       e   n   c   o   d   i   n   g   =   "   U   T
0000040   F   -   8   "   ?   >  \n   <   h   t   m   l   >   <   p   >
0000060   P   a   r   e   r   e       d   e   l       C   o   m   i   t
0000100   a   t   o       e   c   o   n   o   m   i   c   o       e
0000120   s   o   c   i   a   l   e       e   u   r   o   p   e   o
0000140   s   u   l       t   e   m   a       I   l       r   u   o   l
0000160   o       d   e   l   l   a       s   o   c   i   e   t 303 240
0000200       c   i   v   i   l   e       n   e   l   l   e       r   e
0000220   l   a   z   i   o   n   i       U   E   -   S   e   r   b   i
{noformat}
I still get "text/html" when running the test against this file.
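
For anyone who wants to repeat the byte-level check on a machine without od, the dump above can be approximated in plain Java. This is only a rough sketch: HeadDump and its methods are hypothetical names, not part of Tika, and the output format only imitates the common cases of "od -c" (printable ASCII, a few escapes, three-digit octal for everything else).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper (not part of Tika): prints leading bytes roughly like "od -c". */
final class HeadDump {

    /** Renders one byte value (0-255): printable ASCII as-is, a few escapes, otherwise octal. */
    static String cell(int b) {
        switch (b) {
            case '\n': return "\\n";
            case '\r': return "\\r";
            case '\t': return "\\t";
            case 0:    return "\\0";
            default:
                if (b >= 0x20 && b < 0x7f) {
                    return String.valueOf((char) b);
                }
                return String.format("%03o", b);
        }
    }

    /** Formats up to 'limit' bytes, 16 per row, each row prefixed with a 7-digit octal offset. */
    static String dump(byte[] data, int limit) {
        StringBuilder out = new StringBuilder();
        int n = Math.min(limit, data.length);
        for (int i = 0; i < n; i++) {
            if (i % 16 == 0) {
                if (i > 0) {
                    out.append('\n');
                }
                out.append(String.format("%07o", i));
            }
            // Each byte occupies a right-aligned field of width 4, as in od's output.
            out.append(String.format("%4s", cell(data[i] & 0xff)));
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the file to inspect, for example the attached it.html
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(dump(data, 160));
    }
}
```

Running it as "java HeadDump it.html" prints the first 160 bytes. Any stray control characters or unexpected multi-byte sequences ahead of the XML declaration would show up as escapes or octal values in the first row.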
> media type detection fails for html documents, results in text/plain instead
> of text/html
> -----------------------------------------------------------------------------------------
>
> Key: TIKA-772
> URL: https://issues.apache.org/jira/browse/TIKA-772
> Project: Tika
> Issue Type: Bug
> Components: mime
> Affects Versions: 0.10
> Reporter: Joseph Vychtrle
> Assignee: Jukka Zitting
> Labels: detection, media-type
> Attachments: html.zip, it.html, tika.png
>
>
> Hey, I was testing media type detection on most of the major document types,
> but when testing HTML documents of about 5000 words that start with:
> <?xml version="1.0" encoding="UTF-8"?>
> and are composed of a root "html" element and "p" elements only, detection
> always results in text/plain instead of text/html ...
> {code:title=Bar.java|borderStyle=solid}
> @Test
> public void testMediaType() throws Exception {
>     List<Document> allDocs = DocumentProvider.docsAsList();
>     // Documents whose detected type differs from the expected one
>     Map<Document, String> failed = new HashMap<Document, String>();
>     for (Document doc : allDocs) {
>         Tika tika = new Tika();
>         String type = tika.detect(TikaInputStream.get(doc.getFile()));
>         if (!doc.getMediaType().toString().equals(type)) {
>             failed.put(doc, type);
>         }
>     }
>
>     // Report every mismatch before failing the test
>     for (Document doc : failed.keySet()) {
>         log.error("expected: " + doc.getMediaTypeString() + "; actual: "
>                 + failed.get(doc) + "; path to file: " + doc.getFile().getAbsolutePath());
>     }
>     assertTrue(failed.isEmpty(), "mime type was incorrectly detected for : "
>             + failed.size() + " documents;");
> }
> {code}
> Am I doing anything wrong?