Jukka Zitting (Commented) (JIRA) Sat, 05 Nov 2011 15:01:13 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144849#comment-13144849
 ]


Jukka Zitting commented on TIKA-772:
------------------------------------

The latter method also makes the .html suffix available to the detector, which 
helps Tika guess the type of the document. That said, Tika should be able to 
detect the correct type with the former version as well.
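
To illustrate why the file name can matter, here is a toy sketch of a detector 
that looks at the content first and uses the name only as an extra hint. This 
is not Tika's actual detection logic; the class SuffixHintSketch and its 
detect helper are made up for this example, and the rules are deliberately 
simplistic.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class SuffixHintSketch {

    // Hypothetical helper, NOT part of Tika: guess a media type from the
    // first bytes of a document plus an optional file name.
    public static String detect(byte[] prefix, String fileName) {
        String head = new String(prefix, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);

        // Content-based guess: look for an <html root element near the start.
        if (head.contains("<html")) {
            return "text/html";
        }

        // Name-based hint: only available when the caller passes a file name.
        if (fileName != null) {
            String name = fileName.toLowerCase(Locale.ROOT);
            if (name.endsWith(".html") || name.endsWith(".htm")) {
                return "text/html";
            }
        }

        // Nothing recognisable: fall back to plain text.
        return "text/plain";
    }

    public static void main(String[] args) {
        byte[] withRoot =
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<html><p>hi</p></html>"
                        .getBytes(StandardCharsets.UTF_8);
        byte[] fragment = "<p>just a fragment</p>".getBytes(StandardCharsets.UTF_8);

        System.out.println(detect(withRoot, null));      // content alone
        System.out.println(detect(fragment, null));      // no name hint
        System.out.println(detect(fragment, "it.html")); // name hint supplied
    }
}
```

In this sketch a bare stream only has its bytes to go on, while a call that 
also carries a name gets a second chance through the suffix.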

Can you check what output you get from the following two commands:

{code}
$ java -jar tika-app-0.10.jar --detect < it.html
$ java -jar tika-app-0.10.jar --detect it.html
{code}

These calls are roughly equivalent to the two method calls you mentioned. On my 
computer both return text/html.
                
> media type detection fails for html documents, results in text/plain instead 
> of text/html
> -----------------------------------------------------------------------------------------
>
>                 Key: TIKA-772
>                 URL: https://issues.apache.org/jira/browse/TIKA-772
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 0.10
>            Reporter: Joseph Vychtrle
>            Assignee: Jukka Zitting
>              Labels: detection, media-type
>         Attachments: html.zip, tika.png
>
>
> Hey, I was testing media type detection on most of the major document types, 
> but when testing HTML documents of about 5000 words that start with:
> <?xml version="1.0" encoding="UTF-8"?>
> and consist of a root "html" element and "p" elements only, the result is 
> always text/plain instead of text/html ...
> {code:title=Bar.java|borderStyle=solid}
> @Test
> public void testMediaType() throws Exception {
>     List<Document> allDocs = DocumentProvider.docsAsList();
>     Map<Document, String> failed = new HashMap<Document, String>();
>     for (Document doc : allDocs) {
>         Tika tika = new Tika();
>         String type = tika.detect(TikaInputStream.get(doc.getFile()));
>         if (!doc.getMediaType().toString().equals(type)) {
>             failed.put(doc, type);
>         }
>     }
>
>     for (Document doc : failed.keySet()) {
>         log.error("expected: " + doc.getMediaTypeString() + "; actual: "
>                 + failed.get(doc) + ";  path to file: "
>                 + doc.getFile().getAbsolutePath());
>     }
>     assertTrue(failed.isEmpty(), "mime type was incorrectly detected for : "
>             + failed.size() + " documents;");
> }
> {code}
> Am I doing anything wrong?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
