
Joseph Vychtrle (Commented) (JIRA) Sat, 05 Nov 2011 15:17:15 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144851#comment-13144851
 ]


Joseph Vychtrle commented on TIKA-772:
--------------------------------------

Weird, the command-line app detects the file correctly, both from stdin and from a path:
{noformat}
java -jar tika-app-0.10.jar --detect < /tmp/docProv/html/it.html 
text/html
java -jar tika-app-0.10.jar --detect /tmp/docProv/html/it.html 
text/html
{noformat}

You can reproduce it like this:
{code}
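@Test
public void test2Tika() throws Exception {
    File file = new File("/tmp/docProv/html/it.html");
    Tika tika = new Tika();
    // Detect from a stream, as in the failing test, and close it afterwards.
    TikaInputStream stream = TikaInputStream.get(file);
    try {
        String type = tika.detect(stream);
        System.out.println(type);
    } finally {
        stream.close();
    }
}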
{code}

Output:
{noformat}text/plain{noformat}
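
To narrow this down, it may help to check what the first bytes of the file look like when only the content is available. The following is a minimal standalone sketch and not Tika's detection logic: the class and method names (PrefixSniffer, hasXmlDeclaration, firstElementName) are hypothetical, and the sample string only imitates the document shape described in the report (an XML declaration followed by a root "html" element with "p" children).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: inspects the leading bytes of a document.
final class PrefixSniffer {

    /** True if the prefix begins with an XML declaration, ignoring leading whitespace. */
    static boolean hasXmlDeclaration(byte[] prefix) {
        String head = new String(prefix, StandardCharsets.UTF_8).trim();
        return head.startsWith("<?xml");
    }

    /**
     * Name of the first element that follows any XML declaration, comment or
     * DOCTYPE, or null if the prefix contains no element start tag.
     */
    static String firstElementName(byte[] prefix) {
        String head = new String(prefix, StandardCharsets.UTF_8);
        int pos = 0;
        while (true) {
            int lt = head.indexOf('<', pos);
            if (lt < 0 || lt + 1 >= head.length()) {
                return null;
            }
            String terminator;
            if (head.startsWith("<!--", lt)) {
                terminator = "-->";          // comment
            } else if (head.charAt(lt + 1) == '?') {
                terminator = "?>";           // XML declaration or processing instruction
            } else if (head.charAt(lt + 1) == '!') {
                terminator = ">";            // DOCTYPE (internal subsets are not handled)
            } else {
                // Start tag: collect the element name.
                int i = lt + 1;
                while (i < head.length()
                        && (Character.isLetterOrDigit(head.charAt(i)) || head.charAt(i) == ':')) {
                    i++;
                }
                return i > lt + 1 ? head.substring(lt + 1, i) : null;
            }
            int end = head.indexOf(terminator, lt);
            if (end < 0) {
                return null;                 // construct is cut off by the end of the prefix
            }
            pos = end + terminator.length();
        }
    }

    public static void main(String[] args) {
        // Sample imitating the reported document shape; not the actual attachment.
        byte[] sample = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<html><p>ciao</p></html>").getBytes(StandardCharsets.UTF_8);
        System.out.println("XML declaration: " + hasXmlDeclaration(sample));
        System.out.println("first element:   " + firstElementName(sample));
    }
}
```

Running the same two checks on the first few kilobytes of it.html would show whether the file really has the shape described in the report, which is useful to know before comparing the command-line result with the facade result.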
                
> media type detection fails for html documents, results in text/plain instead 
> of text/html
> -----------------------------------------------------------------------------------------
>
>                 Key: TIKA-772
>                 URL: https://issues.apache.org/jira/browse/TIKA-772
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 0.10
>            Reporter: Joseph Vychtrle
>            Assignee: Jukka Zitting
>              Labels: detection, media-type
>         Attachments: html.zip, tika.png
>
>
> Hey, I was testing media type detection on most of the major document types, 
> but when testing HTML documents of about 5000 words that start with:
> <?xml version="1.0" encoding="UTF-8"?>
> and are composed of a root "html" element and "p" elements only, detection 
> always results in text/plain instead of text/html.
> {code:title=Bar.java|borderStyle=solid}
> @Test
> public void testMediaType() throws Exception {
>     List<Document> allDocs = DocumentProvider.docsAsList();
>     Map<Document, String> failed = new HashMap<Document, String>();
>     for (Document doc : allDocs) {
>         Tika tika = new Tika();
>         String type = tika.detect(TikaInputStream.get(doc.getFile()));
>         if (!doc.getMediaType().toString().equals(type)) {
>             failed.put(doc, type);
>         }
>     }
>
>     for (Document doc : failed.keySet()) {
>         log.error("expected: " + doc.getMediaTypeString() + "; actual: "
>                 + failed.get(doc) + ";  path to file: " + doc.getFile().getAbsolutePath());
>     }
>     assertTrue(failed.isEmpty(), "mime type was incorrectly detected for : "
>             + failed.size() + " documents;");
> }
> {code}
> Am I doing anything wrong?

