[jira] Commented: (TIKA-79) Mime type detection from file header appears to be failing.

Chris A. Mattmann (JIRA) Thu, 18 Oct 2007 06:43:42 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535917
 ]


Chris A. Mattmann commented on TIKA-79:
---------------------------------------

Guys:

Why don't we put a utility method in MimeUtils to handle this functionality? 
The purpose of the utility method would be to try to detect a mime type using 
all available options (URL resolution, extension ID, mime magic, etc.).

There is currently code in Nutch at:

http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/protocol/Content.java?view=markup

See the private String getContentType(String typeName, String url, byte[] data) 
method at the bottom of the class to see how Nutch does this sort of failsafe 
mime resolution. Perhaps we should follow suit in Tika?
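
To make the idea concrete, here is a minimal sketch of such a fallback chain. 
It is not the Nutch code and not an existing Tika API: the class name 
MimeFallback, the method detect, the tiny extension and magic tables, and the 
order of the checks (magic bytes first, then resource name, then the declared 
content type) are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, for illustration only; not the Nutch or Tika API. */
final class MimeFallback {

    private static final String DEFAULT_TYPE = "application/octet-stream";

    // Illustrative magic prefixes; a real registry would hold many more.
    private static final byte[] PDF_MAGIC = {'%', 'P', 'D', 'F', '-'};
    private static final byte[] RTF_MAGIC = {'{', '\\', 'r', 't', 'f'};

    // Illustrative extension table (assumed, deliberately tiny).
    private static final Map<String, String> BY_EXTENSION = new HashMap<>();
    static {
        BY_EXTENSION.put("pdf", "application/pdf");
        BY_EXTENSION.put("rtf", "application/rtf");
        BY_EXTENSION.put("html", "text/html");
        BY_EXTENSION.put("txt", "text/plain");
    }

    private MimeFallback() {}

    /** Tries each source in turn and returns the first type that is found. */
    static String detect(String typeHint, String resourceName, byte[] data) {
        String type = fromMagic(data);
        if (type == null) type = fromName(resourceName);
        if (type == null) type = fromHint(typeHint);
        return type != null ? type : DEFAULT_TYPE;
    }

    // Header (magic byte) detection; null when nothing matches.
    static String fromMagic(byte[] data) {
        if (startsWith(data, PDF_MAGIC)) return "application/pdf";
        if (startsWith(data, RTF_MAGIC)) return "application/rtf";
        return null;
    }

    // Extension lookup on the text after the last dot; null when unknown.
    static String fromName(String resourceName) {
        if (resourceName == null) return null;
        int dot = resourceName.lastIndexOf('.');
        if (dot < 0) return null;
        String ext = resourceName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.get(ext);
    }

    // Declared content type with parameters such as "; charset=..." removed.
    static String fromHint(String typeHint) {
        if (typeHint == null) return null;
        int semi = typeHint.indexOf(';');
        String base = (semi >= 0 ? typeHint.substring(0, semi) : typeHint)
                .trim().toLowerCase(Locale.ROOT);
        return base.contains("/") ? base : null;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

With something like this, a null result from header detection would no longer 
leave the caller without a type: MimeFallback.detect(null, "report.pdf", 
new byte[0]) falls through to the extension lookup. Whether magic bytes 
should win over the name or the declared type is a design choice we would 
still need to agree on.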

Cheers,
 Chris


> Mime type detection from file header appears to be failing.
> -----------------------------------------------------------
>
>                 Key: TIKA-79
>                 URL: https://issues.apache.org/jira/browse/TIKA-79
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 0.1-incubator
>            Reporter: Keith R. Bennett
>             Fix For: 0.1-incubator
>
>         Attachments: AutoDetectParser.patch
>
>
> Unit tests to test the behavior of AutoDetectParser fail when byte header 
> detection is needed.  When correct names of resources and MIME types are 
> passed into the Metadata object, the values below show what was found.  Note 
> that some of the document types have null for typeFromHeader:
> typeFromContentTypeHint = application/vnd.ms-excel
> typeFromResourceName = application/vnd.ms-excel
> typeFromHeader = null
> type = application/vnd.ms-excel
> typeFromContentTypeHint = text/html
> typeFromResourceName = text/html
> typeFromHeader = text/html
> type = text/html
> typeFromContentTypeHint = application/vnd.oasis.opendocument.text
> typeFromResourceName = application/vnd.oasis.opendocument.text
> typeFromHeader = application/vnd.oasis.opendocument.text
> type = application/vnd.oasis.opendocument.text
> typeFromContentTypeHint = application/pdf
> typeFromResourceName = application/pdf
> typeFromHeader = application/pdf
> type = application/pdf
> typeFromContentTypeHint = application/vnd.ms-powerpoint
> typeFromResourceName = application/vnd.ms-powerpoint
> typeFromHeader = null
> type = application/vnd.ms-powerpoint
> log4j:WARN No appenders could be found for logger (root).
> log4j:WARN Please initialize the log4j system properly.
> typeFromContentTypeHint = application/rtf
> typeFromResourceName = application/rtf
> typeFromHeader = null
> type = application/rtf
> typeFromContentTypeHint = text/plain
> typeFromResourceName = text/plain
> typeFromHeader = null
> type = text/plain
> typeFromContentTypeHint = application/msword
> typeFromResourceName = application/msword
> typeFromHeader = null
> type = application/msword
> typeFromContentTypeHint = application/xml
> typeFromResourceName = application/xml
> typeFromHeader = null
> type = application/xml

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
