[ https://issues.apache.org/jira/browse/TIKA-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535917 ]
Chris A. Mattmann commented on TIKA-79:
---------------------------------------
Guys:
Why don't we put a utility method in MimeUtils to handle this functionality?
The purpose of the utility method would be to try to detect a MIME type using
all available options (URL resolution, extension ID, MIME magic, etc.).
There is currently code in Nutch at:
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/protocol/Content.java?view=markup
See the private String getContentType(String typeName, String url, byte[] data)
method at the bottom of the class to see how Nutch does this sort of failsafe
MIME resolution. Perhaps we should follow suit in Tika?
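To make the idea concrete, here is a rough sketch of what such a fallback
chain could look like. It is not the Nutch code and not an existing Tika API:
the class name MimeSniffer, its method names, the tiny lookup tables, and the
order of the checks (hint, then URL extension, then magic bytes) are all
assumptions for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not Tika's MimeUtils or Nutch's Content.
final class MimeSniffer {
    // Toy lookup tables (assumed); a real registry would be far larger.
    private static final Map<String, String> BY_EXTENSION = new LinkedHashMap<>();
    private static final Map<String, String> BY_MAGIC = new LinkedHashMap<>();

    static {
        BY_EXTENSION.put("pdf", "application/pdf");
        BY_EXTENSION.put("html", "text/html");
        BY_EXTENSION.put("rtf", "application/rtf");
        BY_MAGIC.put("%PDF-", "application/pdf");
        BY_MAGIC.put("{\\rtf", "application/rtf");
    }

    private MimeSniffer() {}

    // Try each source in turn and fall back to a generic type if all fail.
    static String detect(String typeHint, String url, byte[] data) {
        String type = fromHint(typeHint);
        if (type == null) type = fromUrl(url);
        if (type == null) type = fromMagic(data);
        return type != null ? type : "application/octet-stream";
    }

    // Strip parameters such as "; charset=UTF-8" from a content type hint.
    static String fromHint(String hint) {
        if (hint == null) return null;
        int semi = hint.indexOf(';');
        String bare = (semi >= 0 ? hint.substring(0, semi) : hint)
                .trim().toLowerCase(Locale.ROOT);
        return bare.isEmpty() ? null : bare;
    }

    // Look up the extension of the last path segment, ignoring query and fragment.
    static String fromUrl(String url) {
        if (url == null) return null;
        String path = url.split("[?#]", 2)[0];
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot <= slash || dot == path.length() - 1) return null;
        return BY_EXTENSION.get(path.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    // Compare the first few bytes against the known magic prefixes.
    static String fromMagic(byte[] data) {
        if (data == null) return null;
        String head = new String(data, 0, Math.min(data.length, 16),
                StandardCharsets.ISO_8859_1);
        for (Map.Entry<String, String> e : BY_MAGIC.entrySet()) {
            if (head.startsWith(e.getKey())) return e.getValue();
        }
        return null;
    }
}
```

Note that this sketch trusts a supplied hint outright and only consults the
magic bytes when nothing else matched. A real implementation would probably
want to cross-check the hint against the URL and the header bytes as well.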
Cheers,
Chris
> Mime type detection from file header appears to be failing.
> -----------------------------------------------------------
>
> Key: TIKA-79
> URL: https://issues.apache.org/jira/browse/TIKA-79
> Project: Tika
> Issue Type: Bug
> Components: general
> Affects Versions: 0.1-incubator
> Reporter: Keith R. Bennett
> Fix For: 0.1-incubator
>
> Attachments: AutoDetectParser.patch
>
>
> Unit tests to test the behavior of AutoDetectParser fail when byte header
> detection is needed. When correct names of resources and MIME types are
> passed into the Metadata object, the values below show what was found. Note
> that some of the document types have null for typeFromHeader:
> typeFromContentTypeHint = application/vnd.ms-excel
> typeFromResourceName = application/vnd.ms-excel
> typeFromHeader = null
> type = application/vnd.ms-excel
> typeFromContentTypeHint = text/html
> typeFromResourceName = text/html
> typeFromHeader = text/html
> type = text/html
> typeFromContentTypeHint = application/vnd.oasis.opendocument.text
> typeFromResourceName = application/vnd.oasis.opendocument.text
> typeFromHeader = application/vnd.oasis.opendocument.text
> type = application/vnd.oasis.opendocument.text
> typeFromContentTypeHint = application/pdf
> typeFromResourceName = application/pdf
> typeFromHeader = application/pdf
> type = application/pdf
> typeFromContentTypeHint = application/vnd.ms-powerpoint
> typeFromResourceName = application/vnd.ms-powerpoint
> typeFromHeader = null
> type = application/vnd.ms-powerpoint
> log4j:WARN No appenders could be found for logger (root).
> log4j:WARN Please initialize the log4j system properly.
> typeFromContentTypeHint = application/rtf
> typeFromResourceName = application/rtf
> typeFromHeader = null
> type = application/rtf
> typeFromContentTypeHint = text/plain
> typeFromResourceName = text/plain
> typeFromHeader = null
> type = text/plain
> typeFromContentTypeHint = application/msword
> typeFromResourceName = application/msword
> typeFromHeader = null
> type = application/msword
> typeFromContentTypeHint = application/xml
> typeFromResourceName = application/xml
> typeFromHeader = null
> type = application/xml
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.