[ https://issues.apache.org/jira/browse/TIKA-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644466#comment-13644466 ]
Nick Burch commented on TIKA-1116:
----------------------------------
Detecting Office file formats with MIME magic alone isn't possible with 100%
accuracy. If you need that, you need to allow the use of
POIFSContainerDetector, which works out the type based on the actual contents
of the container.
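
To illustrate why magic alone falls short, here is a minimal sketch using only
the JDK. It checks for the 8-byte prefix quoted in the report below (the same
bytes, written in hex rather than octal). The class and method names are
hypothetical and are not part of Tika. A match only tells you the file is an
OLE2 compound document; it cannot tell DOC from XLS or PPT.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration only; not a Tika class.
public class Ole2Magic {
    // \320\317\021\340\241\261\032\341 from tika-mimetypes.xml, in hex.
    static final byte[] OLE2_SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if the buffer starts with the compound-file signature.
    static boolean hasOle2Signature(byte[] header) {
        return header.length >= OLE2_SIGNATURE.length
            && Arrays.equals(
                   Arrays.copyOf(header, OLE2_SIGNATURE.length),
                   OLE2_SIGNATURE);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java Ole2Magic <file>");
            return;
        }
        byte[] header;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // Read at most the first 8 bytes (requires Java 11+).
            header = in.readNBytes(OLE2_SIGNATURE.length);
        }
        // A DOC, an XLS and a PPT would all report the same result here.
        System.out.println(hasOle2Signature(header)
            ? "OLE2 compound file (exact type unknown)"
            : "not an OLE2 compound file");
    }
}
```

Telling the formats apart means looking at the entries inside the container,
which is the job POIFSContainerDetector does.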
> Wrong detection of XLS/Doc file
> -------------------------------
>
> Key: TIKA-1116
> URL: https://issues.apache.org/jira/browse/TIKA-1116
> Project: Tika
> Issue Type: Bug
> Components: mime
> Affects Versions: 1.3, 1.4
> Reporter: Petr Pytelka
> Labels: DOC, XLS
>
> My issue:
> I have a valid XLS file, and it is detected as DOC.
> Cause:
> tika-mimetypes.xml contains these lines:
> <mime-type type="application/msword">
> ..
> <match value="\320\317\021\340\241\261\032\341" type="string" offset="0"/>
> ..
> </mime-type>
> According to the Microsoft documentation, this prefix can appear in any
> Compound Binary file (DOC, XLS, PPT and others).
> See section 2.1 (Header) of:
> http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/WindowsCompoundBinaryFileFormatSpecification.pdf
> My proposal is to remove the line
> <match value="\320\317\021\340\241\261\032\341" type="string" offset="0"/>
> from tika-mimetypes.xml.