[ 
https://issues.apache.org/jira/browse/TIKA-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018947#comment-17018947
 ] 

Nick Burch commented on TIKA-2294:
----------------------------------

For fully accurate detection of OOXML (and other zip-based subtypes), you need 
the Tika Parsers jar on your classpath, along with its dependencies. That's 
because Tika has to look inside the zip, and potentially check some of the 
files in there, to be sure of the type.
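
To illustrate why the container has to be opened, here is a minimal sketch 
using only java.util.zip. It is not Tika's detector: the class name, the helper 
methods and the "look at the top-level folder" heuristic are all assumptions 
made for illustration, and the real checks are more thorough.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration only; this is NOT how Tika implements detection.
public class ZipPeekSketch {

    // Walk the zip entries and guess the subtype from well-known folder names.
    public static String guessType(byte[] zipBytes) {
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.startsWith("word/")) {
                    return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
                }
                if (name.startsWith("ppt/")) {
                    return "application/vnd.openxmlformats-officedocument.presentationml.presentation";
                }
                if (name.startsWith("xl/")) {
                    return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Nothing recognisable inside: all we can say is "some zip".
        return "application/zip";
    }

    // Build a small in-memory zip with the given entry names (dummy content).
    public static byte[] zipOf(String... entryNames) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(buf)) {
            for (String name : entryNames) {
                zout.putNextEntry(new ZipEntry(name));
                zout.write("x".getBytes(StandardCharsets.UTF_8));
                zout.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(guessType(zipOf("[Content_Types].xml", "word/document.xml")));
        System.out.println(guessType(zipOf("readme.txt")));
    }
}
```

Both inputs start with the same zip signature, so only the entries inside 
tell them apart.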

If best-guess detection is enough, which it probably is for this case, the 
mime-magic in Tika Core plus a filename hint should do. IIRC calling detect 
with a File object does that for you. If you are detecting on a stream, you 
need to set the filename as a hint on the metadata object passed to detection.
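
A rough sketch of the "magic bytes plus filename hint" idea, using only the 
JDK. Again this is not Tika's code: the class and method names are 
hypothetical, and the extension table is an assumption. In Tika itself the 
facade overloads Tika.detect(File) and Tika.detect(InputStream, String) are 
the likely entry points, but check the Javadoc for your version.

```java
import java.util.Locale;

// Hypothetical illustration of best-guess detection; NOT Tika's implementation.
public class FilenameHintSketch {

    // Local file header signature that opens a typical zip file.
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    // head: the first bytes of the stream; fileName: optional hint, may be null.
    public static String bestGuess(byte[] head, String fileName) {
        if (!startsWith(head, ZIP_MAGIC)) {
            return "application/octet-stream";
        }
        if (fileName != null) {
            // The magic only says "zip"; the extension narrows the guess.
            String lower = fileName.toLowerCase(Locale.ROOT);
            if (lower.endsWith(".docx")) {
                return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
            }
            if (lower.endsWith(".pptx")) {
                return "application/vnd.openxmlformats-officedocument.presentationml.presentation";
            }
            if (lower.endsWith(".xlsx")) {
                return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
            }
        }
        // No usable hint: the generic container type is the best we can do.
        return "application/zip";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = {'P', 'K', 3, 4, 20, 0};
        System.out.println(bestGuess(head, "google_doc.docx"));
        System.out.println(bestGuess(head, null));
    }
}
```

The same bytes give a different answer depending on whether a name is 
supplied, which is one way a docx read from a bare stream can come back as 
plain zip.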

> Tika inconsistently detects ooxml files as zip file sometimes
> -------------------------------------------------------------
>
>                 Key: TIKA-2294
>                 URL: https://issues.apache.org/jira/browse/TIKA-2294
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.11
>         Environment: linux
>            Reporter: chanchal
>            Assignee: Tim Allison
>            Priority: Major
>         Attachments: google_doc.docx
>
>
> Tika sometimes incorrectly detects an OOXML file as zip and sometimes 
> correctly detects it as docx/pptx/xlsx.
> Is it possible for this to happen, and if so, how?
> I cannot share the file as it has sensitive content.



