[ 
https://issues.apache.org/jira/browse/TIKA-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018613#comment-17018613
 ] 

Andrey Nizienko commented on TIKA-2294:
---------------------------------------

[~tallison] thanks for the hint. Do you mean this kind of approach with 
MediaType:

{code:java}
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.mime.MediaType;
import org.apache.tika.metadata.Metadata;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;


public class TikaFileCheckMediaType {

    public static void main(String[] args) {
        File initialFile = new File("D:/google_doc.docx");
        // try-with-resources closes both streams, even if detection fails
        try (InputStream targetStream = new FileInputStream(initialFile);
             TikaInputStream stream = TikaInputStream.get(targetStream)) {
            TikaConfig config = TikaConfig.getDefaultConfig();
            Detector detector = config.getDetector();

            // Pass the file name along as a detection hint
            Metadata metadata = new Metadata();
            metadata.add(Metadata.RESOURCE_NAME_KEY, "google_doc.docx");
            MediaType mediaType = detector.detect(stream, metadata);

            System.out.println(mediaType.getSubtype());
        } catch (IOException e) {
            System.out.println(e);
        }
    }
}
{code}

The output in this case is:
vnd.openxmlformats-officedocument.wordprocessingml.document

It looks like Metadata plays the key role here: if I call 
detector.detect(stream, new Metadata()); 
instead, the output is zip again.

I think it's possible to add this additional check to the code, but I wonder 
whether this will be available out of the box (OOTB)?
It would be handy to call only tika.detect(fileContent), without the additional 
Metadata step.
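
For illustration only, here is a rough stdlib-only sketch of what a purely content-based check would have to do. The bytes of a .docx normally start with the same ZIP magic as any other archive, so the first bytes alone can only say "zip"; telling docx apart means looking at the entries inside the container. This is not Tika's implementation. The class and method names (OoxmlSniffSketch, sniff, zipOf) are made up for this example, and the entry names it checks are the conventional OOXML part names, which is an assumption rather than a guarantee for every file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class OoxmlSniffSketch {

    /** Hypothetical helper: classifies bytes by ZIP magic plus conventional OOXML entry names. */
    public static String sniff(byte[] content) throws IOException {
        // A non-empty ZIP archive (and therefore an OOXML file) normally starts with "PK\003\004".
        boolean zipMagic = content.length >= 4
                && content[0] == 'P' && content[1] == 'K'
                && content[2] == 3 && content[3] == 4;
        if (!zipMagic) {
            return "unknown";
        }
        // The magic bytes alone cannot distinguish docx from zip, so list the entries.
        Set<String> names = new HashSet<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(content))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        }
        if (!names.contains("[Content_Types].xml")) {
            return "zip";
        }
        // Conventional main-part locations (assumed; a file may place them elsewhere).
        if (names.contains("word/document.xml")) return "docx";
        if (names.contains("xl/workbook.xml")) return "xlsx";
        if (names.contains("ppt/presentation.xml")) return "pptx";
        return "zip";
    }

    /** Builds a tiny in-memory ZIP with the given entry names (demo data only). */
    public static byte[] zipOf(String... entryNames) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes("UTF-8"));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(sniff(zipOf("[Content_Types].xml", "word/document.xml"))); // prints "docx"
        System.out.println(sniff(zipOf("readme.txt")));                               // prints "zip"
    }
}
```

If content-only detection in Tika relies on a container-aware detector doing something along these lines, then whether that detector is present in the default configuration might explain why the name hint makes the difference here, but that is a guess on my side.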

> Tika inconsistently detects ooxml files as zip file sometimes
> -------------------------------------------------------------
>
>                 Key: TIKA-2294
>                 URL: https://issues.apache.org/jira/browse/TIKA-2294
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.11
>         Environment: linux
>            Reporter: chanchal
>            Assignee: Tim Allison
>            Priority: Major
>         Attachments: google_doc.docx
>
>
> Tika sometimes incorrectly detects  ooxml file as zip and sometimes correctly 
> detects as docx/pptx/xlsx.
> Is there a possibility of it happening and how?
> I cannot share the file as it has sensitive content.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
