[ https://issues.apache.org/jira/browse/TIKA-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018613#comment-17018613 ]
Andrey Nizienko commented on TIKA-2294:
---------------------------------------
[~tallison] thanks for the hint. Do you mean this kind of approach with
MediaType?
{code:java}
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class TikaFileCheckMediaType {
    public static void main(String[] args) {
        File initialFile = new File("D:/google_doc.docx");
        Detector detector = TikaConfig.getDefaultConfig().getDetector();

        // try-with-resources closes both streams even if detection fails
        try (InputStream targetStream = new FileInputStream(initialFile);
             TikaInputStream stream = TikaInputStream.get(targetStream)) {
            // The resource name gives the detector the file extension as a hint
            Metadata metadata = new Metadata();
            metadata.set(Metadata.RESOURCE_NAME_KEY, initialFile.getName());
            MediaType mediaType = detector.detect(stream, metadata);
            System.out.println(mediaType.getSubtype());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
{code}
The output in this case is:
vnd.openxmlformats-officedocument.wordprocessingml.document
It looks like Metadata plays the key role here: if I call
detector.detect(stream, new Metadata());
instead, the result is zip again.
I think it's possible to add this additional check to the code, but I wonder
whether this will be available out of the box (OOTB)?
It would be handy to call tika.detect(fileContent) on its own, without the
additional Metadata check.
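For what it's worth, here is a rough, stdlib-only sketch of why a purely content-based check can land on zip: an OOXML file is a ZIP package, so its leading bytes are the same as any other ZIP, and only a look at the entry names inside tells the two apart. This is not how Tika implements detection. OoxmlSniffer, sniff and zipOf are hypothetical names made up for this illustration, and the part names checked (word/document.xml and so on) are only the conventional locations, which is an assumption since a package may place its main part elsewhere.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration only; not part of Tika.
class OoxmlSniffer {
    static final String OOXML_PREFIX = "application/vnd.openxmlformats-officedocument.";

    /** Guesses a media type from the bytes alone, with no file name hint. */
    static String sniff(byte[] content) throws IOException {
        // Step 1: magic bytes. A non-empty ZIP, OOXML included, starts with "PK\003\004".
        if (content.length < 4 || content[0] != 'P' || content[1] != 'K'
                || content[2] != 3 || content[3] != 4) {
            return "application/octet-stream";
        }
        // Step 2: collect the entry names stored in the archive.
        Set<String> names = new HashSet<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(content))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        }
        // Step 3: an OOXML package carries [Content_Types].xml plus a main part.
        // The part names below are the conventional ones (assumption).
        if (names.contains("[Content_Types].xml")) {
            if (names.contains("word/document.xml")) {
                return OOXML_PREFIX + "wordprocessingml.document";
            }
            if (names.contains("xl/workbook.xml")) {
                return OOXML_PREFIX + "spreadsheetml.sheet";
            }
            if (names.contains("ppt/presentation.xml")) {
                return OOXML_PREFIX + "presentationml.presentation";
            }
        }
        return "application/zip";
    }

    /** Builds a small in-memory ZIP with the given entry names, for the demo. */
    static byte[] zipOf(String... entryNames) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Both archives share the same magic bytes; only the entries differ.
        System.out.println(sniff(zipOf("[Content_Types].xml", "word/document.xml")));
        System.out.println(sniff(zipOf("readme.txt")));
    }
}
```

A check that stops after step 1 can only ever report zip for a docx, which would match the result I get with an empty Metadata; the file name in Metadata is what supplies the missing hint in my example above.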
> Tika inconsistently detects ooxml files as zip file sometimes
> -------------------------------------------------------------
>
> Key: TIKA-2294
> URL: https://issues.apache.org/jira/browse/TIKA-2294
> Project: Tika
> Issue Type: Bug
> Components: mime
> Affects Versions: 1.11
> Environment: linux
> Reporter: chanchal
> Assignee: Tim Allison
> Priority: Major
> Attachments: google_doc.docx
>
>
> Tika sometimes incorrectly detects ooxml file as zip and sometimes correctly
> detects as docx/pptx/xlsx.
> Is there a possibility of it happening and how?
> I cannot share the file as it has sensitive content.