Alvaro created TIKA-4441:
----------------------------
Summary: InputStream is consumed by Tika.detect for certain files
Key: TIKA-4441
URL: https://issues.apache.org/jira/browse/TIKA-4441
Project: Tika
Issue Type: Bug
Affects Versions: 3.2.0, 3.2.1
Reporter: Alvaro
Attachments: Test.doc, Test.ppt, Test.xls
Hello,
We've been using Tika 3.1.0 to detect the MIME types of files before uploading
them to S3.
However, after upgrading to v3.2.0, we've noticed that the original InputStream
is consumed entirely for certain file extensions.
The affected extensions are mostly Microsoft Office formats, which points us to
the POIFSContainerDetector, which was changed in this release.
Extensions that fail in our tests: doc, docx, odt, ppt, pptx, xls, xlsx
Extensions that still work as before: bmp, csv, gif, jpeg, jpg, pdf, png, rtf,
svg, txt
Here's some code to reproduce the issue:
{code:java}
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.springframework.core.io.ClassPathResource;

class TikaBugReport {

    // Affected extensions: doc, docx, odt, ppt, pptx, xls, xlsx
    public static void main(String[] args) throws IOException {
        String fileName = "Test.docx";
        InputStream inputStream = new ClassPathResource(fileName).getInputStream();
        checkFileMime(inputStream, fileName);
    }

    public static void checkFileMime(InputStream inputStream, String fileName) {
        try {
            Tika tika = new Tika();
            System.out.println("InputStream available bytes before processing: "
                    + inputStream.available());
            System.out.println("InputStream supports mark: "
                    + inputStream.markSupported());

            Metadata metadata = new Metadata();
            TikaInputStream tikaInputStream = TikaInputStream.get(inputStream);
            System.out.println("Original InputStream available bytes after TikaInputStream.get(): "
                    + inputStream.available());

            String mimeType = tika.detect(tikaInputStream, metadata);

            // Debug: check the state of both streams after detection
            System.out.println("Original InputStream available bytes after tika.detect(): "
                    + inputStream.available());
            System.out.println("TikaInputStream available bytes after tika.detect(): "
                    + tikaInputStream.available());

            // With 3.2.0 this fails for the affected extensions
            if (inputStream.available() == 0) {
                throw new IllegalStateException("InputStream is empty after tika.detect()");
            }
        } catch (Exception e) {
            System.out.printf("Mime check exception for file '%s': [%s]%n",
                    fileName, e.getMessage());
        }
    }
}
{code}
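To make the expectation concrete: as far as we understand it, detection is meant to mark the stream, read a prefix, and reset, so the caller's position does not change. Below is a minimal sketch of that behaviour using only java.io. The names `MarkResetSketch` and `peek` are hypothetical and are not Tika code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkResetSketch {

    // Reads up to `limit` bytes and then rewinds, leaving the stream position unchanged.
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        try {
            return in.readNBytes(limit);
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        // First four bytes mimic the start of an OLE2 header; the rest is filler.
        byte[] data = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, 1, 2, 3, 4};
        InputStream in = new ByteArrayInputStream(data);
        int before = in.available();
        byte[] head = peek(in, 4);
        System.out.println("peeked " + head.length + " bytes, available unchanged: "
                + (in.available() == before));
    }
}
```

This is the behaviour we saw with 3.1.0 for every extension listed above.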
With version 3.2.1, the issue is fixed for most file extensions, but .doc,
.ppt and .xls still fail. Sample files are attached.
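One possible stopgap on the caller's side, assuming the files are small enough to hold in memory, is to buffer the content once and hand a fresh stream to each consumer. This is only a sketch: `BufferedUploadSketch` and `ReplayableContent` are hypothetical names, and it does not address the detector behaviour itself.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class BufferedUploadSketch {

    // Holds the full content so that each consumer gets its own independent stream.
    static final class ReplayableContent {
        private final byte[] bytes;

        ReplayableContent(InputStream source) throws IOException {
            this.bytes = source.readAllBytes();
        }

        InputStream newStream() {
            return new ByteArrayInputStream(bytes);
        }

        int size() {
            return bytes.length;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "example content".getBytes(StandardCharsets.UTF_8);
        ReplayableContent content = new ReplayableContent(new ByteArrayInputStream(data));

        // The first consumer (for example, MIME detection) may drain its stream completely.
        content.newStream().readAllBytes();

        // The second consumer (for example, the upload) still sees every byte.
        System.out.println(content.newStream().available() == content.size());
    }
}
```

In the reproducer above, one stream from `newStream()` would go to `tika.detect(...)` and a second one to the upload, so a drained detection stream no longer affects the uploaded bytes.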
--
This message was sent by Atlassian Jira
(v8.20.10#820010)