[
https://issues.apache.org/jira/browse/TIKA-4441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17986248#comment-17986248
]
Hudson commented on TIKA-4441:
------------------------------
SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk17 #787 (See
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk17/787/])
TIKA-4441 -- revert markLimit and add unit tests (#2261) (tallison:
[https://github.com/apache/tika/commit/fbac9f217088a06360fedc2017fe656d2c7a4c49])
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/detect/microsoft/POIFSContainerDetector.java
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/configs/tika-4441-neg1.xml
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/configs/tika-4441-12000000.xml
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/configs/tika-4441-120.xml
* (edit) CHANGES.txt
> InputStream is consumed by Tika.detect for certain files
> --------------------------------------------------------
>
> Key: TIKA-4441
> URL: https://issues.apache.org/jira/browse/TIKA-4441
> Project: Tika
> Issue Type: Bug
> Affects Versions: 3.2.0, 3.2.1
> Reporter: Alvaro
> Priority: Major
> Fix For: 3.2.1
>
> Attachments: Test.doc, Test.ppt, Test.xls
>
>
> Hello,
> We've been using Tika version 3.1.0 to successfully detect MIME types of
> files before uploading them to our S3.
> However, after upgrading to v3.2.0, we noticed that the original InputStream
> is consumed entirely for certain file extensions.
> The affected extensions all seem to belong to Microsoft file formats, which
> points to the POIFSContainerDetector, which was changed in this release.
> These are the extensions that fail in our tests: doc, docx, odt, ppt,
> pptx, xls, xlsx
> These still work as before: bmp, csv, gif, jpeg, jpg, pdf, png, rtf, svg,
> txt
>
> Here's some code to reproduce the issue:
> {code:java}
> import java.io.IOException;
> import java.io.InputStream;
>
> import org.apache.tika.Tika;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> // Assumption: ClassPathResource is Spring's class; the report does not say.
> import org.springframework.core.io.ClassPathResource;
>
> class TikaBugReport {
>     // affected extensions: doc, docx, odt, ppt, pptx, xls, xlsx
>     public static void main(String[] args) throws IOException {
>         String fileName = "Test.docx";
>         InputStream inputStream = new ClassPathResource(fileName).getInputStream();
>         checkFileMime(inputStream, fileName);
>     }
>
>     public static void checkFileMime(InputStream inputStream, String fileName) {
>         try {
>             Tika tika = new Tika();
>             System.out.println("InputStream available bytes before processing: "
>                     + inputStream.available());
>             System.out.println("InputStream supports mark: "
>                     + inputStream.markSupported());
>             Metadata metadata = new Metadata();
>             TikaInputStream tikaInputStream = TikaInputStream.get(inputStream);
>             System.out.println("Original InputStream available bytes after TikaInputStream.get(): "
>                     + inputStream.available());
>             String mimeType = tika.detect(tikaInputStream, metadata);
>             // Debug: Check state after detection
>             System.out.println("Original InputStream available bytes after tika.detect(): "
>                     + inputStream.available());
>             System.out.println("TikaInputStream available bytes after tika.detect(): "
>                     + tikaInputStream.available());
>             // Detection should leave the original stream readable.
>             if (inputStream.available() == 0) {
>                 throw new IllegalStateException(
>                         "InputStream is empty after TikaInputStream creation");
>             }
>         } catch (Exception e) {
>             System.out.printf("Mime check exception for file '%s': [%s]%n",
>                     fileName, e.getMessage());
>         }
>     }
> }{code}
> With version 3.2.1, the issue is fixed for most file extensions, but
> .doc, .ppt and .xls files still fail. Sample files are attached.
>
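For context on the behavior the reporter expects, here is a minimal sketch of
header-based detection that leaves the stream unconsumed by using mark/reset.
It is not Tika's implementation and does not use any Tika API: the class
PeekDetectSketch and the helper looksLikeOle2 are hypothetical names, and the
check only compares the first eight bytes against the OLE2 compound-file
signature that legacy .doc, .ppt and .xls files share.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class PeekDetectSketch {
    // OLE2 compound-file signature found at the start of legacy Office files.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Hypothetical helper: peeks at the header, then rewinds the stream. */
    static boolean looksLikeOle2(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        // Remember the current position; we read no more than the mark limit.
        in.mark(OLE2_MAGIC.length);
        try {
            byte[] header = in.readNBytes(OLE2_MAGIC.length);
            return Arrays.equals(header, OLE2_MAGIC);
        } finally {
            // Rewind so the caller sees the stream exactly as it was passed in.
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        // Fake "file": the signature followed by zero padding, 64 bytes total.
        byte[] fake = Arrays.copyOf(OLE2_MAGIC, 64);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(fake));
        int before = in.available();
        boolean ole2 = looksLikeOle2(in);
        int after = in.available();
        System.out.println("ole2=" + ole2 + ", before=" + before + ", after=" + after);
    }
}
```

The point of the sketch is the invariant, not the detection logic: the number
of available bytes is the same before and after the check, which is what the
reproduction above tests for after tika.detect().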