Thanks! I've created https://issues.apache.org/jira/browse/TIKA-4441. Let me know if you need anything else from my side or if I can help in any way.
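
In case it's useful to anyone else hitting this before a fix lands: one way to sidestep the problem would be to buffer the file and run detection on a copy, so it no longer matters whether the detector drains the stream it is given. This is only a minimal sketch, not something taken from the Tika docs. It assumes the file fits in memory, `Detector` is a hypothetical stand-in for the `tika.detect(...)` call, and `BufferedDetection` / `detectOnCopy` are names invented here for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class BufferedDetection {

    /** Hypothetical stand-in for a detection call such as tika.detect(stream). */
    @FunctionalInterface
    interface Detector {
        String detect(InputStream in) throws IOException;
    }

    /** Detected type plus the buffered bytes, so callers can open unread streams. */
    static final class Result {
        final String mimeType;
        final byte[] bytes;

        Result(String mimeType, byte[] bytes) {
            this.mimeType = mimeType;
            this.bytes = bytes;
        }

        /** Returns a new stream positioned at the first byte of the buffered content. */
        InputStream freshStream() {
            return new ByteArrayInputStream(bytes);
        }
    }

    /**
     * Reads the whole source into memory, then lets the detector consume
     * a throwaway copy. The buffered bytes are unaffected by whatever the
     * detector does to its stream.
     */
    static Result detectOnCopy(InputStream source, Detector detector) throws IOException {
        byte[] bytes = source.readAllBytes(); // requires Java 9 or later
        String mimeType;
        try (InputStream copy = new ByteArrayInputStream(bytes)) {
            mimeType = detector.detect(copy);
        }
        return new Result(mimeType, bytes);
    }

    public static void main(String[] args) throws IOException {
        // A deliberately greedy fake detector that drains its input,
        // mimicking the behaviour described in this thread.
        Detector greedy = in -> {
            in.readAllBytes();
            return "application/octet-stream";
        };

        InputStream original =
                new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        Result result = detectOnCopy(original, greedy);

        System.out.println(result.mimeType
                + ", bytes still available: " + result.freshStream().available());
    }
}
```

The caller would then upload `result.freshStream()` instead of the stream that was handed to detection. For large files the same idea would need a temporary file rather than a byte array.
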

On Mon, Jun 23, 2025 at 2:27 PM Tilman Hausherr <[email protected]> wrote:

> On 6/23/2025 2:21 PM, Alvaro Nogueira via user wrote:
>
> Hi Tilman!
> Thanks for the quick reply. According to my tests, the issue is fixed for
> docx, odt, pptx, and xlsx, but still happening for doc, ppt and xls
> extensions
> I will test it further and let you know if I find anything, but hopefully
> that can point you in the right direction
>
> Hi,
>
> I've approved your JIRA request, please report it there and don't forget
> to include a file.
>
> Tilman
>
> Regards,
> Alvaro
>
> On Mon, Jun 23, 2025 at 11:11 AM Tilman Hausherr <[email protected]>
> wrote:
>
>> Hi,
>>
>> Please test with the unreleased 3.2.1:
>>
>> https://dist.apache.org/repos/dist/dev/tika/3.2.1/
>>
>> https://repository.apache.org/content/repositories/orgapachetika-1115/org/apache/tika
>>
>> Tilman
>>
>> On 6/23/2025 11:01 AM, Alvaro Nogueira via user wrote:
>>
>> ---------- Forwarded message ---------
>> From: Alvaro Nogueira <[email protected]>
>> Date: Mon, Jun 23, 2025 at 10:54 AM
>> Subject: InputStream consumed by Tika.detect
>> To: <[email protected]>
>>
>> Hello,
>> We've been using Tika version 3.1.0 to successfully detect MimeTypes from
>> files before uploading them to our S3.
>> However, after v3.2.0 upgrade, we've noticed that the original
>> inputStream is being consumed entirely for certain file extensions.
>> The affected extensions seem to be all for Microsoft files, pointing us
>> to the POIFSContainerDetector, which was actually changed for this release.
>> This is the list of extensions we've tested with errors: doc, docx, odt,
>> ppt, pptx, xls, xlsx
>> And these ones work as before: bmp, csv, gif, jpeg, jpg, pdf, png, rtf,
>> svg, txt
>>
>> Here's some code to reproduce the issue:
>>
>> class TikaBugReport {
>>
>>     // affected extensions: doc, docx, odt, ppt, pptx, xls, xlsx
>>     public static void main(String[] args) throws IOException {
>>         String fileName = "Test.docx";
>>         InputStream inputStream = new ClassPathResource(fileName).getInputStream();
>>         checkFileMime(inputStream, fileName);
>>     }
>>
>>     public static void checkFileMime(InputStream inputStream, String fileName) {
>>         try {
>>             Tika tika = new Tika();
>>             System.out.println("InputStream available bytes before processing: " + inputStream.available());
>>             System.out.println("InputStream supports mark: " + inputStream.markSupported());
>>
>>             Metadata metadata = new Metadata();
>>
>>             TikaInputStream tikaInputStream = TikaInputStream.get(inputStream);
>>             System.out.println("Original InputStream available bytes after TikaInputStream.get(): " + inputStream.available());
>>
>>             String mimeType = tika.detect(tikaInputStream, metadata);
>>
>>             // Debug: Check state after detection
>>             System.out.println("Original InputStream available bytes after tika.detect(): " + inputStream.available());
>>             System.out.println("TikaInputStream available bytes after tika.detect(): " + tikaInputStream.available());
>>             if (inputStream.available() == 0) {
>>                 throw new IllegalStateException("InputStream is empty after TikaInputStream creation");
>>             }
>>
>>         } catch (Exception e) {
>>             System.out.printf("Mime check exception for file '%s': [%s]%n", fileName, e.getMessage());
>>         }
>>     }
>> }
>>
>> --
>> Thank you and regards,
>>
>> Álvaro Nogueira
>> Senior Software Engineer
>>
>> Disclaimer for electronic communications
>> <https://www.flywire.com/legal/disclaimer-for-electronic-communications>
