Thanks!
I've created https://issues.apache.org/jira/browse/TIKA-4441. Let me know
if you need anything else from my side or if I can help in any way.
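
A possible stopgap while the fix is pending, offered only as a sketch: read the bytes once and hand the detector its own copy, so the stream that gets uploaded is never touched. The names below (StreamGuardSketch, detectWithoutConsuming, greedyDetector) are hypothetical, the detector is a stand-in rather than Tika itself, and the approach assumes files small enough to hold in memory (readAllBytes needs Java 9+).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.function.Function;

// Hypothetical helper, not part of Tika: isolates a detector from the stream to upload.
class StreamGuardSketch {

    /** The detector's answer plus a fresh, unread stream over the same bytes. */
    static final class Guarded {
        final String mimeType;
        final InputStream freshStream;

        Guarded(String mimeType, InputStream freshStream) {
            this.mimeType = mimeType;
            this.freshStream = freshStream;
        }
    }

    /**
     * Buffers the source fully in memory, runs the detector on one copy,
     * and returns a second, untouched copy for later use (e.g. an upload).
     */
    static Guarded detectWithoutConsuming(InputStream source,
                                          Function<InputStream, String> detector)
            throws IOException {
        byte[] bytes = source.readAllBytes(); // assumption: the file fits in memory
        String mimeType = detector.apply(new ByteArrayInputStream(bytes));
        return new Guarded(mimeType, new ByteArrayInputStream(bytes));
    }

    /** Stand-in for a detector that drains its input, as in the report below. */
    static String greedyDetector(InputStream in) {
        try {
            in.readAllBytes(); // consumes everything, like the misbehaving case
            return "application/octet-stream";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In real code the Function argument would presumably wrap a call such as tika.detect(stream, metadata) and translate its IOException; that wiring is an assumption and is not exercised here.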

On Mon, Jun 23, 2025 at 2:27 PM Tilman Hausherr <[email protected]>
wrote:

> On 6/23/2025 2:21 PM, Alvaro Nogueira via user wrote:
>
> Hi Tilman!
> Thanks for the quick reply. According to my tests, the issue is fixed for
> docx, odt, pptx, and xlsx, but it still happens for doc, ppt, and xls.
> I will test further and let you know if I find anything; hopefully
> that points you in the right direction.
>
> Hi,
>
> I've approved your JIRA request, please report it there and don't forget
> to include a file.
>
> Tilman
>
>
>
> Regards,
> Alvaro
>
> On Mon, Jun 23, 2025 at 11:11 AM Tilman Hausherr <[email protected]>
> wrote:
>
>> Hi,
>>
>> Please test with the unreleased 3.2.1:
>>
>> https://dist.apache.org/repos/dist/dev/tika/3.2.1/
>>
>> https://repository.apache.org/content/repositories/orgapachetika-1115/org/apache/tika
>>
>> Tilman
>>
>>
>>
>> On 6/23/2025 11:01 AM, Alvaro Nogueira via user wrote:
>>
>>
>>
>> ---------- Forwarded message ---------
>> From: Alvaro Nogueira <[email protected]>
>> Date: Mon, Jun 23, 2025 at 10:54 AM
>> Subject: InputStream consumed by Tika.detect
>> To: <[email protected]>
>>
>>
>> Hello,
>> We've been using Tika version 3.1.0 to successfully detect MimeTypes from
>> files before uploading them to our S3.
>> However, after upgrading to v3.2.0, we've noticed that the original
>> inputStream is consumed entirely for certain file extensions.
>> The affected extensions all seem to be Microsoft files, pointing us
>> to the POIFSContainerDetector, which was actually changed for this release.
>> This is the list of extensions we've tested with errors: doc, docx, odt,
>> ppt, pptx, xls, xlsx
>> And these ones work as before: bmp, csv, gif, jpeg, jpg, pdf, png, rtf,
>> svg, txt
>>
>> Here's some code to reproduce the issue:
>>
>> import java.io.IOException;
>> import java.io.InputStream;
>>
>> import org.apache.tika.Tika;
>> import org.apache.tika.io.TikaInputStream;
>> import org.apache.tika.metadata.Metadata;
>> // Assumption: ClassPathResource is Spring's; the original message does not say.
>> import org.springframework.core.io.ClassPathResource;
>>
>> class TikaBugReport {
>>
>>     // Affected extensions: doc, docx, odt, ppt, pptx, xls, xlsx
>>     public static void main(String[] args) throws IOException {
>>         String fileName = "Test.docx";
>>         InputStream inputStream = new ClassPathResource(fileName).getInputStream();
>>         checkFileMime(inputStream, fileName);
>>     }
>>
>>     public static void checkFileMime(InputStream inputStream, String fileName) {
>>         try {
>>             Tika tika = new Tika();
>>             System.out.println("InputStream available bytes before processing: " + inputStream.available());
>>             System.out.println("InputStream supports mark: " + inputStream.markSupported());
>>
>>             Metadata metadata = new Metadata();
>>
>>             TikaInputStream tikaInputStream = TikaInputStream.get(inputStream);
>>             System.out.println("Original InputStream available bytes after TikaInputStream.get(): " + inputStream.available());
>>
>>             String mimeType = tika.detect(tikaInputStream, metadata);
>>
>>             // Debug: check the state of both streams after detection
>>             System.out.println("Original InputStream available bytes after tika.detect(): " + inputStream.available());
>>             System.out.println("TikaInputStream available bytes after tika.detect(): " + tikaInputStream.available());
>>             if (inputStream.available() == 0) {
>>                 throw new IllegalStateException("InputStream is empty after tika.detect()");
>>             }
>>
>>         } catch (Exception e) {
>>             System.out.printf("Mime check exception for file '%s': [%s]%n", fileName, e.getMessage());
>>         }
>>     }
>> }
>>
>>
>> --
>> Thank you and regards,
>>
>> Álvaro Nogueira
>> Senior Software Engineer
>>
>> Disclaimer for electronic communications
>> <https://www.flywire.com/legal/disclaimer-for-electronic-communications>
>>
>>
>>
