[
https://issues.apache.org/jira/browse/TIKA-4469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18015108#comment-18015108
]
Tim Allison edited comment on TIKA-4469 at 8/20/25 10:47 AM:
-------------------------------------------------------------
Yes, that would be really bad. I think the problem is with dependencies and not
with 3.2.2.
How are you managing dependencies and is the most recent version of
commons-compress on your class path?
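One way to answer the class path question from inside the running application is to ask the JVM where it loaded the class from. This is only a minimal sketch: the class name {{CompressClassPathCheck}} and the method {{describe}} are hypothetical, and it uses reflection so that it compiles even without commons-compress present.

```java
import java.security.CodeSource;

class CompressClassPathCheck {

    // Hypothetical helper: reports the superclass of the named class and the
    // location (usually a jar) it was loaded from, or says that it is missing.
    static String describe(String className) {
        try {
            Class<?> c = Class.forName(className);
            Class<?> parent = c.getSuperclass();
            String parentName = (parent == null) ? "(none)" : parent.getName();
            // getCodeSource() is null for classes loaded by the bootstrap loader
            CodeSource src = c.getProtectionDomain().getCodeSource();
            String where = (src == null) ? "unknown (bootstrap)" : String.valueOf(src.getLocation());
            return className + " extends " + parentName + ", loaded from " + where;
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " not found on class path";
        }
    }

    public static void main(String[] args) {
        // The fully qualified name is taken from the stack trace in this issue.
        System.out.println(describe("org.apache.commons.compress.archivers.ArchiveException"));
    }
}
```

The jar name in the printed location should show which commons-compress version actually won dependency resolution.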
commons-compress' {{ArchiveException}} used to extend {{Exception}} (which we
do not catch in {{detectArchiveFormat()}}). As of compress 1.28.0,
{{ArchiveException}} extends {{IOException}}, which we do catch.
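To illustrate why the superclass matters, here is a minimal, self-contained sketch. It is not the Tika code: {{OldStyleArchiveException}}, {{NewStyleArchiveException}}, {{outcome}} and the simplified {{detectArchiveFormat}} are hypothetical stand-ins, and the {{application/octet-stream}} fallback value is an assumption made for the example.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

class ArchiveCatchSketch {

    // Hypothetical stand-in for ArchiveException before commons-compress 1.28.0
    static class OldStyleArchiveException extends Exception {
        OldStyleArchiveException(String msg) { super(msg); }
    }

    // Hypothetical stand-in for ArchiveException as of commons-compress 1.28.0
    static class NewStyleArchiveException extends IOException {
        NewStyleArchiveException(String msg) { super(msg); }
    }

    // Simplified model of a detector that only catches IOException.
    // The fallback value is an assumption for this sketch.
    static String detectArchiveFormat(Callable<String> archiveDetect) throws Exception {
        try {
            return archiveDetect.call();
        } catch (IOException e) {
            return "application/octet-stream";
        }
    }

    // Reports whether detection returned a type or the exception escaped.
    static String outcome(Callable<String> archiveDetect) {
        try {
            return "detected " + detectArchiveFormat(archiveDetect);
        } catch (Exception e) {
            return "escaped: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // New hierarchy: the IOException catch handles it and detection falls back.
        System.out.println(outcome(() -> {
            throw new NewStyleArchiveException("No Archiver found for the stream signature");
        }));
        // Old hierarchy: the exception is not an IOException, so it propagates.
        System.out.println(outcome(() -> {
            throw new OldStyleArchiveException("No Archiver found for the stream signature");
        }));
    }
}
```

With the old hierarchy the exception passes straight through the {{IOException}} handler, which would match the stack trace in the report if an older commons-compress is being resolved.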
was (Author: [email protected]):
How are you managing dependencies and is the most recent version of
commons-compress on your class path?
ArchiveException used to extend Exception (which we no longer catch). As of
1.28.0, ArchiveException extends IOException, which we do catch.
> After upgrading to 3.2.2 most files are incorrectly treated as Archive's by
> AutoDetectParser
> --------------------------------------------------------------------------------------------
>
> Key: TIKA-4469
> URL: https://issues.apache.org/jira/browse/TIKA-4469
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 3.2.2
> Reporter: Rob Vesse
> Priority: Major
> Attachments: test.pdf
>
>
> We had an application that was working fine with 3.2.1. After Dependabot
> suggested an upgrade to 3.2.2, the builds for that PR were failing. On
> investigation it was found that with Tika 3.2.2 the {{AutoDetectParser}}
> seems to treat every file as potentially being an archive file and then fails
> because it actually isn't:
> {noformat}
> Caused by: org.apache.commons.compress.archivers.ArchiveException: No Archiver found for the stream signature
>     at org.apache.commons.compress.archivers.ArchiveStreamFactory.detect(ArchiveStreamFactory.java:295)
>     at org.apache.tika.detect.zip.DefaultZipContainerDetector.detectArchiveFormat(DefaultZipContainerDetector.java:122)
>     at org.apache.tika.detect.zip.DefaultZipContainerDetector.detect(DefaultZipContainerDetector.java:180)
>     at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:179)
> {noformat}
> Code is pretty straightforward (simplified to take out some application
> implementation detail):
> {noformat}
> Metadata tikaMetadata = new Metadata();
> tikaMetadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, "test.pdf");
> tikaMetadata.set("Content-Type", "application/pdf");
> // Use a BodyContentHandler as we just want the textual output
> BodyContentHandler handler = new BodyContentHandler(-1);
> // Prepare a Tika parse context
> ParseContext context = new ParseContext();
> // Actually parse the document and then produce the output event
> // NB - input here in real code is a ByteArrayInputStream as these documents
> // are coming to our code via a Kafka topic
> AutoDetectParser parser = new AutoDetectParser();
> parser.parse(input, handler, tikaMetadata, context);
> {noformat}
> I have attached the example {{test.pdf}} to this ticket. Note that this bug
> happens with all file types, including things like plain text.
> The "fix" for using 3.2.2 seems to be to change how we set the Tika metadata
> to instead use {{TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE}}.
> However, if the file isn't successfully detected as an archive I would expect
> Tika to fall back to trying other content detectors rather than bailing out
> early, as this was the behaviour prior to 3.2.2 and tests with this file, and
> other files, were working fine prior to 3.2.2.
> I suspect this bug is most likely related to the fix for TIKA-4424