[ 
https://issues.apache.org/jira/browse/TIKA-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830551#comment-17830551
 ] 

Tim Allison edited comment on TIKA-4221 at 3/25/24 5:09 PM:
------------------------------------------------------------

This is caused by a modification of pack200's Archive class. In 
commons-compress 1.25.0, the input stream was wrapped in a 
CloseShieldInputStream and then not closed. Starting in 1.26.0, there's code 
that unwraps FilterInputStreams to get down to the source stream. That 
unwrapping defeats the CloseShieldInputStream, so the underlying stream is 
closed.
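
To illustrate the mechanism, here is a minimal, self-contained sketch. It does 
not use the real commons-compress or commons-io classes: Shield, 
TrackingStream, and unwrap are hypothetical stand-ins, and the actual 
unwrapping code in 1.26.0 may differ in detail.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class CloseShieldDefeated {

    /** Source stream that records whether close() ever reached it. */
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Hypothetical stand-in for a close shield built on FilterInputStream. */
    static class Shield extends FilterInputStream {
        Shield(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Deliberately a no-op: callers must not close the wrapped stream.
        }

        /** Exposes the protected 'in' field so the sketch can unwrap it. */
        InputStream delegate() {
            return in;
        }
    }

    /** Hypothetical library code that peels off wrappers to reach the source. */
    static InputStream unwrap(InputStream in) {
        InputStream current = in;
        while (current instanceof Shield) {
            current = ((Shield) current).delegate();
        }
        return current;
    }

    public static void main(String[] args) throws IOException {
        // Closing through the shield leaves the source open.
        TrackingStream a = new TrackingStream(new byte[] {1, 2, 3});
        new Shield(a).close();
        System.out.println(a.closed); // false

        // Unwrapping first and then closing bypasses the shield.
        TrackingStream b = new TrackingStream(new byte[] {1, 2, 3});
        unwrap(new Shield(b)).close();
        System.out.println(b.closed); // true
    }
}
```

The second case is the situation described above: once a library reaches 
through the wrapper to the source stream, the no-op close() on the wrapper is 
never consulted.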

See: 
https://github.com/apache/commons-compress/blob/68cd2e7fb488b4ad8a9fdc81cae97ae6e8248ea5/src/main/java/org/apache/commons/compress/harmony/unpack200/Pack200UnpackerAdapter.java#L66

This only causes problems when a pack200 file is embedded in another file that 
is read with an ArchiveInputStream, which is why it is happening so rarely in 
our corpus.

That said, this is less than ideal.

We can probably work around this by writing our own CloseShieldInputStream that 
doesn't extend FilterInputStream. 
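
A rough sketch of what such a wrapper could look like follows. The class name 
NonFilterCloseShield is hypothetical, and this is not the code that went into 
Tika. It also assumes that the unwrapping in commons-compress only looks for 
FilterInputStream instances, as described above.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical close shield that extends InputStream directly. */
final class NonFilterCloseShield extends InputStream {

    private final InputStream delegate;

    NonFilterCloseShield(InputStream delegate) {
        this.delegate = delegate;
    }

    @Override
    public int read() throws IOException {
        return delegate.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return delegate.read(b, off, len);
    }

    @Override
    public long skip(long n) throws IOException {
        return delegate.skip(n);
    }

    @Override
    public int available() throws IOException {
        return delegate.available();
    }

    @Override
    public void mark(int readlimit) {
        delegate.mark(readlimit);
    }

    @Override
    public void reset() throws IOException {
        delegate.reset();
    }

    @Override
    public boolean markSupported() {
        return delegate.markSupported();
    }

    @Override
    public void close() {
        // Deliberately a no-op: the owner of the delegate closes it.
    }

    public static void main(String[] args) throws IOException {
        final boolean[] closed = {false};
        InputStream source = new ByteArrayInputStream(new byte[] {'a', 'b'}) {
            @Override
            public void close() throws IOException {
                closed[0] = true;
                super.close();
            }
        };

        InputStream shield = new NonFilterCloseShield(source);
        shield.read();   // consumes 'a'
        shield.close();  // must not reach the source

        System.out.println(shield instanceof FilterInputStream); // false
        System.out.println(closed[0]);                           // false
        System.out.println(source.read());                       // 98 ('b')
    }
}
```

Because the wrapper is not a FilterInputStream, code that tests for that type 
has nothing to unwrap, and the only close() it can call is the no-op one.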


was (Author: talli...@mitre.org):
This is caused by a modification of unpack200's Archive class. In 
commons-compress 1.25.0, the inputstream was wrapped as a 
CloseShieldInputStream and then not closed. Starting in 1.26.0, there's code 
that unwraps FilterInputStreams to get down to the source stream. This means 
that this now defeats CloseShieldInputStream, and the underlying stream is 
closed.

See: 
https://github.com/apache/commons-compress/blob/68cd2e7fb488b4ad8a9fdc81cae97ae6e8248ea5/src/main/java/org/apache/commons/compress/harmony/unpack200/Pack200UnpackerAdapter.java#L66

This only causes problems when an unpack200 file is embedded in another file 
with an ArchiveInputStream, which is why it is happening so rarely in our 
corpus.

That said, this is less than ideal.

We can probably work around this by writing our own CloseShieldInputStream that 
doesn't extend FilterInputStream. 

> Regression in pack200 parsing in commons-compress
> -------------------------------------------------
>
>                 Key: TIKA-4221
>                 URL: https://issues.apache.org/jira/browse/TIKA-4221
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> There's a regression in pack200 that leads to the InputStream being closed 
> even if wrapped in a CloseShieldInputStream.
> This was the original signal that something was wrong, but the real problem 
> is in pack200, not xz.
> We noticed ~10 xz files with fewer attachments in the recent regression tests 
> in prep for the 2.9.2 release. This is 10 out of ~4500. So, it's a problem, 
> but not a blocker (IMHO).
> The stacktrace from {{https://corpora.tika.apache.org/base/docs/commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH}} looks like this:
> 3: X-TIKA:EXCEPTION:embedded_exception : org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.DefaultParser@56a4479a
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>       at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
>       at org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
>       at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
>       at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
>       at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:229)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>       at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
>       at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:446)
>       at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:436)
>       at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:424)
>       at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:418)
>       at org.apache.tika.parser.AutoDetectParserTest.oneOff(AutoDetectParserTest.java:563)
> ...
> Caused by: org.tukaani.xz.XZIOException: Stream closed
>       at org.tukaani.xz.SingleXZInputStream.available(Unknown Source)
>       at org.apache.commons.compress.compressors.xz.XZCompressorInputStream.available(XZCompressorInputStream.java:115)
>       at java.io.FilterInputStream.available(FilterInputStream.java:168)
>       at org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>       at java.io.BufferedInputStream.available(BufferedInputStream.java:410)
>       at java.io.FilterInputStream.available(FilterInputStream.java:168)
>       at org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>       at java.io.FilterInputStream.available(FilterInputStream.java:168)
>       at org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>       at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.skipRecordPadding(TarArchiveInputStream.java:800)
>       at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:412)
>       at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:389)
>       at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:49)
>       at org.apache.tika.parser.pkg.PackageParser.parseEntries(PackageParser.java:389)
>       at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:329)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>       ... 85 more



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
