tbentleypfpt edited a comment on pull request #356: URL: https://github.com/apache/tika/pull/356#issuecomment-698613886
> > finding out a way to reset the ContentHandler would be easier.
>
> We were thinking about adding a resettable content handler to Tika 2.0, but it is really tricky and adds some complexity. If you can find a solution, great, but I'd be happy enough throwing the exception and then putting it on the client code to rerun the parse with PKGParser configured to allow for data descriptors.

Having the client code handle the exception and re-run the parse seems like the better option at this point.

My initial thought is to add a `ZipArchiveInputStreamFactory` to commons-compress, with setters for the various `ZipArchiveInputStream` parameters and a method that creates a `ZipArchiveInputStream` from them. When the data descriptor exception is thrown and caught, the client creates a `ZipArchiveInputStreamFactory` with the data descriptor feature enabled and sets it in the `ParseContext`. Then in `PackageParser`, when we go to create an `ArchiveInputStream`: if the archive type is zip and a `ZipArchiveInputStreamFactory` is defined in the `ParseContext`, use it to generate the `ZipArchiveInputStream`; otherwise use the `ArchiveStreamFactory` as before. Something like this:

```java
// In commons-compress, in the same package as ZipArchiveInputStream
// (which also makes the package-private ZipEncodingHelper.UTF8 accessible).
package org.apache.commons.compress.archivers.zip;

import java.io.InputStream;

import static org.apache.commons.compress.archivers.zip.ZipEncodingHelper.UTF8;

public class ZipArchiveInputStreamFactory {

    private String encoding = UTF8;
    private boolean allowStoredEntriesWithDataDescriptor = false;
    private boolean useUnicodeExtraFields = true;
    private boolean skipSplitSig = false;

    public ZipArchiveInputStreamFactory() {}

    public void setEncoding(String encoding) {
        this.encoding = encoding;
    }

    public void setAllowStoredEntriesWithDataDescriptor(boolean allowStoredEntriesWithDataDescriptor) {
        this.allowStoredEntriesWithDataDescriptor = allowStoredEntriesWithDataDescriptor;
    }

    public void setUseUnicodeExtraFields(boolean useUnicodeExtraFields) {
        this.useUnicodeExtraFields = useUnicodeExtraFields;
    }

    public void setSkipSplitSig(boolean skipSplitSig) {
        this.skipSplitSig = skipSplitSig;
    }

    public ZipArchiveInputStream createZipArchiveInputStream(InputStream stream) {
        return new ZipArchiveInputStream(stream, encoding, useUnicodeExtraFields,
                allowStoredEntriesWithDataDescriptor, skipSplitSig);
    }
}
```

```java
// PackageParser.java
...
try {
    ArchiveStreamFactory factory =
            context.get(ArchiveStreamFactory.class, new ArchiveStreamFactory());
    // At the end we want to close the archive stream to release
    // any associated resources, but the underlying document stream
    // should not be closed
    CloseShieldInputStream csis = new CloseShieldInputStream(stream);
    ZipArchiveInputStreamFactory zipArchiveInputStreamFactory =
            context.get(ZipArchiveInputStreamFactory.class);
    if (zipArchiveInputStreamFactory != null
            && ArchiveStreamFactory.ZIP.equalsIgnoreCase(ArchiveStreamFactory.detect(csis))) {
        ais = zipArchiveInputStreamFactory.createZipArchiveInputStream(csis);
    } else {
        ais = factory.createArchiveInputStream(csis);
    }
}
```
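For completeness, here is a rough sketch (not part of this PR) of what the client-side catch-and-retry could look like. It assumes the hypothetical `ZipArchiveInputStreamFactory` above, and that the first parse failure surfaces either as commons-compress's `UnsupportedZipFeatureException` or as a `TikaException` wrapping it; the exact exception type PackageParser ends up throwing may differ.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.commons.compress.archivers.zip.UnsupportedZipFeatureException;
// Hypothetical class from the sketch above; not in commons-compress yet.
import org.apache.commons.compress.archivers.zip.ZipArchiveInputStreamFactory;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class DataDescriptorRetryExample {

    public static String parse(Path zipFile) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();

        // First attempt: default configuration, stored entries with
        // data descriptors disallowed.
        try (InputStream stream = Files.newInputStream(zipFile)) {
            BodyContentHandler handler = new BodyContentHandler(-1);
            parser.parse(stream, handler, new Metadata(), new ParseContext());
            return handler.toString();
        } catch (IOException | TikaException e) {
            // Retry only if the failure was the data descriptor restriction;
            // depending on wrapping it may be the exception itself or its cause.
            if (!(e instanceof UnsupportedZipFeatureException)
                    && !(e.getCause() instanceof UnsupportedZipFeatureException)) {
                throw e;
            }
        }

        // Second attempt: enable data descriptor support via the factory
        // and make it available to PackageParser through the ParseContext.
        ZipArchiveInputStreamFactory zipFactory = new ZipArchiveInputStreamFactory();
        zipFactory.setAllowStoredEntriesWithDataDescriptor(true);
        ParseContext context = new ParseContext();
        context.set(ZipArchiveInputStreamFactory.class, zipFactory);

        try (InputStream stream = Files.newInputStream(zipFile)) {
            BodyContentHandler handler = new BodyContentHandler(-1);
            parser.parse(stream, handler, new Metadata(), context);
            return handler.toString();
        }
    }
}
```

Note that the stream has to be re-opened for the second attempt, since the first parse consumes it; that's the cost of the catch-and-retry approach compared to resetting the ContentHandler in place.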