tbentleypfpt edited a comment on pull request #356:
URL: https://github.com/apache/tika/pull/356#issuecomment-698613886


   > > finding out a way to reset the ContentHandler would be more easier.
   > 
   > We were thinking about adding a resettable content handler to Tika 2.0, 
but it is really tricky and adds some complexity. If you can find a solution, 
great, but I'd be happy enough throwing the exception and then putting it on 
the client code to rerun the parse with PKGParser configured to allow for data 
descriptors.
   
   Having the client code handle the exception and re-run the parse seems 
like the better option at this point. My initial thought is to add a 
ZipArchiveInputStreamFactory to commons-compress, with setters for the various 
ZipArchiveInputStream constructor parameters and a method that creates a 
ZipArchiveInputStream using those parameters.
   
   Then, when the data descriptor exception is thrown and caught, the client can 
create a ZipArchiveInputStreamFactory with the data descriptor feature enabled 
and set it in the ParseContext.
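   
   For the client side, the catch-and-retry could look roughly like the sketch 
below. This is only an illustration: the exception type Tika would surface, the 
way the stream is re-opened, and the `context.set` wiring are all assumptions, 
not settled API.
   
   ```java
   // Hypothetical client-side retry -- exception type and stream re-opening
   // are assumptions for illustration, not settled Tika/commons-compress API.
   ParseContext context = new ParseContext();
   try (InputStream stream = TikaInputStream.get(path)) {
       parser.parse(stream, handler, metadata, context);
   } catch (TikaException e) {
       // Assumed: the data descriptor failure is identifiable from the exception.
       ZipArchiveInputStreamFactory zipFactory = new ZipArchiveInputStreamFactory();
       zipFactory.setAllowStoredEntriesWithDataDescriptor(true);
       context.set(ZipArchiveInputStreamFactory.class, zipFactory);
       // Re-run the parse from a fresh stream with the factory in the context.
       try (InputStream retry = TikaInputStream.get(path)) {
           parser.parse(retry, handler, metadata, context);
       }
   }
   ```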
   
   Then in PackageParser, when we go to create an ArchiveInputStream: if the 
archive type is zip and a ZipArchiveInputStreamFactory is defined in the 
ParseContext, use it to create the ZipArchiveInputStream; otherwise, fall back 
to the ArchiveStreamFactory as before.
   
   Something like this:
   
   ```java
   // in commons-compress
   import java.io.InputStream;
   
   import static org.apache.commons.compress.archivers.zip.ZipEncodingHelper.UTF8;
   
   public class ZipArchiveInputStreamFactory {
       private String encoding = UTF8;
       private boolean allowStoredEntriesWithDataDescriptor = false;
       private boolean useUnicodeExtraFields = true;
       private boolean skipSplitSig = false;
   
       public ZipArchiveInputStreamFactory() {}
   
       public void setEncoding(String encoding) {
           this.encoding = encoding;
       }
   
       public void setAllowStoredEntriesWithDataDescriptor(boolean allowStoredEntriesWithDataDescriptor) {
           this.allowStoredEntriesWithDataDescriptor = allowStoredEntriesWithDataDescriptor;
       }
   
       public void setUseUnicodeExtraFields(boolean useUnicodeExtraFields) {
           this.useUnicodeExtraFields = useUnicodeExtraFields;
       }
   
       public void setSkipSplitSig(boolean skipSplitSig) {
           this.skipSplitSig = skipSplitSig;
       }
   
       public ZipArchiveInputStream createZipArchiveInputStream(InputStream stream) {
           return new ZipArchiveInputStream(stream, encoding, useUnicodeExtraFields,
                   allowStoredEntriesWithDataDescriptor, skipSplitSig);
       }
   }
   ```
   
   ```java
   // PackageParser.java
   ...
   try {
       ArchiveStreamFactory factory =
               context.get(ArchiveStreamFactory.class, new ArchiveStreamFactory());
       // At the end we want to close the archive stream to release
       // any associated resources, but the underlying document stream
       // should not be closed
       CloseShieldInputStream csis = new CloseShieldInputStream(stream);
       ZipArchiveInputStreamFactory zipArchiveInputStreamFactory =
               context.get(ZipArchiveInputStreamFactory.class);
       if (zipArchiveInputStreamFactory != null
               && ArchiveStreamFactory.detect(csis).equalsIgnoreCase(ArchiveStreamFactory.ZIP)) {
           ais = zipArchiveInputStreamFactory.createZipArchiveInputStream(csis);
       } else {
           ais = factory.createArchiveInputStream(csis);
       }
   }
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

