tbentleypfpt commented on pull request #356:
URL: https://github.com/apache/tika/pull/356#issuecomment-698517123


   > > In PackageParser if the ArchiveInputStream is a ZipArchiveInputStream, 
iterate through all the zip entries without trying to parse/read them. This 
will make it so we do not write any of the zip contents to the ContentHandler 
which will avoid writing any duplicate content to the ContentHandler if we end 
up having a data descriptor exception and attempt to scan again with the data 
descriptor support enabled.
   > 
   > -1. I don't like this idea. Iterating through the zip without any reading 
would not save more time than reading all entries - we would always extract and 
read the data of the entry (see also 
`ZipArchiveInputStream.drainCurrentEntryData` about this) even though you 
didn't call `ZipArchiveInputStream.read`.
   > 
   > Zip archives could be pretty huge. Iterating them twice may be a 
time-consuming job. Compared with this, finding a way to reset the 
`ContentHandler` would be easier.
   
   I agree. I have restored PackageParser so that it no longer performs the 
initial iteration through the zip, and it now throws and handles a 
TikaException when a zip entry cannot be read because a data descriptor is 
present. As a result, we do not write to the content handler for that entry 
until we can actually read it.
   
   I have marked two of the tests as ignored for now, until we can get 
mark(int limit) and reset() to rewind the stream to the right place so that we 
can retry reading the zip starting from the entry with the data descriptor.
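
For illustration only, here is a minimal sketch of the mark/reset pattern described above, using plain `java.io` streams rather than the actual PackageParser code. The class and method names (`MarkResetSketch`, `firstAndRetry`) and the "header"/"entry" byte layout are hypothetical. The point it shows is that `mark` has to be called at the position we want to return to (the start of the entry), with a read limit that covers everything read before `reset()`.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class MarkResetSketch {

    // Reads up to n bytes from the stream and returns them as ASCII text.
    static String readAscii(InputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        int total = 0;
        while (total < n) {
            int r = in.read(buf, total, n - total);
            if (r < 0) {
                break; // end of stream
            }
            total += r;
        }
        return new String(buf, 0, total, StandardCharsets.US_ASCII);
    }

    // Hypothetical demo: skip a "header", mark the start of an "entry",
    // read the entry once, rewind to the mark, and read it again.
    static String[] firstAndRetry(byte[] data, int headerLen, int entryLen) {
        try {
            InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
            readAscii(in, headerLen);   // consume everything before the entry
            in.mark(entryLen);          // remember the entry start
            String first = readAscii(in, entryLen);
            in.reset();                 // back to the marked position
            String retry = readAscii(in, entryLen);
            return new String[] {first, retry};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "HEADentry-bytes".getBytes(StandardCharsets.US_ASCII);
        String[] result = firstAndRetry(data, 4, 11);
        System.out.println(result[0] + " / " + result[1]);
    }
}
```

If the mark is set anywhere other than an entry boundary, a parser created after `reset()` starts in the middle of a record, which would be consistent with a signature error on retry.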
   
   When I call mark before reset (before retrying with a new 
ZipArchiveInputStream), I get an exception with the message "Unexpected record 
signature: 0X73696854".
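
As a debugging aid, the sketch below decodes that value, assuming the number in the message is the 32-bit signature field read in little-endian order (the byte order ZIP uses for multi-byte fields). The class and method names (`SignatureSketch`, `asAscii`) are hypothetical. Under that assumption the four bytes spell the ASCII text "This", which suggests the retried stream began inside some entry's content rather than at a record header such as the local file header (`PK\3\4`, 0x04034b50).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class SignatureSketch {

    // Returns the four little-endian bytes of a 32-bit signature as ASCII text.
    static String asAscii(int signature) {
        byte[] bytes = ByteBuffer.allocate(4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt(signature)
                .array();
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Value taken from the exception message above.
        System.out.println(asAscii(0x73696854));
        // For comparison: the ZIP local file header signature starts with "PK".
        System.out.println(asAscii(0x04034b50).substring(0, 2));
    }
}
```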



