PeterAlfredLee commented on pull request #356:
URL: https://github.com/apache/tika/pull/356#issuecomment-698075634


   > In PackageParser if the ArchiveInputStream is a ZipArchiveInputStream, 
iterate through all the zip entries without trying to parse/read them. This 
will make it so we do not write any of the zip contents to the ContentHandler 
which will avoid writing any duplicate content to the ContentHandler if we end 
up having a data descriptor exception and attempt to scan again with the data 
descriptor support enabled.
   
   -1. I don't like this idea. Iterating through the zip without reading 
would not save any time compared with reading all entries: the stream still 
extracts and reads each entry's data (see 
`ZipArchiveInputStream.drainCurrentEntryData`) even though you never call 
`ZipArchiveInputStream.read`.
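
   To illustrate the point, here is a minimal sketch. It uses the JDK's 
`java.util.zip.ZipInputStream` as a stand-in rather than Commons Compress, on 
the assumption that it behaves analogously when an entry is skipped. The class 
and method names (`SkipStillReads`, `CountingInputStream`, `buildZip`, 
`bytesConsumedWithoutRead`) are hypothetical. It counts how many bytes are 
pulled from the underlying stream when entries are only iterated and never read.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not part of Tika or Commons Compress.
class SkipStillReads {

    /** Counts every byte pulled from the wrapped stream. */
    static final class CountingInputStream extends FilterInputStream {
        long count;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) {
                count++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                count += n;
            }
            return n;
        }

        @Override
        public long skip(long n) throws IOException {
            long skipped = super.skip(n);
            count += skipped;
            return skipped;
        }
    }

    /** Builds an in-memory zip with two entries of pseudo-random (incompressible) data. */
    static byte[] buildZip(int payloadSize) throws IOException {
        byte[] payload = new byte[payloadSize];
        new Random(42).nextBytes(payload);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : new String[] {"a.bin", "b.bin"}) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(payload);
                zip.closeEntry();
            }
        }
        return out.toByteArray();
    }

    /** Iterates all entries without calling read() and reports the bytes consumed anyway. */
    static long bytesConsumedWithoutRead(byte[] zipBytes) throws IOException {
        CountingInputStream counter =
                new CountingInputStream(new ByteArrayInputStream(zipBytes));
        try (ZipInputStream zip = new ZipInputStream(counter)) {
            while (zip.getNextEntry() != null) {
                // Intentionally no read(): moving to the next entry drains the current one.
            }
        }
        return counter.count;
    }
}
```

Calling `bytesConsumedWithoutRead(buildZip(100_000))` should report a count on 
the order of the archive size, not just the size of the entry headers.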
   
   Zip archives can be pretty huge. Iterating over them twice may be 
time-consuming. By comparison, finding a way to reset the `ContentHandler` 
would be easier.
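
   One possible shape for such a reset, as a rough sketch only: a hypothetical 
`ResettableHandler` (not an existing Tika class) that buffers text events and 
forwards them to the real `ContentHandler` only on `commit()`, so a failed 
first pass can be discarded with `reset()`. It handles `characters` events 
only; element events are left out for brevity, and buffering in memory has its 
own cost for large archives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: holds text back until commit().
class ResettableHandler extends DefaultHandler {
    private final ContentHandler delegate;
    private final List<char[]> pending = new ArrayList<>();

    ResettableHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the range, since SAX producers may reuse the buffer.
        pending.add(Arrays.copyOfRange(ch, start, start + length));
    }

    /** Drops everything buffered so far, e.g. after a data descriptor exception. */
    public void reset() {
        pending.clear();
    }

    /** Forwards the buffered text to the real handler once parsing has succeeded. */
    public void commit() throws SAXException {
        for (char[] chunk : pending) {
            delegate.characters(chunk, 0, chunk.length);
        }
        pending.clear();
    }
}
```

With this, the parser would write to the wrapper during the first attempt, 
call `reset()` if that attempt fails, and call `commit()` only after a 
successful pass.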


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org