PeterAlfredLee commented on pull request #356: URL: https://github.com/apache/tika/pull/356#issuecomment-698075634
> In PackageParser if the ArchiveInputStream is a ZipArchiveInputStream, iterate through all the zip entries without trying to parse/read them. This will make it so we do not write any of the zip contents to the ContentHandler which will avoid writing any duplicate content to the ContentHandler if we end up having a data descriptor exception and attempt to scan again with the data descriptor support enabled.

-1. I don't like this idea. Iterating through the zip without reading anything would not save time compared with reading all the entries: the data of each entry is always extracted and read (see `ZipArchiveInputStream.drainCurrentEntryData`), even though you didn't call `ZipArchiveInputStream.read`.

Zip archives can be pretty huge, so iterating over them twice may be time-consuming. Compared with this, finding a way to reset the `ContentHandler` would be easier.
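The cost argument in the reply can be illustrated with a small, self-contained sketch. It uses the JDK's `java.util.zip.ZipInputStream` as a stand-in, not commons-compress's `ZipArchiveInputStream`, on the assumption that the two behave similarly here: a streaming zip reader has to consume an entry's data to find the next local header. The class `ZipIterationCost`, its helper methods, and the entry names are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; uses java.util.zip, not commons-compress.
class ZipIterationCost {

    /** Counts how many bytes are pulled from the wrapped stream. */
    static final class CountingInputStream extends FilterInputStream {
        long count;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) {
                count++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                count += n;
            }
            return n;
        }

        @Override
        public long skip(long n) throws IOException {
            long skipped = super.skip(n);
            count += skipped;
            return skipped;
        }
    }

    /** Builds an in-memory zip whose entries hold poorly compressible random bytes. */
    static byte[] buildZip(int entries, int bytesPerEntry) throws IOException {
        Random random = new Random(42);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (int i = 0; i < entries; i++) {
                byte[] payload = new byte[bytesPerEntry];
                random.nextBytes(payload);
                zip.putNextEntry(new ZipEntry("entry-" + i + ".bin"));
                zip.write(payload);
                zip.closeEntry();
            }
        }
        return out.toByteArray();
    }

    /** Walks all entries and reports how many raw bytes the zip stream consumed. */
    static long bytesConsumed(byte[] zipBytes, boolean readEntries) throws IOException {
        CountingInputStream counter = new CountingInputStream(new ByteArrayInputStream(zipBytes));
        try (ZipInputStream zip = new ZipInputStream(counter)) {
            byte[] buf = new byte[8192];
            while (zip.getNextEntry() != null) {
                if (readEntries) {
                    while (zip.read(buf) != -1) {
                        // discard the decompressed data
                    }
                }
                // When readEntries is false, the next getNextEntry() call still
                // has to drain this entry before it can find the next header.
            }
        }
        return counter.count;
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = buildZip(4, 50_000);
        System.out.println("iterate only : " + bytesConsumed(zip, false) + " bytes consumed");
        System.out.println("iterate+read : " + bytesConsumed(zip, true) + " bytes consumed");
    }
}
```

In this sketch, iterating without calling `read` still pulls the entries' compressed data through the stream, so a "skip everything" first pass costs about as much I/O and inflation work as a full read.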