tbentleypfpt commented on pull request #356: URL: https://github.com/apache/tika/pull/356#issuecomment-698517123
> > In PackageParser, if the ArchiveInputStream is a ZipArchiveInputStream, iterate through all the zip entries without trying to parse or read them. That way we do not write any of the zip contents to the ContentHandler, which avoids writing duplicate content to it if we hit a data descriptor exception and scan again with data descriptor support enabled.
>
> -1. I don't like this idea. Iterating through the zip without reading would not save time compared with reading all entries: we would always extract and read the entry data (see also `ZipArchiveInputStream.drainCurrentEntryData`) even though you didn't call `ZipArchiveInputStream.read`.
>
> Zip archives can be pretty huge, and iterating over them twice may be time-consuming. Compared with this, finding a way to reset the `ContentHandler` would be easier.

I agree. I have restored PackageParser so that it no longer performs the initial iteration through the zip, and added code that throws and handles a TikaException when a zip entry cannot be read because a data descriptor is present. This way we do not write to the content handler for that entry until we can actually read it.

I marked two of the tests as ignored for the moment, until we can get mark(int limit) and reset() to reset the stream to the right place so we can retry reading the zip starting from the entry with the data descriptor. When I call mark before reset (before retrying with a new ZipArchiveInputStream), I get an exception with the message "Unexpected record signature: 0X73696854".

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
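
A possible reading of the "Unexpected record signature" value, offered as a sketch rather than a diagnosis: zip record signatures are stored little-endian (a local file header starts with 0x04034b50, the bytes "PK\3\4"), so decoding 0X73696854 the same way shows which four bytes the retry stream was positioned at. The class `MarkResetSketch` and its methods are hypothetical helpers written for this note, not part of Tika or commons-compress, and the second method only illustrates the general `mark(int)`/`reset()` contract on a plain `BufferedInputStream`, under the assumption that the mark has to be set before the bytes that need replaying are consumed.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class for illustration only.
class MarkResetSketch {

    // Turn a reported 32-bit signature back into the four bytes that were
    // read from the stream (little-endian) and show them as ASCII.
    static String signatureAsAscii(int signature) {
        byte[] raw = ByteBuffer.allocate(4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt(signature)
                .array();
        return new String(raw, StandardCharsets.US_ASCII);
    }

    // Read exactly `count` bytes or fail.
    static byte[] readExactly(InputStream in, int count) throws IOException {
        byte[] buf = new byte[count];
        int off = 0;
        while (off < count) {
            int n = in.read(buf, off, count - off);
            if (n < 0) {
                throw new IOException("unexpected end of stream");
            }
            off += n;
        }
        return buf;
    }

    // Mark, read `count` bytes, reset, and read them again.
    // The mark is set BEFORE the first read, and the read limit covers
    // everything consumed, so reset() returns to the marked position.
    static boolean rereadMatches(InputStream in, int count) {
        try {
            if (!in.markSupported()) {
                throw new IOException("stream does not support mark/reset");
            }
            in.mark(count);
            byte[] first = readExactly(in, count);
            in.reset();
            byte[] second = readExactly(in, count);
            return Arrays.equals(first, second);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(signatureAsAscii(0x73696854)); // prints "This"
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(
                "This is entry data".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(rereadMatches(in, 4)); // prints "true"
    }
}
```

The decoded bytes spell "This", which looks like the start of an entry's text content rather than a record header. That would be consistent with the new ZipArchiveInputStream starting partway through an entry's data instead of at its local file header, though the trace alone does not confirm where the stream was marked.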