[ 
https://issues.apache.org/jira/browse/TIKA-3196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202175#comment-17202175
 ] 

ASF GitHub Bot commented on TIKA-3196:
--------------------------------------

PeterAlfredLee opened a new pull request #364:
URL: https://github.com/apache/tika/pull/364


   When reading a zip archive entry that uses STORED together with a Data 
Descriptor, an UnsupportedZipFeatureException is thrown. We can save the number 
of entries we have already read, reset the stream, and open the 
ZipArchiveInputStream again with Data Descriptor allowed. Then we can finish 
reading the rest of the entries.
   
   1. I set a limit of 100MB using the variable `MARK_LIMIT`, which is passed to 
`stream.mark`.
   2. `entryCnt` stores the number of entries we have read so far.
   3. I modified `parseEntry` slightly: nothing is written to `xhtml` if a zip 
entry uses `STORED` and `Data Descriptor` at the same time. Instead an 
exception is thrown, and the stream is `reset` and read a second time.
   4. I have generated a zip archive for testing. It contains 5 entries; the 
2nd and 4th entries use `STORED` with `Data Descriptor`. This zip archive is 
parsed successfully.
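
The mark, count, reset, and skip flow described in the list above can be sketched roughly as follows. This is not the code from the pull request: to stay self-contained it uses a toy line-based "archive" in place of `ZipArchiveInputStream`, where a line starting with `!` stands in for a `STORED` entry with a Data Descriptor. `UnsupportedFeatureException`, `nextEntry`, `parse`, and `parseString` are hypothetical names chosen for illustration; only `MARK_LIMIT` and `entryCnt` come from the description.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class MarkResetRetrySketch {
    // 100MB mark limit, as in the description above.
    static final int MARK_LIMIT = 100 * 1024 * 1024;

    // Hypothetical stand-in for UnsupportedZipFeatureException.
    static final class UnsupportedFeatureException extends IOException {
        UnsupportedFeatureException(String message) {
            super(message);
        }
    }

    // Toy entry reader: one line is one entry. In strict mode, an entry
    // starting with '!' is rejected, mimicking STORED + Data Descriptor.
    static String nextEntry(BufferedReader reader, boolean lenient) throws IOException {
        String line = reader.readLine();
        if (line != null && line.startsWith("!") && !lenient) {
            throw new UnsupportedFeatureException("unsupported entry: " + line);
        }
        return line;
    }

    static List<String> parse(InputStream raw) throws IOException {
        // mark/reset needs a stream that supports it.
        InputStream stream = raw.markSupported() ? raw : new BufferedInputStream(raw);
        stream.mark(MARK_LIMIT);

        List<String> handled = new ArrayList<>();
        int entryCnt = 0; // entries fully handled during the strict pass
        try {
            BufferedReader strict =
                    new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
            String entry;
            while ((entry = nextEntry(strict, false)) != null) {
                handled.add(entry);
                entryCnt++;
            }
        } catch (UnsupportedFeatureException e) {
            // Rewind to the mark and reopen in lenient mode.
            stream.reset();
            BufferedReader lenient =
                    new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
            // Skip entries already handled so they are not emitted twice.
            for (int i = 0; i < entryCnt; i++) {
                nextEntry(lenient, true);
            }
            String entry;
            while ((entry = nextEntry(lenient, true)) != null) {
                handled.add(entry);
            }
        }
        return handled;
    }

    // Convenience wrapper for in-memory input.
    static List<String> parseString(String content) {
        try {
            return parse(new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Five entries; the 2nd and 4th trigger the retry path.
        System.out.println(parseString("a\n!b\nc\n!d\ne\n"));
    }
}
```

In the real parser, the lenient pass would presumably correspond to reopening `ZipArchiveInputStream` with `allowStoredEntriesWithDataDescriptor` enabled. If more than `MARK_LIMIT` bytes have been consumed before the failure, `reset` may not be able to rewind, so the retry would fail in that case.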
   
   See also [#356](https://github.com/apache/tika/pull/356) and [Commons 
Compress #137](https://github.com/apache/commons-compress/pull/137) for more 
information.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> PackageParser should attempt to parse entries from zip files with STORED 
> entries with data descriptor
> -----------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3196
>                 URL: https://issues.apache.org/jira/browse/TIKA-3196
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Trevor Bentley
>            Priority: Major
>         Attachments: OOO-107047-0.oxt-145.zip
>
>
> We are currently using Tika for text extraction. Some sites are 
> returning zips that contain STORED entries with data descriptors, which fail to 
> extract because ZipArchiveInputStream (in commons-compress) defaults 
> 'allowStoredEntriesWithDataDescriptor' to false.
> Since ZipArchiveInputStream supports reading zips with data 
> descriptors, we should attempt to read the zip with that feature enabled when 
> we get a data descriptor UnsupportedZipFeatureException.
> Pull Request: 
> [https://github.com/apache/tika/pull/356|https://github.com/apache/tika/pull/355]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
