[
https://issues.apache.org/jira/browse/TIKA-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18066329#comment-18066329
]
Tim Allison commented on TIKA-4563:
-----------------------------------
We're still missing zip content in S4P6, B7TH, and QMXI. These are all truncated
zips. The cause is that we switched from streaming to random-access reading of
zips via the central directory. This is more robust for non-truncated zips, but
it handles these truncated files worse than streaming did.
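
To illustrate the tradeoff, here is a minimal sketch that uses the JDK's
java.util.zip as a stand-in for Commons Compress (an assumption; this is not
the code path Tika uses). It builds a one-entry zip in memory, cuts it off where
the central directory would start, and reads it back with a streaming reader
that only walks the local file headers. The class and method names are
hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; JDK classes stand in for Commons Compress.
final class TruncatedZipDemo {

    /** Builds a one-entry zip, then drops the central directory and EOCD. */
    static byte[] truncatedZip() {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("hello".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] full = bos.toByteArray();
        // No archive comment was written, so the EOCD is the last 22 bytes.
        // The little-endian central directory offset sits at EOCD + 16.
        int eocd = full.length - 22;
        int cdOffset = ByteBuffer.wrap(full)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt(eocd + 16);
        return Arrays.copyOf(full, cdOffset);
    }

    /** Streaming read: walks local headers front to back, ignoring the central directory. */
    static List<String> streamNames(byte[] zip) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }
}
```

Calling TruncatedZipDemo.streamNames(TruncatedZipDemo.truncatedZip()) should
still list the entry, because the streaming reader never needs the central
directory. A reader that starts from the central directory has nothing valid to
start from in such a file.
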
When loading a zip as a file, Commons Compress reads backward from the end of
the file looking for an EOCD (end of central directory) record, which holds the
pointer to the central directory. These three truncated files have neither an
EOCD nor a central directory, but Commons Compress finds the EOCD signature
bytes inside a compressed stream. Even though the values there don't point to a
legit entry, Commons Compress appears to read a single entry without throwing
an exception. We need to follow up on this issue, but I think we should let it
be for 3.3.0. I think we gain much more by switching to reading zips via the
central directory.
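
For the follow-up, one possible sanity check is sketched below. It is not what
Commons Compress does today, and the class and method names are hypothetical.
It scans backward for the EOCD signature and accepts a candidate only if its
central directory offset lands on a central file header signature. It assumes
the whole file fits in a byte array and ignores Zip64 and multi-disk archives.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper; a sketch of EOCD validation, not Commons Compress code.
final class EocdCheck {
    static final int EOCD_SIG = 0x06054b50;  // bytes "PK\005\006" read little-endian
    static final int CFH_SIG = 0x02014b50;   // bytes "PK\001\002" read little-endian
    static final int EOCD_MIN_LEN = 22;      // EOCD length with an empty comment
    static final int CFH_MIN_LEN = 46;       // central file header with empty name

    /**
     * Returns the offset of the last EOCD candidate whose central directory
     * pointer looks plausible, or -1 if no candidate passes the checks.
     */
    static int findValidEocd(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        for (int pos = data.length - EOCD_MIN_LEN; pos >= 0; pos--) {
            if (buf.getInt(pos) != EOCD_SIG) {
                continue;
            }
            int entries = buf.getShort(pos + 10) & 0xFFFF;       // total entries
            long cdSize = buf.getInt(pos + 12) & 0xFFFFFFFFL;    // directory size
            long cdOffset = buf.getInt(pos + 16) & 0xFFFFFFFFL;  // directory offset

            // The central directory has to end at or before the EOCD itself.
            if (cdOffset + cdSize > pos) {
                continue;
            }
            if (entries == 0) {
                // An empty archive is fine only if the directory is empty too.
                if (cdSize == 0) {
                    return pos;
                }
                continue;
            }
            // With entries present, the offset must land on a central file header.
            if (cdSize >= CFH_MIN_LEN && buf.getInt((int) cdOffset) == CFH_SIG) {
                return pos;
            }
        }
        return -1;
    }
}
```

With a check like this, EOCD-looking bytes inside compressed data would be
rejected unless they also happen to point at a central file header, and the
caller could then fall back to streaming.
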
Once I push the recent fixes (move to poi-ooxml-full and add file names even if
there's a stream exception), should I roll 3.3.0? Do we need another full
regression run? And are we ok with my recommendation to live, for now, with
suboptimal handling of truncated zips that appear to have EOCD markers in their
compressed data?
> Prep for 3.3.0 release
> ----------------------
>
> Key: TIKA-4563
> URL: https://issues.apache.org/jira/browse/TIKA-4563
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: kio5_perldoc.mo, tika-3.3.0-20260110.tgz,
> tika-3.3.0-reports.tgz, tika-3.3.0.tgz, tika-3.3.0c.tgz
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)