[ https://issues.apache.org/jira/browse/COMPRESS-623?focusedWorklogId=829840&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-829840 ]

ASF GitHub Bot logged work on COMPRESS-623:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 29/Nov/22 18:12
            Start Date: 29/Nov/22 18:12
    Worklog Time Spent: 10m
      Work Description: garydgregory merged PR #306:
URL: https://github.com/apache/commons-compress/pull/306


Issue Time Tracking
-------------------

    Worklog Id:     (was: 829840)
    Time Spent: 2h 20m  (was: 2h 10m)

> make ZipFile's getRawInputStream usable when local headers are not read
> -----------------------------------------------------------------------
>
>                 Key: COMPRESS-623
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-623
>             Project: Commons Compress
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Priority: Minor
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> I have a somewhat odd use case with gigabytes of ZIP files, each with
> thousands of documents (on comparatively slow, network drives). We need to
> restructure these ZIPs without the need to recompress files.
> The above turns out to work almost perfectly with raw-data copying ZipFile
> offers but empirical tests showed a major slowdown in the initial opening of
> zip files, linked to multiple reads/seeks for local file headers. If an
> option is passed to ignore those headers, raw streams are inaccessible.
> I've taken a look at the code and the code in getRawInputStream could
> basically do the same thing that getInputStream does - lazily load the
> missing offset via getDataOffset(ZipEntry). In fact, getInputStream could
> just call getRawInputStream directly, which avoids some code duplication.
> I see speedups for opening and copying random raw streams in the order of
> 3-4x and all the current tests pass. I filed a PR at github - happy to
> discuss it there.
> [https://github.com/apache/commons-compress/pull/306]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
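
The lazy offset lookup described in the issue can be illustrated with a small, self-contained sketch. This is not the Commons Compress implementation and does not use its API: the class `LazyDataOffsetSketch`, the `Entry` type, and the helpers `rawBytes` and `buildStoredZip` are hypothetical names introduced here, and only the JDK's `java.util.zip` is used to build a test archive. Only the name `getDataOffset` echoes the method mentioned in the issue. The sketch assumes the entry's local header offset and compressed size are already known (as they would be from the central directory) and reads the local file header only when the raw data is first requested.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class LazyDataOffsetSketch {

    /** Size in bytes of the fixed part of a ZIP local file header. */
    static final int LOCAL_HEADER_FIXED_SIZE = 30;
    /** Marker meaning "data offset not resolved yet". */
    static final long UNKNOWN = -1L;

    /** Hypothetical stand-in for an entry known only from the central directory. */
    static final class Entry {
        final long localHeaderOffset;
        final long compressedSize;
        long dataOffset = UNKNOWN;
        int headerReads = 0; // how many times the local header was actually parsed

        Entry(long localHeaderOffset, long compressedSize) {
            this.localHeaderOffset = localHeaderOffset;
            this.compressedSize = compressedSize;
        }
    }

    /** Resolves the data offset on first use and caches it on the entry. */
    static long getDataOffset(byte[] archive, Entry entry) {
        if (entry.dataOffset == UNKNOWN) {
            ByteBuffer buf = ByteBuffer.wrap(archive).order(ByteOrder.LITTLE_ENDIAN);
            int base = (int) entry.localHeaderOffset;
            // File name length and extra field length sit at offsets 26 and 28.
            int nameLength = buf.getShort(base + 26) & 0xFFFF;
            int extraLength = buf.getShort(base + 28) & 0xFFFF;
            entry.dataOffset = entry.localHeaderOffset
                    + LOCAL_HEADER_FIXED_SIZE + nameLength + extraLength;
            entry.headerReads++;
        }
        return entry.dataOffset;
    }

    /** Returns the entry's bytes exactly as stored in the archive, without decompression. */
    static byte[] rawBytes(byte[] archive, Entry entry) {
        int start = (int) getDataOffset(archive, entry);
        return Arrays.copyOfRange(archive, start, start + (int) entry.compressedSize);
    }

    /** Builds a one-entry archive holding a STORED (uncompressed) entry for the demo. */
    static byte[] buildStoredZip(String name, byte[] content) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(content);
        ZipEntry zipEntry = new ZipEntry(name);
        zipEntry.setMethod(ZipEntry.STORED);
        zipEntry.setSize(content.length);
        zipEntry.setCompressedSize(content.length);
        zipEntry.setCrc(crc.getValue());
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(zipEntry);
            zip.write(content);
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] content = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] archive = buildStoredZip("a.txt", content);
        // The first local file header of the archive starts at offset 0.
        Entry entry = new Entry(0, content.length);
        System.out.println(new String(rawBytes(archive, entry), StandardCharsets.UTF_8));
    }
}
```

Opening the archive in this sketch touches no local header at all; the header is parsed once, on the first `rawBytes` call for an entry, and the result is reused afterwards. That deferral is the source of the saving the issue describes for archives on slow network drives, where each extra seek is expensive.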