Robin Schimpf created COMPRESS-539:
--------------------------------------

             Summary: TarArchiveInputStream allocates a lot of memory when 
iterating through an archive
                 Key: COMPRESS-539
                 URL: https://issues.apache.org/jira/browse/COMPRESS-539
             Project: Commons Compress
          Issue Type: Bug
    Affects Versions: 1.20
            Reporter: Robin Schimpf
         Attachments: Don't_call_InputStream#skip.patch, 
Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, 
image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png

 I iterated through the linux source tar and noticed some unneeded allocations 
happen without extracting any data.

Reproducing code
{code:java}
File tarFile = new File("linux-5.7.1.tar");
    try (TarArchiveInputStream in = new 
TarArchiveInputStream(Files.newInputStream(tarFile.toPath()))) {
        TarArchiveEntry entry;
        while ((entry = in.getNextTarEntry()) != null) {
        }
    }
{code}
The measurement was done on Java 11.0.7 with the Java Flight Recorder. Options 
used: -XX:StartFlightRecording=settings=profile,filename=allocations.jfr

Baseline with the current master implementation:
 Estimated TLAB allocation: 293MiB

!image-2020-06-21-10-58-07-917.png!

1. IOUtils.skip -> input.skip(numToSkip)
 This delegates in my test scenario to the InputStream.skip implementation 
which allocates a new byte[] for every invocation. By simply commenting out the 
while loop which calls the skip method the estimated TLAB allocation drops to 
164MiB (-129MiB).
 !image-2020-06-21-10-58-43-255.png! 
 Commenting out the skip call does not seem to be the best solution but it was 
quick for me to see how much memory can be saved. Also no unit tests where 
failing for me.

2. TarArchiveInputStream.readRecord
 For every read of the record a new byte[] is created. Since the record size 
does not change the byte[] can be reused and created when instantiating the 
TarStream. This optimization is already present in the TarArchiveOutputStream. 
Reusing the buffer reduces the estimated TLAB allocations further to 128MiB 
(-36MiB).
 !image-2020-06-21-10-59-10-825.png!

I attached the patches I used so the results can be verified.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to