[jira] [Comment Edited] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive

Peter Lee (Jira) Mon, 06 Jul 2020 00:44:06 -0700


    [ 
https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151815#comment-17151815
 ]


Peter Lee edited comment on COMPRESS-539 at 7/6/20, 7:43 AM:
-------------------------------------------------------------

Just have read COMPRESS-449. So this is about the performance. Will revert the 
IOUtils.skip soon.

I'm interested about in which scenario the InputStream.skip would achieve a 
much better performance than the InputStream.Read? From my view this is mostly 
happened in InputStream that supports InputStream.seek - the skip could seek 
instead of reading to skip bytes.

 

_Maybe the only solution for streams that do not override the skip method would 
be to change the implementation in the InputStream class to use a fix buffer 
instead always allocate a new one. But there could be a chance that this is 
expected by the Java library developers._

[~rschimpf] -seems this's a better solution - unfortunately this could only 
conducted by JDK developers.-

Update : I made a twice think about this and believe this may not be a good 
enouth solution - there're so many input streams that are inherited from the 
default InputStream. Making the skip buffer a static one may help in your case 
but may also lead to some unnecessary memory usage. I would be pretty cautious 
about the static memory malloc if I was the JDK maintainer - it's pretty hard 
to gc them.

Maybe wrapping the InputStream with SkipShieldingInputStream in Commons 
Compress is a better choice here.


was (Author: peterlee):
Just have read COMPRESS-449. So this is about the performance. Will revert the 
IOUtils.skip soon.

I'm interested about in which scenario the InputStream.skip would achieve a 
much better performance than the InputStream.Read? From my view this is mostly 
happened in InputStream that supports InputStream.seek - the skip could seek 
instead of reading to skip bytes.

 

_Maybe the only solution for streams that do not override the skip method would 
be to change the implementation in the InputStream class to use a fix buffer 
instead always allocate a new one. But there could be a chance that this is 
expected by the Java library developers._

[~rschimpf] seems this's a better solution - unfortunately this could only 
conducted by JDK developers.

> TarArchiveInputStream allocates a lot of memory when iterating through an 
> archive
> ---------------------------------------------------------------------------------
>
>                 Key: COMPRESS-539
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-539
>             Project: Commons Compress
>          Issue Type: Bug
>    Affects Versions: 1.20
>            Reporter: Robin Schimpf
>            Assignee: Peter Lee
>            Priority: Major
>         Attachments: Don't_call_InputStream#skip.patch, 
> Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, 
> image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png, 
> image-2020-07-05-22-10-07-402.png, image-2020-07-05-22-11-25-526.png, 
> image-2020-07-05-22-32-15-131.png, image-2020-07-05-22-32-31-511.png
>
>
>  I iterated through the linux source tar and noticed some unneeded 
> allocations happen without extracting any data.
> Reproducing code
> {code:java}
> File tarFile = new File("linux-5.7.1.tar");
>     try (TarArchiveInputStream in = new 
> TarArchiveInputStream(Files.newInputStream(tarFile.toPath()))) {
>         TarArchiveEntry entry;
>         while ((entry = in.getNextTarEntry()) != null) {
>         }
>     }
> {code}
> The measurement was done on Java 11.0.7 with the Java Flight Recorder. 
> Options used: 
> -XX:StartFlightRecording=settings=profile,filename=allocations.jfr
> Baseline with the current master implementation:
>  Estimated TLAB allocation: 293MiB
> !image-2020-06-21-10-58-07-917.png!
> 1. IOUtils.skip -> input.skip(numToSkip)
>  This delegates in my test scenario to the InputStream.skip implementation 
> which allocates a new byte[] for every invocation. By simply commenting out 
> the while loop which calls the skip method the estimated TLAB allocation 
> drops to 164MiB (-129MiB).
>  !image-2020-06-21-10-58-43-255.png! 
>  Commenting out the skip call does not seem to be the best solution but it 
> was quick for me to see how much memory can be saved. Also no unit tests 
> where failing for me.
> 2. TarArchiveInputStream.readRecord
>  For every read of the record a new byte[] is created. Since the record size 
> does not change the byte[] can be reused and created when instantiating the 
> TarStream. This optimization is already present in the 
> TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB 
> allocations further to 128MiB (-36MiB).
>  !image-2020-06-21-10-59-10-825.png!
> I attached the patches I used so the results can be verified.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Comment Edited] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive

Reply via email to