[ https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151815#comment-17151815 ]
Peter Lee edited comment on COMPRESS-539 at 7/6/20, 7:43 AM: ------------------------------------------------------------- Just have read COMPRESS-449. So this is about the performance. Will revert the IOUtils.skip soon. I'm interested about in which scenario the InputStream.skip would achieve a much better performance than the InputStream.Read? From my view this is mostly happened in InputStream that supports InputStream.seek - the skip could seek instead of reading to skip bytes. _Maybe the only solution for streams that do not override the skip method would be to change the implementation in the InputStream class to use a fix buffer instead always allocate a new one. But there could be a chance that this is expected by the Java library developers._ [~rschimpf] -seems this's a better solution - unfortunately this could only conducted by JDK developers.- Update : I made a twice think about this and believe this may not be a good enouth solution - there're so many input streams that are inherited from the default InputStream. Making the skip buffer a static one may help in your case but may also lead to some unnecessary memory usage. I would be pretty cautious about the static memory malloc if I was the JDK maintainer - it's pretty hard to gc them. Maybe wrapping the InputStream with SkipShieldingInputStream in Commons Compress is a better choice here. was (Author: peterlee): Just have read COMPRESS-449. So this is about the performance. Will revert the IOUtils.skip soon. I'm interested about in which scenario the InputStream.skip would achieve a much better performance than the InputStream.Read? From my view this is mostly happened in InputStream that supports InputStream.seek - the skip could seek instead of reading to skip bytes. _Maybe the only solution for streams that do not override the skip method would be to change the implementation in the InputStream class to use a fix buffer instead always allocate a new one. But there could be a chance that this is expected by the Java library developers._ [~rschimpf] seems this's a better solution - unfortunately this could only conducted by JDK developers. > TarArchiveInputStream allocates a lot of memory when iterating through an > archive > --------------------------------------------------------------------------------- > > Key: COMPRESS-539 > URL: https://issues.apache.org/jira/browse/COMPRESS-539 > Project: Commons Compress > Issue Type: Bug > Affects Versions: 1.20 > Reporter: Robin Schimpf > Assignee: Peter Lee > Priority: Major > Attachments: Don't_call_InputStream#skip.patch, > Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, > image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png, > image-2020-07-05-22-10-07-402.png, image-2020-07-05-22-11-25-526.png, > image-2020-07-05-22-32-15-131.png, image-2020-07-05-22-32-31-511.png > > > I iterated through the linux source tar and noticed some unneeded > allocations happen without extracting any data. > Reproducing code > {code:java} > File tarFile = new File("linux-5.7.1.tar"); > try (TarArchiveInputStream in = new > TarArchiveInputStream(Files.newInputStream(tarFile.toPath()))) { > TarArchiveEntry entry; > while ((entry = in.getNextTarEntry()) != null) { > } > } > {code} > The measurement was done on Java 11.0.7 with the Java Flight Recorder. > Options used: > -XX:StartFlightRecording=settings=profile,filename=allocations.jfr > Baseline with the current master implementation: > Estimated TLAB allocation: 293MiB > !image-2020-06-21-10-58-07-917.png! > 1. IOUtils.skip -> input.skip(numToSkip) > This delegates in my test scenario to the InputStream.skip implementation > which allocates a new byte[] for every invocation. By simply commenting out > the while loop which calls the skip method the estimated TLAB allocation > drops to 164MiB (-129MiB). > !image-2020-06-21-10-58-43-255.png! > Commenting out the skip call does not seem to be the best solution but it > was quick for me to see how much memory can be saved. Also no unit tests > where failing for me. > 2. TarArchiveInputStream.readRecord > For every read of the record a new byte[] is created. Since the record size > does not change the byte[] can be reused and created when instantiating the > TarStream. This optimization is already present in the > TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB > allocations further to 128MiB (-36MiB). > !image-2020-06-21-10-59-10-825.png! > I attached the patches I used so the results can be verified. -- This message was sent by Atlassian Jira (v8.3.4#803005)