[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive
[ https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17297123#comment-17297123 ] Peter Lee commented on COMPRESS-539: I think we can close this issue now. > TarArchiveInputStream allocates a lot of memory when iterating through an > archive > - > > Key: COMPRESS-539 > URL: https://issues.apache.org/jira/browse/COMPRESS-539 > Project: Commons Compress > Issue Type: Bug >Affects Versions: 1.20 >Reporter: Robin Schimpf >Assignee: Peter Lee >Priority: Major > Attachments: Don't_call_InputStream#skip.patch, > Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, > image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png, > image-2020-07-05-22-10-07-402.png, image-2020-07-05-22-11-25-526.png, > image-2020-07-05-22-32-15-131.png, image-2020-07-05-22-32-31-511.png > > > I iterated through the linux source tar and noticed some unneeded > allocations happen without extracting any data. > Reproducing code > {code:java} > File tarFile = new File("linux-5.7.1.tar"); > try (TarArchiveInputStream in = new > TarArchiveInputStream(Files.newInputStream(tarFile.toPath( { > TarArchiveEntry entry; > while ((entry = in.getNextTarEntry()) != null) { > } > } > {code} > The measurement was done on Java 11.0.7 with the Java Flight Recorder. > Options used: > -XX:StartFlightRecording=settings=profile,filename=allocations.jfr > Baseline with the current master implementation: > Estimated TLAB allocation: 293MiB > !image-2020-06-21-10-58-07-917.png! > 1. IOUtils.skip -> input.skip(numToSkip) > This delegates in my test scenario to the InputStream.skip implementation > which allocates a new byte[] for every invocation. By simply commenting out > the while loop which calls the skip method the estimated TLAB allocation > drops to 164MiB (-129MiB). > !image-2020-06-21-10-58-43-255.png! > Commenting out the skip call does not seem to be the best solution but it > was quick for me to see how much memory can be saved. Also no unit tests > where failing for me. > 2. TarArchiveInputStream.readRecord > For every read of the record a new byte[] is created. Since the record size > does not change the byte[] can be reused and created when instantiating the > TarStream. This optimization is already present in the > TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB > allocations further to 128MiB (-36MiB). > !image-2020-06-21-10-59-10-825.png! > I attached the patches I used so the results can be verified. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive
[ https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17296643#comment-17296643 ] Stefan Bodewig commented on COMPRESS-539: - [~peterlee] do you still intend to change anything or can we close this? > TarArchiveInputStream allocates a lot of memory when iterating through an > archive > - > > Key: COMPRESS-539 > URL: https://issues.apache.org/jira/browse/COMPRESS-539 > Project: Commons Compress > Issue Type: Bug >Affects Versions: 1.20 >Reporter: Robin Schimpf >Assignee: Peter Lee >Priority: Major > Attachments: Don't_call_InputStream#skip.patch, > Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, > image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png, > image-2020-07-05-22-10-07-402.png, image-2020-07-05-22-11-25-526.png, > image-2020-07-05-22-32-15-131.png, image-2020-07-05-22-32-31-511.png > > > I iterated through the linux source tar and noticed some unneeded > allocations happen without extracting any data. > Reproducing code > {code:java} > File tarFile = new File("linux-5.7.1.tar"); > try (TarArchiveInputStream in = new > TarArchiveInputStream(Files.newInputStream(tarFile.toPath( { > TarArchiveEntry entry; > while ((entry = in.getNextTarEntry()) != null) { > } > } > {code} > The measurement was done on Java 11.0.7 with the Java Flight Recorder. > Options used: > -XX:StartFlightRecording=settings=profile,filename=allocations.jfr > Baseline with the current master implementation: > Estimated TLAB allocation: 293MiB > !image-2020-06-21-10-58-07-917.png! > 1. IOUtils.skip -> input.skip(numToSkip) > This delegates in my test scenario to the InputStream.skip implementation > which allocates a new byte[] for every invocation. By simply commenting out > the while loop which calls the skip method the estimated TLAB allocation > drops to 164MiB (-129MiB). > !image-2020-06-21-10-58-43-255.png! > Commenting out the skip call does not seem to be the best solution but it > was quick for me to see how much memory can be saved. Also no unit tests > where failing for me. > 2. TarArchiveInputStream.readRecord > For every read of the record a new byte[] is created. Since the record size > does not change the byte[] can be reused and created when instantiating the > TarStream. This optimization is already present in the > TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB > allocations further to 128MiB (-36MiB). > !image-2020-06-21-10-59-10-825.png! > I attached the patches I used so the results can be verified. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive
[ https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152491#comment-17152491 ] Robin Schimpf commented on COMPRESS-539: {{I guess the entries are pretty small (a few kilobytes)}} Your assumption is correct. I used the linux source tarball and there were no differences between {{read}} and {{skip}}. > TarArchiveInputStream allocates a lot of memory when iterating through an > archive > - > > Key: COMPRESS-539 > URL: https://issues.apache.org/jira/browse/COMPRESS-539 > Project: Commons Compress > Issue Type: Bug >Affects Versions: 1.20 >Reporter: Robin Schimpf >Assignee: Peter Lee >Priority: Major > Attachments: Don't_call_InputStream#skip.patch, > Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, > image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png, > image-2020-07-05-22-10-07-402.png, image-2020-07-05-22-11-25-526.png, > image-2020-07-05-22-32-15-131.png, image-2020-07-05-22-32-31-511.png > > > I iterated through the linux source tar and noticed some unneeded > allocations happen without extracting any data. > Reproducing code > {code:java} > File tarFile = new File("linux-5.7.1.tar"); > try (TarArchiveInputStream in = new > TarArchiveInputStream(Files.newInputStream(tarFile.toPath( { > TarArchiveEntry entry; > while ((entry = in.getNextTarEntry()) != null) { > } > } > {code} > The measurement was done on Java 11.0.7 with the Java Flight Recorder. > Options used: > -XX:StartFlightRecording=settings=profile,filename=allocations.jfr > Baseline with the current master implementation: > Estimated TLAB allocation: 293MiB > !image-2020-06-21-10-58-07-917.png! > 1. IOUtils.skip -> input.skip(numToSkip) > This delegates in my test scenario to the InputStream.skip implementation > which allocates a new byte[] for every invocation. By simply commenting out > the while loop which calls the skip method the estimated TLAB allocation > drops to 164MiB (-129MiB). > !image-2020-06-21-10-58-43-255.png! > Commenting out the skip call does not seem to be the best solution but it > was quick for me to see how much memory can be saved. Also no unit tests > where failing for me. > 2. TarArchiveInputStream.readRecord > For every read of the record a new byte[] is created. Since the record size > does not change the byte[] can be reused and created when instantiating the > TarStream. This optimization is already present in the > TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB > allocations further to 128MiB (-36MiB). > !image-2020-06-21-10-59-10-825.png! > I attached the patches I used so the results can be verified. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive
[ https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152153#comment-17152153 ] Stefan Bodewig commented on COMPRESS-539: - I'm afraid we've never run any measurements but relied on gut feeling. The assumption here is that {{skip}} on a seekable stream should be faster if skipping more than just a few bytes. I'm not sure how you have constructed the example [~rschimpf] but I guess the entries are pretty small (a few kilobytes). Our (well, my) expectation is that {{read}} in such a case might return cached data as we've just read a tar header from the stream, but if the entry in question was bigger - a few megabytes, maybe - then reading it should be measurably slower than seeking ahead for the same amount of bytes. > TarArchiveInputStream allocates a lot of memory when iterating through an > archive > - > > Key: COMPRESS-539 > URL: https://issues.apache.org/jira/browse/COMPRESS-539 > Project: Commons Compress > Issue Type: Bug >Affects Versions: 1.20 >Reporter: Robin Schimpf >Assignee: Peter Lee >Priority: Major > Attachments: Don't_call_InputStream#skip.patch, > Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, > image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png, > image-2020-07-05-22-10-07-402.png, image-2020-07-05-22-11-25-526.png, > image-2020-07-05-22-32-15-131.png, image-2020-07-05-22-32-31-511.png > > > I iterated through the linux source tar and noticed some unneeded > allocations happen without extracting any data. > Reproducing code > {code:java} > File tarFile = new File("linux-5.7.1.tar"); > try (TarArchiveInputStream in = new > TarArchiveInputStream(Files.newInputStream(tarFile.toPath( { > TarArchiveEntry entry; > while ((entry = in.getNextTarEntry()) != null) { > } > } > {code} > The measurement was done on Java 11.0.7 with the Java Flight Recorder. > Options used: > -XX:StartFlightRecording=settings=profile,filename=allocations.jfr > Baseline with the current master implementation: > Estimated TLAB allocation: 293MiB > !image-2020-06-21-10-58-07-917.png! > 1. IOUtils.skip -> input.skip(numToSkip) > This delegates in my test scenario to the InputStream.skip implementation > which allocates a new byte[] for every invocation. By simply commenting out > the while loop which calls the skip method the estimated TLAB allocation > drops to 164MiB (-129MiB). > !image-2020-06-21-10-58-43-255.png! > Commenting out the skip call does not seem to be the best solution but it > was quick for me to see how much memory can be saved. Also no unit tests > where failing for me. > 2. TarArchiveInputStream.readRecord > For every read of the record a new byte[] is created. Since the record size > does not change the byte[] can be reused and created when instantiating the > TarStream. This optimization is already present in the > TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB > allocations further to 128MiB (-36MiB). > !image-2020-06-21-10-59-10-825.png! > I attached the patches I used so the results can be verified. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive
[ https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151815#comment-17151815 ] Peter Lee commented on COMPRESS-539: Just have read COMPRESS-449. So this is about the performance. Will revert the IOUtils.skip soon. I'm interested about in which scenario the InputStream.skip would achieve a much better performance than the InputStream.Read? From my view this is mostly happened in InputStream that supports InputStream.seek - the skip could seek instead of reading to skip bytes. > TarArchiveInputStream allocates a lot of memory when iterating through an > archive > - > > Key: COMPRESS-539 > URL: https://issues.apache.org/jira/browse/COMPRESS-539 > Project: Commons Compress > Issue Type: Bug >Affects Versions: 1.20 >Reporter: Robin Schimpf >Assignee: Peter Lee >Priority: Major > Attachments: Don't_call_InputStream#skip.patch, > Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, > image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png, > image-2020-07-05-22-10-07-402.png, image-2020-07-05-22-11-25-526.png, > image-2020-07-05-22-32-15-131.png, image-2020-07-05-22-32-31-511.png > > > I iterated through the linux source tar and noticed some unneeded > allocations happen without extracting any data. > Reproducing code > {code:java} > File tarFile = new File("linux-5.7.1.tar"); > try (TarArchiveInputStream in = new > TarArchiveInputStream(Files.newInputStream(tarFile.toPath( { > TarArchiveEntry entry; > while ((entry = in.getNextTarEntry()) != null) { > } > } > {code} > The measurement was done on Java 11.0.7 with the Java Flight Recorder. > Options used: > -XX:StartFlightRecording=settings=profile,filename=allocations.jfr > Baseline with the current master implementation: > Estimated TLAB allocation: 293MiB > !image-2020-06-21-10-58-07-917.png! > 1. IOUtils.skip -> input.skip(numToSkip) > This delegates in my test scenario to the InputStream.skip implementation > which allocates a new byte[] for every invocation. By simply commenting out > the while loop which calls the skip method the estimated TLAB allocation > drops to 164MiB (-129MiB). > !image-2020-06-21-10-58-43-255.png! > Commenting out the skip call does not seem to be the best solution but it > was quick for me to see how much memory can be saved. Also no unit tests > where failing for me. > 2. TarArchiveInputStream.readRecord > For every read of the record a new byte[] is created. Since the record size > does not change the byte[] can be reused and created when instantiating the > TarStream. This optimization is already present in the > TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB > allocations further to 128MiB (-36MiB). > !image-2020-06-21-10-59-10-825.png! > I attached the patches I used so the results can be verified. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive
[ https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151748#comment-17151748 ] Peter Lee commented on COMPRESS-539: Ah, I'm sorry that I didn't see your reply on 4, Jul. Will look into COMPRESS-449. _> I'm sorry we crossed with our actions mid-flight_ [~peterlee] _. I'm afraid I have to ask you to revert your change to {{IOUtils.skip}}._ Sure. Will look into COMPRESS-449 and revert this afterwards. > TarArchiveInputStream allocates a lot of memory when iterating through an > archive > - > > Key: COMPRESS-539 > URL: https://issues.apache.org/jira/browse/COMPRESS-539 > Project: Commons Compress > Issue Type: Bug >Affects Versions: 1.20 >Reporter: Robin Schimpf >Assignee: Peter Lee >Priority: Major > Attachments: Don't_call_InputStream#skip.patch, > Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, > image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png, > image-2020-07-05-22-10-07-402.png, image-2020-07-05-22-11-25-526.png, > image-2020-07-05-22-32-15-131.png, image-2020-07-05-22-32-31-511.png > > > I iterated through the linux source tar and noticed some unneeded > allocations happen without extracting any data. > Reproducing code > {code:java} > File tarFile = new File("linux-5.7.1.tar"); > try (TarArchiveInputStream in = new > TarArchiveInputStream(Files.newInputStream(tarFile.toPath( { > TarArchiveEntry entry; > while ((entry = in.getNextTarEntry()) != null) { > } > } > {code} > The measurement was done on Java 11.0.7 with the Java Flight Recorder. > Options used: > -XX:StartFlightRecording=settings=profile,filename=allocations.jfr > Baseline with the current master implementation: > Estimated TLAB allocation: 293MiB > !image-2020-06-21-10-58-07-917.png! > 1. IOUtils.skip -> input.skip(numToSkip) > This delegates in my test scenario to the InputStream.skip implementation > which allocates a new byte[] for every invocation. By simply commenting out > the while loop which calls the skip method the estimated TLAB allocation > drops to 164MiB (-129MiB). > !image-2020-06-21-10-58-43-255.png! > Commenting out the skip call does not seem to be the best solution but it > was quick for me to see how much memory can be saved. Also no unit tests > where failing for me. > 2. TarArchiveInputStream.readRecord > For every read of the record a new byte[] is created. Since the record size > does not change the byte[] can be reused and created when instantiating the > TarStream. This optimization is already present in the > TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB > allocations further to 128MiB (-36MiB). > !image-2020-06-21-10-59-10-825.png! > I attached the patches I used so the results can be verified. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive
[ https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151659#comment-17151659 ] Robin Schimpf commented on COMPRESS-539: I've rerun my test with the current master branch. This is the baseline of the master before the patch !image-2020-07-05-22-32-15-131.png! After the patch estimated allocation is down to 127MiB which is about the same as with my original patches !image-2020-07-05-22-32-31-511.png! [~bodewig] I also tested SkipShieldingInputStream which has the same positive effect on the memory allocation. Since this all was an arificial test to reduce memory allocation for the "default" usage of the api I don't include my measurements but it is good to know about the existence of this. Sadly the linked issue had no reference to benchmarks or cases where calling the skip method has better performance as I noted no difference in my tests (remember that those where artificial ones). But I can think about some special skip implementations of streams which have to do much less work than the read method. Maybe the only solution for streams that do not override the skip method would be to change the implementation in the InputStream class to use a fix buffer instead always allocate a new one. But there could be a chance that this is expected by the Java library developers. To sum up in my opinion changes to the amount of memory IOUtils.skip allocates would be appreciated but not really neccessary. I just wanted to raise the attention and see if you had any ideas if it could be improved (as I had not really one). > TarArchiveInputStream allocates a lot of memory when iterating through an > archive > - > > Key: COMPRESS-539 > URL: https://issues.apache.org/jira/browse/COMPRESS-539 > Project: Commons Compress > Issue Type: Bug >Affects Versions: 1.20 >Reporter: Robin Schimpf >Assignee: Peter Lee >Priority: Major > Attachments: Don't_call_InputStream#skip.patch, > Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, > image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png, > image-2020-07-05-22-10-07-402.png, image-2020-07-05-22-11-25-526.png, > image-2020-07-05-22-32-15-131.png, image-2020-07-05-22-32-31-511.png > > > I iterated through the linux source tar and noticed some unneeded > allocations happen without extracting any data. > Reproducing code > {code:java} > File tarFile = new File("linux-5.7.1.tar"); > try (TarArchiveInputStream in = new > TarArchiveInputStream(Files.newInputStream(tarFile.toPath( { > TarArchiveEntry entry; > while ((entry = in.getNextTarEntry()) != null) { > } > } > {code} > The measurement was done on Java 11.0.7 with the Java Flight Recorder. > Options used: > -XX:StartFlightRecording=settings=profile,filename=allocations.jfr > Baseline with the current master implementation: > Estimated TLAB allocation: 293MiB > !image-2020-06-21-10-58-07-917.png! > 1. IOUtils.skip -> input.skip(numToSkip) > This delegates in my test scenario to the InputStream.skip implementation > which allocates a new byte[] for every invocation. By simply commenting out > the while loop which calls the skip method the estimated TLAB allocation > drops to 164MiB (-129MiB). > !image-2020-06-21-10-58-43-255.png! > Commenting out the skip call does not seem to be the best solution but it > was quick for me to see how much memory can be saved. Also no unit tests > where failing for me. > 2. TarArchiveInputStream.readRecord > For every read of the record a new byte[] is created. Since the record size > does not change the byte[] can be reused and created when instantiating the > TarStream. This optimization is already present in the > TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB > allocations further to 128MiB (-36MiB). > !image-2020-06-21-10-59-10-825.png! > I attached the patches I used so the results can be verified. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive
[ https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151287#comment-17151287 ] Stefan Bodewig commented on COMPRESS-539: - I'm sorry we crossed with our actions mid-flight [~peterlee] . I'm afraid I have to ask you to revert your change to {{IOUtils.skip}}. > TarArchiveInputStream allocates a lot of memory when iterating through an > archive > - > > Key: COMPRESS-539 > URL: https://issues.apache.org/jira/browse/COMPRESS-539 > Project: Commons Compress > Issue Type: Bug >Affects Versions: 1.20 >Reporter: Robin Schimpf >Assignee: Peter Lee >Priority: Major > Attachments: Don't_call_InputStream#skip.patch, > Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, > image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png > > > I iterated through the linux source tar and noticed some unneeded > allocations happen without extracting any data. > Reproducing code > {code:java} > File tarFile = new File("linux-5.7.1.tar"); > try (TarArchiveInputStream in = new > TarArchiveInputStream(Files.newInputStream(tarFile.toPath( { > TarArchiveEntry entry; > while ((entry = in.getNextTarEntry()) != null) { > } > } > {code} > The measurement was done on Java 11.0.7 with the Java Flight Recorder. > Options used: > -XX:StartFlightRecording=settings=profile,filename=allocations.jfr > Baseline with the current master implementation: > Estimated TLAB allocation: 293MiB > !image-2020-06-21-10-58-07-917.png! > 1. IOUtils.skip -> input.skip(numToSkip) > This delegates in my test scenario to the InputStream.skip implementation > which allocates a new byte[] for every invocation. By simply commenting out > the while loop which calls the skip method the estimated TLAB allocation > drops to 164MiB (-129MiB). > !image-2020-06-21-10-58-43-255.png! > Commenting out the skip call does not seem to be the best solution but it > was quick for me to see how much memory can be saved. Also no unit tests > where failing for me. > 2. TarArchiveInputStream.readRecord > For every read of the record a new byte[] is created. Since the record size > does not change the byte[] can be reused and created when instantiating the > TarStream. This optimization is already present in the > TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB > allocations further to 128MiB (-36MiB). > !image-2020-06-21-10-59-10-825.png! > I attached the patches I used so the results can be verified. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive
[ https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151285#comment-17151285 ] Stefan Bodewig commented on COMPRESS-539: - COMPRESS-449 rather than 443, sorry. > TarArchiveInputStream allocates a lot of memory when iterating through an > archive > - > > Key: COMPRESS-539 > URL: https://issues.apache.org/jira/browse/COMPRESS-539 > Project: Commons Compress > Issue Type: Bug >Affects Versions: 1.20 >Reporter: Robin Schimpf >Assignee: Peter Lee >Priority: Major > Attachments: Don't_call_InputStream#skip.patch, > Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, > image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png > > > I iterated through the linux source tar and noticed some unneeded > allocations happen without extracting any data. > Reproducing code > {code:java} > File tarFile = new File("linux-5.7.1.tar"); > try (TarArchiveInputStream in = new > TarArchiveInputStream(Files.newInputStream(tarFile.toPath( { > TarArchiveEntry entry; > while ((entry = in.getNextTarEntry()) != null) { > } > } > {code} > The measurement was done on Java 11.0.7 with the Java Flight Recorder. > Options used: > -XX:StartFlightRecording=settings=profile,filename=allocations.jfr > Baseline with the current master implementation: > Estimated TLAB allocation: 293MiB > !image-2020-06-21-10-58-07-917.png! > 1. IOUtils.skip -> input.skip(numToSkip) > This delegates in my test scenario to the InputStream.skip implementation > which allocates a new byte[] for every invocation. By simply commenting out > the while loop which calls the skip method the estimated TLAB allocation > drops to 164MiB (-129MiB). > !image-2020-06-21-10-58-43-255.png! > Commenting out the skip call does not seem to be the best solution but it > was quick for me to see how much memory can be saved. Also no unit tests > where failing for me. > 2. TarArchiveInputStream.readRecord > For every read of the record a new byte[] is created. Since the record size > does not change the byte[] can be reused and created when instantiating the > TarStream. This optimization is already present in the > TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB > allocations further to 128MiB (-36MiB). > !image-2020-06-21-10-59-10-825.png! > I attached the patches I used so the results can be verified. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive
[ https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151283#comment-17151283 ] Stefan Bodewig commented on COMPRESS-539: - not calling skip may reduce the amount of memory consumed but will slow things down considerably as skip is usually far more efficient than read. In your case I'd recommend wrapping your input stream into a {{SkipShieldingInputStream}} instead, which will have the same effect as commenting out the skip calls. Re-using the buffer in {{TarArchiveInputStream}} sounds right. No idea why this is different from TarArchiveOutputStream where it seems we've used a shared buffer since forever. [~peterlee] the team has had this discussion around COMPRESS-443 - a while before you joined - and back then explicitly decided we wanted to keep using {{skip}} for performance reasons. > TarArchiveInputStream allocates a lot of memory when iterating through an > archive > - > > Key: COMPRESS-539 > URL: https://issues.apache.org/jira/browse/COMPRESS-539 > Project: Commons Compress > Issue Type: Bug >Affects Versions: 1.20 >Reporter: Robin Schimpf >Assignee: Peter Lee >Priority: Major > Attachments: Don't_call_InputStream#skip.patch, > Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, > image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png > > > I iterated through the linux source tar and noticed some unneeded > allocations happen without extracting any data. > Reproducing code > {code:java} > File tarFile = new File("linux-5.7.1.tar"); > try (TarArchiveInputStream in = new > TarArchiveInputStream(Files.newInputStream(tarFile.toPath( { > TarArchiveEntry entry; > while ((entry = in.getNextTarEntry()) != null) { > } > } > {code} > The measurement was done on Java 11.0.7 with the Java Flight Recorder. > Options used: > -XX:StartFlightRecording=settings=profile,filename=allocations.jfr > Baseline with the current master implementation: > Estimated TLAB allocation: 293MiB > !image-2020-06-21-10-58-07-917.png! > 1. IOUtils.skip -> input.skip(numToSkip) > This delegates in my test scenario to the InputStream.skip implementation > which allocates a new byte[] for every invocation. By simply commenting out > the while loop which calls the skip method the estimated TLAB allocation > drops to 164MiB (-129MiB). > !image-2020-06-21-10-58-43-255.png! > Commenting out the skip call does not seem to be the best solution but it > was quick for me to see how much memory can be saved. Also no unit tests > where failing for me. > 2. TarArchiveInputStream.readRecord > For every read of the record a new byte[] is created. Since the record size > does not change the byte[] can be reused and created when instantiating the > TarStream. This optimization is already present in the > TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB > allocations further to 128MiB (-36MiB). > !image-2020-06-21-10-59-10-825.png! > I attached the patches I used so the results can be verified. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive
[ https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151284#comment-17151284 ] Peter Lee commented on COMPRESS-539: Hey [~rschimpf] . Please have a look at commit 101137bfa0f3ae709c2a2771368b190ceb899ea0 if it helps. > TarArchiveInputStream allocates a lot of memory when iterating through an > archive > - > > Key: COMPRESS-539 > URL: https://issues.apache.org/jira/browse/COMPRESS-539 > Project: Commons Compress > Issue Type: Bug >Affects Versions: 1.20 >Reporter: Robin Schimpf >Assignee: Peter Lee >Priority: Major > Attachments: Don't_call_InputStream#skip.patch, > Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, > image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png > > > I iterated through the linux source tar and noticed some unneeded > allocations happen without extracting any data. > Reproducing code > {code:java} > File tarFile = new File("linux-5.7.1.tar"); > try (TarArchiveInputStream in = new > TarArchiveInputStream(Files.newInputStream(tarFile.toPath( { > TarArchiveEntry entry; > while ((entry = in.getNextTarEntry()) != null) { > } > } > {code} > The measurement was done on Java 11.0.7 with the Java Flight Recorder. > Options used: > -XX:StartFlightRecording=settings=profile,filename=allocations.jfr > Baseline with the current master implementation: > Estimated TLAB allocation: 293MiB > !image-2020-06-21-10-58-07-917.png! > 1. IOUtils.skip -> input.skip(numToSkip) > This delegates in my test scenario to the InputStream.skip implementation > which allocates a new byte[] for every invocation. By simply commenting out > the while loop which calls the skip method the estimated TLAB allocation > drops to 164MiB (-129MiB). > !image-2020-06-21-10-58-43-255.png! > Commenting out the skip call does not seem to be the best solution but it > was quick for me to see how much memory can be saved. Also no unit tests > where failing for me. > 2. TarArchiveInputStream.readRecord > For every read of the record a new byte[] is created. Since the record size > does not change the byte[] can be reused and created when instantiating the > TarStream. This optimization is already present in the > TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB > allocations further to 128MiB (-36MiB). > !image-2020-06-21-10-59-10-825.png! > I attached the patches I used so the results can be verified. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive
[ https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151216#comment-17151216 ] Peter Lee commented on COMPRESS-539: IIRC the IOUtils.skip was originally copied from Apache Commons IO(not a maven dependency to keep Compress as simple as we want it). I just checked the IOUtils in Apache Commons IO and found out the skip method looks like this: {code:java} /** * Skips characters from an input character stream. * This implementation guarantees that it will read as many characters * as possible before giving up; this may not always be the case for * skip() implementations in subclasses of {@link Reader}. * * Note that the implementation uses {@link Reader#read(char[], int, int)} rather * than delegating to {@link Reader#skip(long)}. * This means that the method may be considerably less efficient than using the actual skip implementation, * this is done to guarantee that the correct number of characters are skipped. * * * @param input character stream to skip * @param toSkip number of characters to skip. * @return number of characters actually skipped. * @throws IOException if there is a problem reading the file * @throws IllegalArgumentException if toSkip is negative * @see Reader#skip(long) * @see https://issues.apache.org/jira/browse/IO-203";>IO-203 - Add skipFully() method for InputStreams * @since 2.0 */ public static long skip(final Reader input, final long toSkip) throws IOException { if (toSkip < 0) { throw new IllegalArgumentException("Skip count must be non-negative, actual: " + toSkip); } /* * N.B. no need to synchronize this because: - we don't care if the buffer is created multiple times (the data * is ignored) - we always use the same size buffer, so if it it is recreated it will still be OK (if the buffer * size were variable, we would need to synch. to ensure some other thread did not create a smaller one) */ if (SKIP_CHAR_BUFFER == null) { SKIP_CHAR_BUFFER = new char[SKIP_BUFFER_SIZE]; } long remain = toSkip; while (remain > 0) { // See https://issues.apache.org/jira/browse/IO-203 for why we use read() rather than delegating to skip() final long n = input.read(SKIP_CHAR_BUFFER, 0, (int) Math.min(remain, SKIP_BUFFER_SIZE)); if (n < 0) { // EOF break; } remain -= n; } return toSkip - remain; } {code} Not calling inputstream.skip as you did in your patch. I believe this implemention would get the same memory benefit as your patch did. BTW seems the IOUtils methods in Commons Compress are somehow out of date - maybe we can update them using reference from Commons IO. > TarArchiveInputStream allocates a lot of memory when iterating through an > archive > - > > Key: COMPRESS-539 > URL: https://issues.apache.org/jira/browse/COMPRESS-539 > Project: Commons Compress > Issue Type: Bug >Affects Versions: 1.20 >Reporter: Robin Schimpf >Priority: Major > Attachments: Don't_call_InputStream#skip.patch, > Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, > image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png > > > I iterated through the linux source tar and noticed some unneeded > allocations happen without extracting any data. > Reproducing code > {code:java} > File tarFile = new File("linux-5.7.1.tar"); > try (TarArchiveInputStream in = new > TarArchiveInputStream(Files.newInputStream(tarFile.toPath( { > TarArchiveEntry entry; > while ((entry = in.getNextTarEntry()) != null) { > } > } > {code} > The measurement was done on Java 11.0.7 with the Java Flight Recorder. > Options used: > -XX:StartFlightRecording=settings=profile,filename=allocations.jfr > Baseline with the current master implementation: > Estimated TLAB allocation: 293MiB > !image-2020-06-21-10-58-07-917.png! > 1. IOUtils.skip -> input.skip(numToSkip) > This delegates in my test scenario to the InputStream.skip implementation > which allocates a new byte[] for every invocation. By simply commenting out > the while loop which calls the skip method the estimated TLAB allocation > drops to 164MiB (-129MiB). > !image-2020-06-21-10-58-43-255.png! > Commenting out the skip call does not seem to be the best solution but it > was quick for me to see how much memory can be saved. Also no unit tests > where failing for me. > 2. TarArchiveInputStream.readRecord > For every read of the record a new byte[] is created. Since the record size > does not change the byte[] can be reused and created when instantiating the > TarStream. This optimization is already present in the > TarArchiveOutputStream. Reusi
[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive
[ https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151207#comment-17151207 ] Peter Lee commented on COMPRESS-539: Looks great. Would you create a github PR so the tests could be applied? > TarArchiveInputStream allocates a lot of memory when iterating through an > archive > - > > Key: COMPRESS-539 > URL: https://issues.apache.org/jira/browse/COMPRESS-539 > Project: Commons Compress > Issue Type: Bug >Affects Versions: 1.20 >Reporter: Robin Schimpf >Priority: Major > Attachments: Don't_call_InputStream#skip.patch, > Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, > image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png > > > I iterated through the linux source tar and noticed some unneeded > allocations happen without extracting any data. > Reproducing code > {code:java} > File tarFile = new File("linux-5.7.1.tar"); > try (TarArchiveInputStream in = new > TarArchiveInputStream(Files.newInputStream(tarFile.toPath( { > TarArchiveEntry entry; > while ((entry = in.getNextTarEntry()) != null) { > } > } > {code} > The measurement was done on Java 11.0.7 with the Java Flight Recorder. > Options used: > -XX:StartFlightRecording=settings=profile,filename=allocations.jfr > Baseline with the current master implementation: > Estimated TLAB allocation: 293MiB > !image-2020-06-21-10-58-07-917.png! > 1. IOUtils.skip -> input.skip(numToSkip) > This delegates in my test scenario to the InputStream.skip implementation > which allocates a new byte[] for every invocation. By simply commenting out > the while loop which calls the skip method the estimated TLAB allocation > drops to 164MiB (-129MiB). > !image-2020-06-21-10-58-43-255.png! > Commenting out the skip call does not seem to be the best solution but it > was quick for me to see how much memory can be saved. Also no unit tests > where failing for me. > 2. TarArchiveInputStream.readRecord > For every read of the record a new byte[] is created. Since the record size > does not change the byte[] can be reused and created when instantiating the > TarStream. This optimization is already present in the > TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB > allocations further to 128MiB (-36MiB). > !image-2020-06-21-10-59-10-825.png! > I attached the patches I used so the results can be verified. -- This message was sent by Atlassian Jira (v8.3.4#803005)