[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive

2021-03-07 Thread Peter Lee (Jira)


[ 
https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17297123#comment-17297123
 ] 

Peter Lee commented on COMPRESS-539:


I think we can close this issue now.

> TarArchiveInputStream allocates a lot of memory when iterating through an 
> archive
> -
>
> Key: COMPRESS-539
> URL: https://issues.apache.org/jira/browse/COMPRESS-539
> Project: Commons Compress
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Robin Schimpf
>Assignee: Peter Lee
>Priority: Major
> Attachments: Don't_call_InputStream#skip.patch, 
> Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, 
> image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png, 
> image-2020-07-05-22-10-07-402.png, image-2020-07-05-22-11-25-526.png, 
> image-2020-07-05-22-32-15-131.png, image-2020-07-05-22-32-31-511.png
>
>
>  I iterated through the linux source tar and noticed some unneeded 
> allocations happen without extracting any data.
> Reproducing code
> {code:java}
> File tarFile = new File("linux-5.7.1.tar");
> try (TarArchiveInputStream in = new 
> TarArchiveInputStream(Files.newInputStream(tarFile.toPath( {
> TarArchiveEntry entry;
> while ((entry = in.getNextTarEntry()) != null) {
> }
> }
> {code}
> The measurement was done on Java 11.0.7 with the Java Flight Recorder. 
> Options used: 
> -XX:StartFlightRecording=settings=profile,filename=allocations.jfr
> Baseline with the current master implementation:
>  Estimated TLAB allocation: 293MiB
> !image-2020-06-21-10-58-07-917.png!
> 1. IOUtils.skip -> input.skip(numToSkip)
>  This delegates in my test scenario to the InputStream.skip implementation 
> which allocates a new byte[] for every invocation. By simply commenting out 
> the while loop which calls the skip method the estimated TLAB allocation 
> drops to 164MiB (-129MiB).
>  !image-2020-06-21-10-58-43-255.png! 
>  Commenting out the skip call does not seem to be the best solution but it 
> was quick for me to see how much memory can be saved. Also no unit tests 
> where failing for me.
> 2. TarArchiveInputStream.readRecord
>  For every read of the record a new byte[] is created. Since the record size 
> does not change the byte[] can be reused and created when instantiating the 
> TarStream. This optimization is already present in the 
> TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB 
> allocations further to 128MiB (-36MiB).
>  !image-2020-06-21-10-59-10-825.png!
> I attached the patches I used so the results can be verified.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive

2021-03-06 Thread Stefan Bodewig (Jira)


[ 
https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17296643#comment-17296643
 ] 

Stefan Bodewig commented on COMPRESS-539:
-

[~peterlee] do you still intend to change anything or can we close this?

> TarArchiveInputStream allocates a lot of memory when iterating through an 
> archive
> -
>
> Key: COMPRESS-539
> URL: https://issues.apache.org/jira/browse/COMPRESS-539
> Project: Commons Compress
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Robin Schimpf
>Assignee: Peter Lee
>Priority: Major
> Attachments: Don't_call_InputStream#skip.patch, 
> Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, 
> image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png, 
> image-2020-07-05-22-10-07-402.png, image-2020-07-05-22-11-25-526.png, 
> image-2020-07-05-22-32-15-131.png, image-2020-07-05-22-32-31-511.png
>
>
>  I iterated through the linux source tar and noticed some unneeded 
> allocations happen without extracting any data.
> Reproducing code
> {code:java}
> File tarFile = new File("linux-5.7.1.tar");
> try (TarArchiveInputStream in = new 
> TarArchiveInputStream(Files.newInputStream(tarFile.toPath( {
> TarArchiveEntry entry;
> while ((entry = in.getNextTarEntry()) != null) {
> }
> }
> {code}
> The measurement was done on Java 11.0.7 with the Java Flight Recorder. 
> Options used: 
> -XX:StartFlightRecording=settings=profile,filename=allocations.jfr
> Baseline with the current master implementation:
>  Estimated TLAB allocation: 293MiB
> !image-2020-06-21-10-58-07-917.png!
> 1. IOUtils.skip -> input.skip(numToSkip)
>  This delegates in my test scenario to the InputStream.skip implementation 
> which allocates a new byte[] for every invocation. By simply commenting out 
> the while loop which calls the skip method the estimated TLAB allocation 
> drops to 164MiB (-129MiB).
>  !image-2020-06-21-10-58-43-255.png! 
>  Commenting out the skip call does not seem to be the best solution but it 
> was quick for me to see how much memory can be saved. Also no unit tests 
> where failing for me.
> 2. TarArchiveInputStream.readRecord
>  For every read of the record a new byte[] is created. Since the record size 
> does not change the byte[] can be reused and created when instantiating the 
> TarStream. This optimization is already present in the 
> TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB 
> allocations further to 128MiB (-36MiB).
>  !image-2020-06-21-10-59-10-825.png!
> I attached the patches I used so the results can be verified.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive

2020-07-06 Thread Robin Schimpf (Jira)


[ 
https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152491#comment-17152491
 ] 

Robin Schimpf commented on COMPRESS-539:


{{I guess the entries are pretty small (a few kilobytes)}}

Your assumption is correct. I used the linux source tarball and there were no 
differences between {{read}} and {{skip}}.

> TarArchiveInputStream allocates a lot of memory when iterating through an 
> archive
> -
>
> Key: COMPRESS-539
> URL: https://issues.apache.org/jira/browse/COMPRESS-539
> Project: Commons Compress
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Robin Schimpf
>Assignee: Peter Lee
>Priority: Major
> Attachments: Don't_call_InputStream#skip.patch, 
> Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, 
> image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png, 
> image-2020-07-05-22-10-07-402.png, image-2020-07-05-22-11-25-526.png, 
> image-2020-07-05-22-32-15-131.png, image-2020-07-05-22-32-31-511.png
>
>
>  I iterated through the linux source tar and noticed some unneeded 
> allocations happen without extracting any data.
> Reproducing code
> {code:java}
> File tarFile = new File("linux-5.7.1.tar");
> try (TarArchiveInputStream in = new 
> TarArchiveInputStream(Files.newInputStream(tarFile.toPath( {
> TarArchiveEntry entry;
> while ((entry = in.getNextTarEntry()) != null) {
> }
> }
> {code}
> The measurement was done on Java 11.0.7 with the Java Flight Recorder. 
> Options used: 
> -XX:StartFlightRecording=settings=profile,filename=allocations.jfr
> Baseline with the current master implementation:
>  Estimated TLAB allocation: 293MiB
> !image-2020-06-21-10-58-07-917.png!
> 1. IOUtils.skip -> input.skip(numToSkip)
>  This delegates in my test scenario to the InputStream.skip implementation 
> which allocates a new byte[] for every invocation. By simply commenting out 
> the while loop which calls the skip method the estimated TLAB allocation 
> drops to 164MiB (-129MiB).
>  !image-2020-06-21-10-58-43-255.png! 
>  Commenting out the skip call does not seem to be the best solution but it 
> was quick for me to see how much memory can be saved. Also no unit tests 
> where failing for me.
> 2. TarArchiveInputStream.readRecord
>  For every read of the record a new byte[] is created. Since the record size 
> does not change the byte[] can be reused and created when instantiating the 
> TarStream. This optimization is already present in the 
> TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB 
> allocations further to 128MiB (-36MiB).
>  !image-2020-06-21-10-59-10-825.png!
> I attached the patches I used so the results can be verified.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive

2020-07-06 Thread Stefan Bodewig (Jira)


[ 
https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152153#comment-17152153
 ] 

Stefan Bodewig commented on COMPRESS-539:
-

I'm afraid we've never run any measurements but relied on gut feeling.

The assumption here is that {{skip}} on a seekable stream should be faster if 
skipping more than just a few bytes. I'm not sure how you have constructed the 
example [~rschimpf] but I guess the entries are pretty small (a few kilobytes). 
Our (well, my) expectation is that {{read}} in such a case might return cached 
data as we've just read a tar header from the stream, but if the entry in 
question was bigger - a few megabytes, maybe - then reading it should be 
measurably slower than seeking ahead for the same amount of bytes.

> TarArchiveInputStream allocates a lot of memory when iterating through an 
> archive
> -
>
> Key: COMPRESS-539
> URL: https://issues.apache.org/jira/browse/COMPRESS-539
> Project: Commons Compress
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Robin Schimpf
>Assignee: Peter Lee
>Priority: Major
> Attachments: Don't_call_InputStream#skip.patch, 
> Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, 
> image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png, 
> image-2020-07-05-22-10-07-402.png, image-2020-07-05-22-11-25-526.png, 
> image-2020-07-05-22-32-15-131.png, image-2020-07-05-22-32-31-511.png
>
>
>  I iterated through the linux source tar and noticed some unneeded 
> allocations happen without extracting any data.
> Reproducing code
> {code:java}
> File tarFile = new File("linux-5.7.1.tar");
> try (TarArchiveInputStream in = new 
> TarArchiveInputStream(Files.newInputStream(tarFile.toPath( {
> TarArchiveEntry entry;
> while ((entry = in.getNextTarEntry()) != null) {
> }
> }
> {code}
> The measurement was done on Java 11.0.7 with the Java Flight Recorder. 
> Options used: 
> -XX:StartFlightRecording=settings=profile,filename=allocations.jfr
> Baseline with the current master implementation:
>  Estimated TLAB allocation: 293MiB
> !image-2020-06-21-10-58-07-917.png!
> 1. IOUtils.skip -> input.skip(numToSkip)
>  This delegates in my test scenario to the InputStream.skip implementation 
> which allocates a new byte[] for every invocation. By simply commenting out 
> the while loop which calls the skip method the estimated TLAB allocation 
> drops to 164MiB (-129MiB).
>  !image-2020-06-21-10-58-43-255.png! 
>  Commenting out the skip call does not seem to be the best solution but it 
> was quick for me to see how much memory can be saved. Also no unit tests 
> where failing for me.
> 2. TarArchiveInputStream.readRecord
>  For every read of the record a new byte[] is created. Since the record size 
> does not change the byte[] can be reused and created when instantiating the 
> TarStream. This optimization is already present in the 
> TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB 
> allocations further to 128MiB (-36MiB).
>  !image-2020-06-21-10-59-10-825.png!
> I attached the patches I used so the results can be verified.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive

2020-07-06 Thread Peter Lee (Jira)


[ 
https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151815#comment-17151815
 ] 

Peter Lee commented on COMPRESS-539:


Just have read COMPRESS-449. So this is about the performance. Will revert the 
IOUtils.skip soon.

I'm interested about in which scenario the InputStream.skip would achieve a 
much better performance than the InputStream.Read? From my view this is mostly 
happened in InputStream that supports InputStream.seek - the skip could seek 
instead of reading to skip bytes.

> TarArchiveInputStream allocates a lot of memory when iterating through an 
> archive
> -
>
> Key: COMPRESS-539
> URL: https://issues.apache.org/jira/browse/COMPRESS-539
> Project: Commons Compress
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Robin Schimpf
>Assignee: Peter Lee
>Priority: Major
> Attachments: Don't_call_InputStream#skip.patch, 
> Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, 
> image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png, 
> image-2020-07-05-22-10-07-402.png, image-2020-07-05-22-11-25-526.png, 
> image-2020-07-05-22-32-15-131.png, image-2020-07-05-22-32-31-511.png
>
>
>  I iterated through the linux source tar and noticed some unneeded 
> allocations happen without extracting any data.
> Reproducing code
> {code:java}
> File tarFile = new File("linux-5.7.1.tar");
> try (TarArchiveInputStream in = new 
> TarArchiveInputStream(Files.newInputStream(tarFile.toPath( {
> TarArchiveEntry entry;
> while ((entry = in.getNextTarEntry()) != null) {
> }
> }
> {code}
> The measurement was done on Java 11.0.7 with the Java Flight Recorder. 
> Options used: 
> -XX:StartFlightRecording=settings=profile,filename=allocations.jfr
> Baseline with the current master implementation:
>  Estimated TLAB allocation: 293MiB
> !image-2020-06-21-10-58-07-917.png!
> 1. IOUtils.skip -> input.skip(numToSkip)
>  This delegates in my test scenario to the InputStream.skip implementation 
> which allocates a new byte[] for every invocation. By simply commenting out 
> the while loop which calls the skip method the estimated TLAB allocation 
> drops to 164MiB (-129MiB).
>  !image-2020-06-21-10-58-43-255.png! 
>  Commenting out the skip call does not seem to be the best solution but it 
> was quick for me to see how much memory can be saved. Also no unit tests 
> where failing for me.
> 2. TarArchiveInputStream.readRecord
>  For every read of the record a new byte[] is created. Since the record size 
> does not change the byte[] can be reused and created when instantiating the 
> TarStream. This optimization is already present in the 
> TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB 
> allocations further to 128MiB (-36MiB).
>  !image-2020-06-21-10-59-10-825.png!
> I attached the patches I used so the results can be verified.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive

2020-07-05 Thread Peter Lee (Jira)


[ 
https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151748#comment-17151748
 ] 

Peter Lee commented on COMPRESS-539:


Ah, I'm sorry that I didn't see your reply on 4, Jul.

Will look into COMPRESS-449.

 

_> I'm sorry we crossed with our actions mid-flight_ [~peterlee] _. I'm afraid 
I have to ask you to revert your change to {{IOUtils.skip}}._

Sure. Will look into COMPRESS-449 and revert this afterwards.

> TarArchiveInputStream allocates a lot of memory when iterating through an 
> archive
> -
>
> Key: COMPRESS-539
> URL: https://issues.apache.org/jira/browse/COMPRESS-539
> Project: Commons Compress
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Robin Schimpf
>Assignee: Peter Lee
>Priority: Major
> Attachments: Don't_call_InputStream#skip.patch, 
> Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, 
> image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png, 
> image-2020-07-05-22-10-07-402.png, image-2020-07-05-22-11-25-526.png, 
> image-2020-07-05-22-32-15-131.png, image-2020-07-05-22-32-31-511.png
>
>
>  I iterated through the linux source tar and noticed some unneeded 
> allocations happen without extracting any data.
> Reproducing code
> {code:java}
> File tarFile = new File("linux-5.7.1.tar");
> try (TarArchiveInputStream in = new 
> TarArchiveInputStream(Files.newInputStream(tarFile.toPath( {
> TarArchiveEntry entry;
> while ((entry = in.getNextTarEntry()) != null) {
> }
> }
> {code}
> The measurement was done on Java 11.0.7 with the Java Flight Recorder. 
> Options used: 
> -XX:StartFlightRecording=settings=profile,filename=allocations.jfr
> Baseline with the current master implementation:
>  Estimated TLAB allocation: 293MiB
> !image-2020-06-21-10-58-07-917.png!
> 1. IOUtils.skip -> input.skip(numToSkip)
>  This delegates in my test scenario to the InputStream.skip implementation 
> which allocates a new byte[] for every invocation. By simply commenting out 
> the while loop which calls the skip method the estimated TLAB allocation 
> drops to 164MiB (-129MiB).
>  !image-2020-06-21-10-58-43-255.png! 
>  Commenting out the skip call does not seem to be the best solution but it 
> was quick for me to see how much memory can be saved. Also no unit tests 
> where failing for me.
> 2. TarArchiveInputStream.readRecord
>  For every read of the record a new byte[] is created. Since the record size 
> does not change the byte[] can be reused and created when instantiating the 
> TarStream. This optimization is already present in the 
> TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB 
> allocations further to 128MiB (-36MiB).
>  !image-2020-06-21-10-59-10-825.png!
> I attached the patches I used so the results can be verified.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive

2020-07-05 Thread Robin Schimpf (Jira)


[ 
https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151659#comment-17151659
 ] 

Robin Schimpf commented on COMPRESS-539:


I've rerun my test with the current master branch. This is the baseline of the 
master before the patch

  !image-2020-07-05-22-32-15-131.png!

After the patch estimated allocation is down to 127MiB which is about the same 
as with my original patches

  !image-2020-07-05-22-32-31-511.png!

[~bodewig] I also tested SkipShieldingInputStream which has the same positive 
effect on the memory allocation. Since this all was an arificial test to reduce 
memory allocation for the "default" usage of the api I don't include my 
measurements but it is good to know about the existence of this.

Sadly the linked issue had no reference to benchmarks or cases where calling 
the skip method has better performance as I noted no difference in my tests 
(remember that those where artificial ones). But I can think about some special 
skip implementations of streams which have to do much less work than the read 
method.

Maybe the only solution for streams that do not override the skip method would 
be to change the implementation in the InputStream class to use a fix buffer 
instead always allocate a new one. But there could be a chance that this is 
expected by the Java library developers.

 

To sum up in my opinion changes to the amount of memory IOUtils.skip allocates 
would be appreciated but not really neccessary. I just wanted to raise the 
attention and see if you had any ideas if it could be improved (as I had not 
really one).

> TarArchiveInputStream allocates a lot of memory when iterating through an 
> archive
> -
>
> Key: COMPRESS-539
> URL: https://issues.apache.org/jira/browse/COMPRESS-539
> Project: Commons Compress
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Robin Schimpf
>Assignee: Peter Lee
>Priority: Major
> Attachments: Don't_call_InputStream#skip.patch, 
> Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, 
> image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png, 
> image-2020-07-05-22-10-07-402.png, image-2020-07-05-22-11-25-526.png, 
> image-2020-07-05-22-32-15-131.png, image-2020-07-05-22-32-31-511.png
>
>
>  I iterated through the linux source tar and noticed some unneeded 
> allocations happen without extracting any data.
> Reproducing code
> {code:java}
> File tarFile = new File("linux-5.7.1.tar");
> try (TarArchiveInputStream in = new 
> TarArchiveInputStream(Files.newInputStream(tarFile.toPath( {
> TarArchiveEntry entry;
> while ((entry = in.getNextTarEntry()) != null) {
> }
> }
> {code}
> The measurement was done on Java 11.0.7 with the Java Flight Recorder. 
> Options used: 
> -XX:StartFlightRecording=settings=profile,filename=allocations.jfr
> Baseline with the current master implementation:
>  Estimated TLAB allocation: 293MiB
> !image-2020-06-21-10-58-07-917.png!
> 1. IOUtils.skip -> input.skip(numToSkip)
>  This delegates in my test scenario to the InputStream.skip implementation 
> which allocates a new byte[] for every invocation. By simply commenting out 
> the while loop which calls the skip method the estimated TLAB allocation 
> drops to 164MiB (-129MiB).
>  !image-2020-06-21-10-58-43-255.png! 
>  Commenting out the skip call does not seem to be the best solution but it 
> was quick for me to see how much memory can be saved. Also no unit tests 
> where failing for me.
> 2. TarArchiveInputStream.readRecord
>  For every read of the record a new byte[] is created. Since the record size 
> does not change the byte[] can be reused and created when instantiating the 
> TarStream. This optimization is already present in the 
> TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB 
> allocations further to 128MiB (-36MiB).
>  !image-2020-06-21-10-59-10-825.png!
> I attached the patches I used so the results can be verified.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive

2020-07-04 Thread Stefan Bodewig (Jira)


[ 
https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151287#comment-17151287
 ] 

Stefan Bodewig commented on COMPRESS-539:
-

I'm sorry we crossed with our actions mid-flight [~peterlee] . I'm afraid I 
have to ask you to revert your change to {{IOUtils.skip}}.

> TarArchiveInputStream allocates a lot of memory when iterating through an 
> archive
> -
>
> Key: COMPRESS-539
> URL: https://issues.apache.org/jira/browse/COMPRESS-539
> Project: Commons Compress
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Robin Schimpf
>Assignee: Peter Lee
>Priority: Major
> Attachments: Don't_call_InputStream#skip.patch, 
> Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, 
> image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png
>
>
>  I iterated through the linux source tar and noticed some unneeded 
> allocations happen without extracting any data.
> Reproducing code
> {code:java}
> File tarFile = new File("linux-5.7.1.tar");
> try (TarArchiveInputStream in = new 
> TarArchiveInputStream(Files.newInputStream(tarFile.toPath( {
> TarArchiveEntry entry;
> while ((entry = in.getNextTarEntry()) != null) {
> }
> }
> {code}
> The measurement was done on Java 11.0.7 with the Java Flight Recorder. 
> Options used: 
> -XX:StartFlightRecording=settings=profile,filename=allocations.jfr
> Baseline with the current master implementation:
>  Estimated TLAB allocation: 293MiB
> !image-2020-06-21-10-58-07-917.png!
> 1. IOUtils.skip -> input.skip(numToSkip)
>  This delegates in my test scenario to the InputStream.skip implementation 
> which allocates a new byte[] for every invocation. By simply commenting out 
> the while loop which calls the skip method the estimated TLAB allocation 
> drops to 164MiB (-129MiB).
>  !image-2020-06-21-10-58-43-255.png! 
>  Commenting out the skip call does not seem to be the best solution but it 
> was quick for me to see how much memory can be saved. Also no unit tests 
> where failing for me.
> 2. TarArchiveInputStream.readRecord
>  For every read of the record a new byte[] is created. Since the record size 
> does not change the byte[] can be reused and created when instantiating the 
> TarStream. This optimization is already present in the 
> TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB 
> allocations further to 128MiB (-36MiB).
>  !image-2020-06-21-10-59-10-825.png!
> I attached the patches I used so the results can be verified.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive

2020-07-04 Thread Stefan Bodewig (Jira)


[ 
https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151285#comment-17151285
 ] 

Stefan Bodewig commented on COMPRESS-539:
-

COMPRESS-449 rather than 443, sorry.

> TarArchiveInputStream allocates a lot of memory when iterating through an 
> archive
> -
>
> Key: COMPRESS-539
> URL: https://issues.apache.org/jira/browse/COMPRESS-539
> Project: Commons Compress
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Robin Schimpf
>Assignee: Peter Lee
>Priority: Major
> Attachments: Don't_call_InputStream#skip.patch, 
> Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, 
> image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png
>
>
>  I iterated through the linux source tar and noticed some unneeded 
> allocations happen without extracting any data.
> Reproducing code
> {code:java}
> File tarFile = new File("linux-5.7.1.tar");
> try (TarArchiveInputStream in = new 
> TarArchiveInputStream(Files.newInputStream(tarFile.toPath( {
> TarArchiveEntry entry;
> while ((entry = in.getNextTarEntry()) != null) {
> }
> }
> {code}
> The measurement was done on Java 11.0.7 with the Java Flight Recorder. 
> Options used: 
> -XX:StartFlightRecording=settings=profile,filename=allocations.jfr
> Baseline with the current master implementation:
>  Estimated TLAB allocation: 293MiB
> !image-2020-06-21-10-58-07-917.png!
> 1. IOUtils.skip -> input.skip(numToSkip)
>  This delegates in my test scenario to the InputStream.skip implementation 
> which allocates a new byte[] for every invocation. By simply commenting out 
> the while loop which calls the skip method the estimated TLAB allocation 
> drops to 164MiB (-129MiB).
>  !image-2020-06-21-10-58-43-255.png! 
>  Commenting out the skip call does not seem to be the best solution but it 
> was quick for me to see how much memory can be saved. Also no unit tests 
> where failing for me.
> 2. TarArchiveInputStream.readRecord
>  For every read of the record a new byte[] is created. Since the record size 
> does not change the byte[] can be reused and created when instantiating the 
> TarStream. This optimization is already present in the 
> TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB 
> allocations further to 128MiB (-36MiB).
>  !image-2020-06-21-10-59-10-825.png!
> I attached the patches I used so the results can be verified.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive

2020-07-04 Thread Stefan Bodewig (Jira)


[ 
https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151283#comment-17151283
 ] 

Stefan Bodewig commented on COMPRESS-539:
-

not calling skip may reduce the amount of memory consumed but will slow things 
down considerably as skip is usually far more efficient than read. In your case 
I'd recommend wrapping your input stream into a {{SkipShieldingInputStream}} 
instead, which will have the same effect as commenting out the skip calls.

Re-using the buffer in {{TarArchiveInputStream}} sounds right. No idea why this 
is different from TarArchiveOutputStream where it seems we've used a shared 
buffer since forever.

[~peterlee] the team has had this discussion around COMPRESS-443 - a while 
before you joined - and back then explicitly decided we wanted to keep using 
{{skip}} for performance reasons.

> TarArchiveInputStream allocates a lot of memory when iterating through an 
> archive
> -
>
> Key: COMPRESS-539
> URL: https://issues.apache.org/jira/browse/COMPRESS-539
> Project: Commons Compress
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Robin Schimpf
>Assignee: Peter Lee
>Priority: Major
> Attachments: Don't_call_InputStream#skip.patch, 
> Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, 
> image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png
>
>
>  I iterated through the linux source tar and noticed some unneeded 
> allocations happen without extracting any data.
> Reproducing code
> {code:java}
> File tarFile = new File("linux-5.7.1.tar");
> try (TarArchiveInputStream in = new 
> TarArchiveInputStream(Files.newInputStream(tarFile.toPath( {
> TarArchiveEntry entry;
> while ((entry = in.getNextTarEntry()) != null) {
> }
> }
> {code}
> The measurement was done on Java 11.0.7 with the Java Flight Recorder. 
> Options used: 
> -XX:StartFlightRecording=settings=profile,filename=allocations.jfr
> Baseline with the current master implementation:
>  Estimated TLAB allocation: 293MiB
> !image-2020-06-21-10-58-07-917.png!
> 1. IOUtils.skip -> input.skip(numToSkip)
>  This delegates in my test scenario to the InputStream.skip implementation 
> which allocates a new byte[] for every invocation. By simply commenting out 
> the while loop which calls the skip method the estimated TLAB allocation 
> drops to 164MiB (-129MiB).
>  !image-2020-06-21-10-58-43-255.png! 
>  Commenting out the skip call does not seem to be the best solution but it 
> was quick for me to see how much memory can be saved. Also no unit tests 
> where failing for me.
> 2. TarArchiveInputStream.readRecord
>  For every read of the record a new byte[] is created. Since the record size 
> does not change the byte[] can be reused and created when instantiating the 
> TarStream. This optimization is already present in the 
> TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB 
> allocations further to 128MiB (-36MiB).
>  !image-2020-06-21-10-59-10-825.png!
> I attached the patches I used so the results can be verified.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive

2020-07-04 Thread Peter Lee (Jira)


[ 
https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151284#comment-17151284
 ] 

Peter Lee commented on COMPRESS-539:


Hey [~rschimpf] . Please have a look at commit 
101137bfa0f3ae709c2a2771368b190ceb899ea0 if it helps.

> TarArchiveInputStream allocates a lot of memory when iterating through an 
> archive
> -
>
> Key: COMPRESS-539
> URL: https://issues.apache.org/jira/browse/COMPRESS-539
> Project: Commons Compress
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Robin Schimpf
>Assignee: Peter Lee
>Priority: Major
> Attachments: Don't_call_InputStream#skip.patch, 
> Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, 
> image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png
>
>
>  I iterated through the linux source tar and noticed some unneeded 
> allocations happen without extracting any data.
> Reproducing code
> {code:java}
> File tarFile = new File("linux-5.7.1.tar");
> try (TarArchiveInputStream in = new 
> TarArchiveInputStream(Files.newInputStream(tarFile.toPath( {
> TarArchiveEntry entry;
> while ((entry = in.getNextTarEntry()) != null) {
> }
> }
> {code}
> The measurement was done on Java 11.0.7 with the Java Flight Recorder. 
> Options used: 
> -XX:StartFlightRecording=settings=profile,filename=allocations.jfr
> Baseline with the current master implementation:
>  Estimated TLAB allocation: 293MiB
> !image-2020-06-21-10-58-07-917.png!
> 1. IOUtils.skip -> input.skip(numToSkip)
>  This delegates in my test scenario to the InputStream.skip implementation 
> which allocates a new byte[] for every invocation. By simply commenting out 
> the while loop which calls the skip method the estimated TLAB allocation 
> drops to 164MiB (-129MiB).
>  !image-2020-06-21-10-58-43-255.png! 
>  Commenting out the skip call does not seem to be the best solution but it 
> was quick for me to see how much memory can be saved. Also no unit tests 
> where failing for me.
> 2. TarArchiveInputStream.readRecord
>  For every read of the record a new byte[] is created. Since the record size 
> does not change the byte[] can be reused and created when instantiating the 
> TarStream. This optimization is already present in the 
> TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB 
> allocations further to 128MiB (-36MiB).
>  !image-2020-06-21-10-59-10-825.png!
> I attached the patches I used so the results can be verified.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive

2020-07-04 Thread Peter Lee (Jira)


[ 
https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151216#comment-17151216
 ] 

Peter Lee commented on COMPRESS-539:


IIRC the IOUtils.skip was originally copied from Apache Commons IO(not a maven 
dependency to keep Compress as simple as we want it).

I just checked the IOUtils in Apache Commons IO and found out the skip method 
looks like this:
{code:java}
/**
 * Skips characters from an input character stream.
 * This implementation guarantees that it will read as many characters
 * as possible before giving up; this may not always be the case for
 * skip() implementations in subclasses of {@link Reader}.
 * 
 * Note that the implementation uses {@link Reader#read(char[], int, int)} 
rather
 * than delegating to {@link Reader#skip(long)}.
 * This means that the method may be considerably less efficient than using the 
actual skip implementation,
 * this is done to guarantee that the correct number of characters are skipped.
 * 
 *
 * @param input character stream to skip
 * @param toSkip number of characters to skip.
 * @return number of characters actually skipped.
 * @throws IOException  if there is a problem reading the file
 * @throws IllegalArgumentException if toSkip is negative
 * @see Reader#skip(long)
 * @see https://issues.apache.org/jira/browse/IO-203";>IO-203 - Add 
skipFully() method for InputStreams
 * @since 2.0
 */
public static long skip(final Reader input, final long toSkip) throws 
IOException {
if (toSkip < 0) {
throw new IllegalArgumentException("Skip count must be non-negative, 
actual: " + toSkip);
}
/*
 * N.B. no need to synchronize this because: - we don't care if the buffer 
is created multiple times (the data
 * is ignored) - we always use the same size buffer, so if it it is 
recreated it will still be OK (if the buffer
 * size were variable, we would need to synch. to ensure some other thread 
did not create a smaller one)
 */
if (SKIP_CHAR_BUFFER == null) {
SKIP_CHAR_BUFFER = new char[SKIP_BUFFER_SIZE];
}
long remain = toSkip;
while (remain > 0) {
// See https://issues.apache.org/jira/browse/IO-203 for why we use 
read() rather than delegating to skip()
final long n = input.read(SKIP_CHAR_BUFFER, 0, (int) Math.min(remain, 
SKIP_BUFFER_SIZE));
if (n < 0) { // EOF
break;
}
remain -= n;
}
return toSkip - remain;
}
{code}
Not calling inputstream.skip as you did in your patch. I believe this 
implemention would get the same memory benefit as your patch did.

BTW seems the IOUtils methods in Commons Compress are somehow out of date - 
maybe we can update them using reference from Commons IO.

> TarArchiveInputStream allocates a lot of memory when iterating through an 
> archive
> -
>
> Key: COMPRESS-539
> URL: https://issues.apache.org/jira/browse/COMPRESS-539
> Project: Commons Compress
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Robin Schimpf
>Priority: Major
> Attachments: Don't_call_InputStream#skip.patch, 
> Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, 
> image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png
>
>
>  I iterated through the linux source tar and noticed some unneeded 
> allocations happen without extracting any data.
> Reproducing code
> {code:java}
> File tarFile = new File("linux-5.7.1.tar");
> try (TarArchiveInputStream in = new 
> TarArchiveInputStream(Files.newInputStream(tarFile.toPath( {
> TarArchiveEntry entry;
> while ((entry = in.getNextTarEntry()) != null) {
> }
> }
> {code}
> The measurement was done on Java 11.0.7 with the Java Flight Recorder. 
> Options used: 
> -XX:StartFlightRecording=settings=profile,filename=allocations.jfr
> Baseline with the current master implementation:
>  Estimated TLAB allocation: 293MiB
> !image-2020-06-21-10-58-07-917.png!
> 1. IOUtils.skip -> input.skip(numToSkip)
>  This delegates in my test scenario to the InputStream.skip implementation 
> which allocates a new byte[] for every invocation. By simply commenting out 
> the while loop which calls the skip method the estimated TLAB allocation 
> drops to 164MiB (-129MiB).
>  !image-2020-06-21-10-58-43-255.png! 
>  Commenting out the skip call does not seem to be the best solution but it 
> was quick for me to see how much memory can be saved. Also no unit tests 
> where failing for me.
> 2. TarArchiveInputStream.readRecord
>  For every read of the record a new byte[] is created. Since the record size 
> does not change the byte[] can be reused and created when instantiating the 
> TarStream. This optimization is already present in the 
> TarArchiveOutputStream. Reusi

[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive

2020-07-04 Thread Peter Lee (Jira)


[ 
https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151207#comment-17151207
 ] 

Peter Lee commented on COMPRESS-539:


Looks great. Would you create a github PR so the tests could be applied?

> TarArchiveInputStream allocates a lot of memory when iterating through an 
> archive
> -
>
> Key: COMPRESS-539
> URL: https://issues.apache.org/jira/browse/COMPRESS-539
> Project: Commons Compress
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Robin Schimpf
>Priority: Major
> Attachments: Don't_call_InputStream#skip.patch, 
> Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, 
> image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png
>
>
>  I iterated through the linux source tar and noticed some unneeded 
> allocations happen without extracting any data.
> Reproducing code
> {code:java}
> File tarFile = new File("linux-5.7.1.tar");
> try (TarArchiveInputStream in = new 
> TarArchiveInputStream(Files.newInputStream(tarFile.toPath( {
> TarArchiveEntry entry;
> while ((entry = in.getNextTarEntry()) != null) {
> }
> }
> {code}
> The measurement was done on Java 11.0.7 with the Java Flight Recorder. 
> Options used: 
> -XX:StartFlightRecording=settings=profile,filename=allocations.jfr
> Baseline with the current master implementation:
>  Estimated TLAB allocation: 293MiB
> !image-2020-06-21-10-58-07-917.png!
> 1. IOUtils.skip -> input.skip(numToSkip)
>  This delegates in my test scenario to the InputStream.skip implementation 
> which allocates a new byte[] for every invocation. By simply commenting out 
> the while loop which calls the skip method the estimated TLAB allocation 
> drops to 164MiB (-129MiB).
>  !image-2020-06-21-10-58-43-255.png! 
>  Commenting out the skip call does not seem to be the best solution but it 
> was quick for me to see how much memory can be saved. Also no unit tests 
> where failing for me.
> 2. TarArchiveInputStream.readRecord
>  For every read of the record a new byte[] is created. Since the record size 
> does not change the byte[] can be reused and created when instantiating the 
> TarStream. This optimization is already present in the 
> TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB 
> allocations further to 128MiB (-36MiB).
>  !image-2020-06-21-10-59-10-825.png!
> I attached the patches I used so the results can be verified.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)