Hi Ralf,

On 2020/02/20 22:21, Schmelter, Ralf wrote:
Hi Yasumasa,

I think it would be great if we could redirect larger chunks of data to jcmd.

But you have to differentiate between binary data (for the heap dump) and text 
data (e.g. for the codelist).

Currently jcmd assumes all bytes are UTF-8 encoded, converts them to Unicode 
and then uses the platform encoding to write characters. This is not suitable 
for binary data.

And of course you cannot use the bufferedStream to get the output to jcmd. You 
would have to implement an outputStream which can directly write to the 
AttachListener connection.

I understand, but I think we can implement a new class which extends 
outputStream or bufferedStream.
On the jcmd side, we can switch between handling binary and text data.
On the HotSpot side, we can switch which stream class to use based on 
parameter(s) from the frontend (jcmd).
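
A minimal sketch of the jcmd-side switch (the drain helper and its binary 
flag are hypothetical, just to illustrate the idea):

    import java.io.*;
    import java.nio.charset.StandardCharsets;

    // "in" is the stream read from the attach connection; "binary" would be
    // selected by a new parameter sent from the jcmd frontend.
    static void drain(InputStream in, OutputStream binaryOut, boolean binary)
            throws IOException {
        if (binary) {
            in.transferTo(binaryOut);  // pass bytes through untouched, e.g. into a .gz file
        } else {
            // roughly what jcmd does today: decode UTF-8, print in platform encoding
            BufferedReader r = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line);
            }
        }
    }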


But even with this change, I would still like the gzip compression to be done 
in the VM. Let me try to list all the advantages I see for doing this:

1. It is by far the easiest to use. You just have to specify -gz for the jcmd. 
While your command line (jcmd .... | gzip -c > file) is easy enough, it assumes 
you have gzip (not there by default on Windows) and it would be painfully slow 
(~10x and more), since it is not parallel. You could use pigz, but it is not as 
ubiquitous as gzip. I know it is sometimes hard to imagine this could be a 
problem for anyone, but it is.

It is easy to tell a customer to execute jcmd <pid> GC.heap_dump -gz 
test.hprof.gz. Add additional requirements, especially external programs, and 
your chances of success diminish fast.

As a troubleshooter, I agree with you about ease of use and ease of 
instruction for customers.
But we can address your concern by providing command examples or a shell 
script to collect the data.
On modern Windows, tar (which of course includes the -z option) is available, 
so we can compress the heap dump with it:

https://techcommunity.microsoft.com/t5/containers/tar-and-curl-come-to-windows/ba-p/382409
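
For instance (paths are illustrative):

    $ jcmd <pid> GC.heap_dump C:\dumps\test.hprof
    $ tar -czf C:\dumps\test.hprof.gz -C C:\dumps test.hprof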


2. The -XX:+HeapDumpOnOutOfMemoryError, -XX:+HeapDumpBeforeFullGC and 
-XX:+HeapDumpAfterFullGC options can easily create gzipped heap dumps directly 
when the compression is in the VM. And especially if you create more than one 
dump (with the before/after gc flags), compression is very useful. Or if you 
want to support compressed heap dumps in the HotSpotDiagnosticMXBean, just add 
a flag and/or compression level.
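
For reference, a minimal sketch using the existing API; the commented-out 
compressed variant is hypothetical and only illustrates the kind of extension 
meant here:

    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.lang.management.ManagementFactory;

    public class DumpHeap {
        public static void main(String[] args) throws Exception {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            bean.dumpHeap("test.hprof", true /* dump only live objects */);
            // a compressed variant might add a flag and/or level (hypothetical):
            // bean.dumpHeap("test.hprof.gz", true, 1 /* gzip level */);
        }
    }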

Do you have experience with HeapDumpBeforeFullGC and/or HeapDumpAfterFullGC?
I guess they are not used in production environments.

I recommend that my customers use -XX:+HeapDumpOnOutOfMemoryError, but we can 
also use -XX:OnOutOfMemoryError.
If there is enough disk space for the dump, we can invoke gzip via 
-XX:OnOutOfMemoryError; it runs after HeapDumpOnOutOfMemoryError has written 
the dump.
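
For example (the path and main class name are illustrative):

    $ java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/app.hprof \
          -XX:OnOutOfMemoryError="gzip /tmp/app.hprof" MyApp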


3. The created gz-file is not the plain gz-file you would get by simply 
running gzip. It is created in a way that makes it possible to treat it like a 
random access file without decompressing it.

Currently, for example, the Eclipse Memory Analyzer (MAT) has the option to 
directly open a gzipped hprof file and use it without decompression. And for 
the initial parsing, it can just read the file sequentially, so this is not 
too slow.

But when accessing the values of objects or arrays, it has to seek to 
specific positions in the gzipped hprof file. This is currently implemented by 
having a Java implementation of an InflaterInputStream which is capable of 
completely copying its state. This copy is then used to start decompressing at 
the specific offset for which it was created. As you can imagine, the state of 
the inflater is not small (MAT assumes about 64 KB; at least 32 KB is needed 
for the dictionary), so it limits the number of starting positions you can use 
for large files. But it works for all kinds of gzip compressed streams.

The gzip implementation used to write the heap dump in the VM creates many 
small gzip compressed chunks. At the start of each chunk you can create a fresh 
GZIPInputStream without having to store any internal state. You only need to 
remember the physical offset and the logical offset (so 2 long values) for each 
chunk. If you then want to read data at a specific logical offset, you binary 
search for the nearest preceding chunk and create a GZIPInputStream reading 
from the physical offset of that chunk. So on average you only have to 
decompress about half a chunk to get to the data you need.
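
A rough Java sketch of that lookup (the chunk index arrays physOff/logOff are 
assumed to have been collected beforehand; all names are illustrative):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;

    class ChunkedGzipSeek {
        // Returns a stream positioned at logicalPos in the uncompressed data.
        static InputStream open(String file, long[] physOff, long[] logOff,
                                long logicalPos) throws IOException {
            // binary search for the nearest chunk starting at or before logicalPos
            int lo = 0, hi = logOff.length - 1, idx = 0;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                if (logOff[mid] <= logicalPos) { idx = mid; lo = mid + 1; }
                else { hi = mid - 1; }
            }
            FileInputStream in = new FileInputStream(file);
            in.getChannel().position(physOff[idx]);   // jump to the chunk's physical start
            InputStream gz = new GZIPInputStream(in); // fresh inflater, no saved state
            long toSkip = logicalPos - logOff[idx];   // at most ~one chunk to decompress
            while (toSkip > 0) {
                long n = gz.skip(toSkip);
                if (n <= 0) throw new IOException("unexpected end of stream");
                toSkip -= n;
            }
            return gz;
        }
    }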

If you look at the webrev, you can see 
http://cr.openjdk.java.net/~rschmelter/webrevs/8237354/webrev.0/test/lib/jdk/test/lib/hprof/parser/GzipRandomAccess.java.html.
 This implements the logic needed to treat the gzipped hprof file as a random 
access file. I have used it to add support for gzipped files in the jhat 
library (which is only used in tests). In jhat, for example, the resolution 
of references is done via random access. And the file also contains all the 
functionality MAT would need.

I've used MAT to analyze heap dumps, and I usually check various objects in 
them.
AFAIK a heap dump is a snapshot of the heap, so we need to traverse it 
entirely, don't we?
If so, we actually need to decompress the whole heap dump anyway.


You can generate a more or less equivalent file if you use pigz with the 
--independent option. But to make it easier to detect that the gzip file is 
chunked (without decompressing it first), I've added a comment marking it as a 
hprof file with a given chunk size. This would be missing from the pigz file, 
but pigz instead adds 9 bytes when --independent is specified (00 00 ff ff 00 
00 00 ff ff), so you could detect that too.
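
Checking for that marker in a buffer would be simple enough; a sketch based 
purely on the byte values above:

    // does buf[off..off+8] match the 9-byte pigz --independent marker?
    static boolean isPigzIndependentMarker(byte[] buf, int off) {
        int[] marker = {0x00, 0x00, 0xff, 0xff, 0x00, 0x00, 0x00, 0xff, 0xff};
        if (off + marker.length > buf.length) return false;
        for (int i = 0; i < marker.length; i++) {
            if ((buf[off + i] & 0xff) != marker[i]) return false;
        }
        return true;
    }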

Is that within the gzip spec?
I'm not familiar with gzip, but I'm concerned that it may be specialized for 
something.


To summarize, the gzipped hprof file created by the VM makes it much easier 
for tools to access it efficiently at random positions. You can do something 
equivalent with pigz, but not with gzip.

And getting the heap dump tools to support this type of gzipped hprof file 
will be much easier if this is the format OpenJDK produces, since it will 
then be widespread.

I think it is a balance between the implementation/maintenance cost of your 
change and the ease of use / disk space reduction it brings.

On Linux, we can redirect a core dump to an archiver/compressor with 
/proc/sys/kernel/core_pattern.
IMHO it is natural for a heap dump to be handled the same way as a memory 
(core) dump.
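
For example, a sketch (paths are illustrative; the kernel pipes the core to 
the helper's stdin and does not run a shell, so a small wrapper script is 
needed):

    # echo '|/usr/local/bin/gzip-core %p' > /proc/sys/kernel/core_pattern

where /usr/local/bin/gzip-core could be something like:

    #!/bin/sh
    # compress the core dump arriving on stdin; $1 is the pid (%p)
    exec gzip -c > /var/tmp/core.$1.gz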


Thanks,

Yasumasa


Best regards,
Ralf

-----Original Message-----
From: Yasumasa Suenaga <suen...@oss.nttdata.com>
Sent: Donnerstag, 20. Februar 2020 00:59
To: Ioi Lam <ioi....@oracle.com>; Schmelter, Ralf <ralf.schmel...@sap.com>; 
serguei.spit...@oracle.com; hotspot-runtime-...@openjdk.java.net runtime 
<hotspot-runtime-...@openjdk.java.net>
Cc: serviceability-dev@openjdk.java.net
Subject: Re: RFR(L) 8237354: Add option to jcmd to write a gzipped heap dump

Hi,

Generally I agree with Ioi, but I think it is not a problem only for gzipped 
heap dumps.

For example, Compiler.codelist and Compiler.CodeHeap_Analytics might produce 
large text output.
In addition, some users want to redirect the result from jcmd to another 
command or a log collector.

So I think it would be better if jcmd provided a stdout redirect option for 
all subcommands. E.g.

    $ jcmd <PID> GC.heap_dump -stdout | gzip -c - > heapdump.hprof.gz


Thanks,

Yasumasa
