Anton Vinogradov created IGNITE-28848:
-----------------------------------------
Summary: ZooKeeper discovery: stream marshalZip through
DeflaterOutputStream to cut allocations and peak memory
Key: IGNITE-28848
URL: https://issues.apache.org/jira/browse/IGNITE-28848
Project: Ignite
Issue Type: Task
Reporter: Anton Vinogradov
Assignee: Anton Vinogradov
h3. Problem
{{ZookeeperDiscoveryImpl.marshalZip()}} materializes the whole uncompressed
marshalled form before compressing it:
{code:java}
return zip(U.marshal(marsh, obj));
{code}
For an uncompressed size of N bytes a single call allocates:
* in {{U.marshal}}: the marshalling buffer grown to >= N plus a trim copy of
N ({{GridByteArrayOutputStream.toByteArray()}});
* in {{zip()}}: a deflate buffer of N ({{new byte[bytes.length]}}) and an
output {{GridByteArrayOutputStream}} pre-sized to N, plus the final trim copy.
That is roughly 6x the uncompressed size in garbage and ~3x in peak live memory
(marshalled bytes, deflate buffer and output buffer are alive simultaneously).
{{Deflater.end()}} is never called, so zlib native memory is released only by
the Cleaner after a GC.
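The allocation pattern described above can be reproduced with plain JDK classes. The sketch below is an illustration under stated assumptions, not the Ignite source: {{OneShotZip}} and {{zipOneShot}} are hypothetical names, and {{ByteArrayOutputStream}} stands in for {{GridByteArrayOutputStream}}. It keeps the input-sized deflate buffer and pre-sized output, and adds the {{try/finally}} with {{Deflater.end()}} that the description says is missing.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class OneShotZip {
    /** Compresses a fully materialized array (hypothetical stand-in for the old zip()). */
    static byte[] zipOneShot(byte[] bytes) {
        Deflater deflater = new Deflater();

        try {
            deflater.setInput(bytes);
            deflater.finish();

            // Input-sized scratch buffer: this is the "deflate buffer of N".
            // The lower bound of 64 only guards the empty-input case in this sketch.
            byte[] buf = new byte[Math.max(bytes.length, 64)];

            // Output pre-sized to N as well, although the result is usually smaller.
            ByteArrayOutputStream out = new ByteArrayOutputStream(bytes.length);

            while (!deflater.finished()) {
                int cnt = deflater.deflate(buf);

                out.write(buf, 0, cnt);
            }

            // toByteArray() makes one more (trimmed) copy.
            return out.toByteArray();
        }
        finally {
            // Release zlib native memory now instead of waiting for the Cleaner.
            deflater.end();
        }
    }
}
```

Even with {{end()}} added, the input array, the scratch buffer and the output buffer are all live at the same time, which is the ~3x peak the description refers to.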
{{marshalZip()}} runs on the discovery control plane:
* joining node data and the coordinator's data-for-joined: megabytes in
clusters with hundreds of caches (large enough to be split across znodes by
{{jute.maxbuffer}});
* {{ZkDiscoveryEventsData}}, which the coordinator re-marshals on every
discovery event;
* security credentials / security subject.
h3. Change
Stream the marshaller output straight through the compressor, so the
uncompressed form is never materialized:
{code:java}
GridByteArrayOutputStream out = new GridByteArrayOutputStream();

// Closing zipOut flushes the buffer and finishes the deflate stream.
try (BufferedOutputStream zipOut =
         new BufferedOutputStream(new DeflaterOutputStream(out))) {
    U.marshal(marsh, obj, zipOut);
}

return out.toByteArray();
{code}
* The {{BufferedOutputStream}} is essential: {{ObjectOutputStream}} writes in
~1 KB blocks, and an unbuffered {{DeflaterOutputStream}} pays a JNI deflate
call per block. The unbuffered variant was measured and rejected: it is 14-20%
slower than the current code (25.5 vs 21.3 us/op on a 1.6 KB payload, 20,114 vs
17,661 us/op on 1.66 MB). The default 8 KB buffer coalesces the blocks and
brings the time on par with or below the current code.
* {{DeflaterOutputStream.close()}} ends the {{Deflater}} it owns, so zlib
native memory is released deterministically instead of waiting for the Cleaner.
* The private {{zip()}} helper is removed ({{marshalZip}} was its only
caller); {{unzip()}} stays because it is still used for
{{ATTR_SECURITY_SUBJECT_V2}}. The read side ({{unmarshalZip}}) already
streams through {{InflaterInputStream}} and is not changed.
* Same pattern as {{DiscoveryMessageParser.marshalZip}} in the same package.
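For readers without the Ignite sources at hand, the same stream chain can be sketched with JDK classes only. This is an assumption-laden illustration rather than the actual patch: {{StreamingZip}} and {{marshalZipStreaming}} are hypothetical names, plain Java serialization stands in for the Ignite marshaller, and {{ByteArrayOutputStream}} stands in for {{GridByteArrayOutputStream}}.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.DeflaterOutputStream;

public class StreamingZip {
    /** Serializes obj straight into a zlib stream without materializing the uncompressed bytes. */
    static byte[] marshalZipStreaming(Serializable obj) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        // ObjectOutputStream -> 8 KB buffer -> deflater -> byte array.
        // The buffer coalesces the small block writes before they reach the deflater.
        try (ObjectOutputStream objOut = new ObjectOutputStream(
                new BufferedOutputStream(new DeflaterOutputStream(out)))) {
            objOut.writeObject(obj);
        }
        // Closing the chain flushes the buffer, finishes the zlib stream and
        // ends the default Deflater owned by DeflaterOutputStream.

        return out.toByteArray();
    }
}
```

The result can be read back with {{new ObjectInputStream(new InflaterInputStream(...))}}, which mirrors the unchanged read side.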
h3. Compatibility
The produced bytes are a regular zlib stream, same as before (default
{{Deflater}} settings in both versions). The zlib format is unchanged, so old
nodes inflate the output as is and a rolling upgrade is safe. The benchmark
setup asserts {{inflate(new) == inflate(old) == marshal(obj)}}.
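The compatibility assertion can be illustrated outside the benchmark with a small JDK-only check. This is a sketch, not the benchmark code: {{ZlibCompatCheck}}, {{zipOld}}, {{zipNew}} and {{inflate}} are hypothetical names, and a raw byte array stands in for the marshalled object. It checks only that both outputs inflate to the original bytes; it does not assume the compressed bytes themselves are identical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class ZlibCompatCheck {
    /** Old style: one-shot Deflater with default settings over a whole array. */
    static byte[] zipOld(byte[] bytes) {
        Deflater deflater = new Deflater();

        try {
            deflater.setInput(bytes);
            deflater.finish();

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[512];

            while (!deflater.finished()) {
                int cnt = deflater.deflate(buf);

                out.write(buf, 0, cnt);
            }

            return out.toByteArray();
        }
        finally {
            deflater.end();
        }
    }

    /** New style: DeflaterOutputStream with its default Deflater. */
    static byte[] zipNew(byte[] bytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        try (DeflaterOutputStream zipOut = new DeflaterOutputStream(out)) {
            zipOut.write(bytes);
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        return out.toByteArray();
    }

    /** Reader shared by both versions: a plain InflaterInputStream. */
    static byte[] inflate(byte[] zipped) {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(zipped))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[512];

            for (int n; (n = in.read(buf)) != -1; )
                out.write(buf, 0, n);

            return out.toByteArray();
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = new byte[4096];

        for (int i = 0; i < payload.length; i++)
            payload[i] = (byte)(i % 11);

        boolean same = Arrays.equals(inflate(zipOld(payload)), payload)
            && Arrays.equals(inflate(zipNew(payload)), payload);

        System.out.println("both variants inflate to the original: " + same);
    }
}
```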
h3. Benchmark
JMH, avgt, JDK 17 (corretto), Apple Silicon, fork 1, 3 warmup + 5 measurement
iterations, {{-prof gc}}; old = current code, new = this change:
||payload (marshalled -> zipped)||time old, us/op||time new, us/op||alloc old, B/op||alloc new, B/op||
|MAP_2K (1.6 KB -> 0.7 KB)|21.3 ± 11.5|15.4 ± 0.7|12,544|14,368 (+15%)|
|MAP_100K (132 KB -> 40 KB)|1,987 ± 512|1,931 ± 264|1,065,204|283,835 (−73%)|
|LIST_1M (1.66 MB -> 318 KB)|17,661 ± 3,433|16,825 ± 980|9,891,984|1,788,235 (−82%)|
The MAP_2K row is the trade-off, spelled out:
* time is better even there, because the new path skips the trim copy of the
marshalled form and the input-sized deflate buffer;
* allocations grow by a *bounded* ~1.8 KB/call: the fixed stream overhead (8 KB
{{BufferedOutputStream}} + 512 B {{DeflaterOutputStream}} buffer) replaces
buffers that used to scale with the payload. Rough break-even is ~3 KB of
marshalled data; above it the new code wins and the win grows with size;
* small payloads pass through {{marshalZip}} rarely and outside any loop
(security credentials on node join, the {{ZkJoiningNodeData}} marker when join
data is split, communication-error resolve state), while the frequent call site,
the coordinator rewriting {{ZkDiscoveryEventsData}} on every discovery event,
operates on tens to hundreds of KB in clusters where this path matters at all.
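The allocation deltas quoted above follow directly from the B/op columns of the table. As a quick arithmetic check, the sketch below recomputes them; {{AllocDelta}} and its methods are hypothetical helper names and are not part of the benchmark.

```java
public class AllocDelta {
    /** Absolute change in allocated bytes per op (new minus old). */
    static long deltaBytes(long oldBytes, long newBytes) {
        return newBytes - oldBytes;
    }

    /** Relative change in percent, rounded to the nearest integer. */
    static long deltaPercent(long oldBytes, long newBytes) {
        return Math.round(100.0 * (newBytes - oldBytes) / oldBytes);
    }

    public static void main(String[] args) {
        // Values copied from the benchmark table (alloc old, alloc new).
        System.out.println("MAP_2K:   " + deltaBytes(12_544, 14_368) + " B, "
            + deltaPercent(12_544, 14_368) + "%");
        System.out.println("MAP_100K: " + deltaPercent(1_065_204, 283_835) + "%");
        System.out.println("LIST_1M:  " + deltaPercent(9_891_984, 1_788_235) + "%");
    }
}
```

For MAP_2K this gives 1,824 B per call, which is the bounded ~1.8 KB overhead named in the list.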
h3. Expected effect
This is a control-plane path, so the point is not per-call latency (time is on
par or better). The point is GC pressure and peak heap on the coordinator during
mass joins / topology churn, where join data and events data are marshalled
repeatedly: allocations drop by 73-82% on realistic payloads, peak usage no
longer holds ~3x the uncompressed size, and zlib native memory is released
deterministically.