[
https://issues.apache.org/jira/browse/IGNITE-28613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alex Abashev updated IGNITE-28613:
----------------------------------
Description:
h2. Summary
{{GridNioSelfTest#testAsyncSendReceive}} logs an {{IgniteCheckedException:
Invalid message size: -269488145}} (= {{0xEFEFEFEF}}) on every run on master.
The JUnit assertion still passes (the latch fires before the exception), so the
test doesn't go red — but the stack trace pollutes the surefire log and breaks
any "no errors during test" CI guard.
The original ticket attributed this to a use-after-free race between
{{GridNioServer.close(...)}} calling {{cleanDirectBuffer}} and a concurrent
{{processRead}}. After investigation that RCA is incorrect — see "Why the
original RCA is wrong" below. The actual cause is a test bug introduced by
IGNITE-23040.
h2. Symptom
Reproducible 100% on master on macOS aarch64:
{noformat}
[ERROR][grid-nio-worker-0-#42%nio-test-grid%][] Closing NIO session because of
unhandled exception.
class org.apache.ignite.IgniteCheckedException: Invalid message size: -269488145
at
org.apache.ignite.internal.util.nio.GridNioServerBuffer.read(GridNioServerBuffer.java:82)
at
org.apache.ignite.internal.util.nio.GridBufferedParser.decode(GridBufferedParser.java:69)
at
org.apache.ignite.internal.util.nio.GridBufferedParser.decode(GridBufferedParser.java:36)
at
org.apache.ignite.internal.util.nio.GridNioCodecFilter.onMessageReceived(GridNioCodecFilter.java:114)
...
at
org.apache.ignite.internal.util.nio.GridNioServer$ByteBufferNioClientWorker.processRead(GridNioServer.java:1235)
...
{noformat}
20/20 runs hit it on a recent Apple Silicon JDK 11 build.
h2. Root cause
The {{NioListener}} shared by both {{GridNioServer}} peers in this test sends
an ack on every received message:
{code:java}
//sending ack of processing message
ses.send(new byte[] {(byte)0xDEADBEEF});
{code}
{{(byte)0xDEADBEEF}} truncates to {{0xEF}} — the ack is a *one-byte* message
containing the single byte {{0xEF}}.
The test's {{BufferedParser}} overrides {{encode}} to skip the size prefix:
{code:java}
@Override public ByteBuffer encode(GridNioSession ses, Object msg) {
// IO manager creates array ready to send.
return msg instanceof byte[] ? ByteBuffer.wrap((byte[])msg) :
(ByteBuffer)msg;
}
{code}
So the ack hits the wire as a *single raw byte* {{0xEF}} — no 4-byte size
header.
The same {{NioListener}} is wired to both {{srvr1}} and {{srvr2}}, so every
received message produces an ack going back the other way → ping-pong. Once
four ack bytes accumulate on either side's read buffer, the receiver's
{{GridBufferedParser.decode}} reads them as the next message's 4-byte size
header:
{noformat}
msgSize = 0xEF << 24 | 0xEF << 16 | 0xEF << 8 | 0xEF
= 0xEFEFEFEF
= -269488145 (signed)
{noformat}
{{GridNioServerBuffer.read}} then throws {{Invalid message size: -269488145}}.
The 10 originals (each properly size-prefixed by {{createMessageWithSize()}})
arrive and count down the latch *before* enough ack bytes accumulate, which is
why the assertion still passes.
h2. Why it surfaces now
Introduced by IGNITE-23040 "Fix GridNioSslSelfTest" ([PR
#11505|https://github.com/apache/ignite/pull/11505]), commit {{8782b4882ff}},
24 Sep 2024. That commit added the ack send to fix the *SSL* self-test (where
the peer is a {{TestClient}} that explicitly calls {{client.receiveMessage()}}
to drain the ack), but it modified the *shared* {{NioListener}} rather than
introducing a dedicated one. For {{testAsyncSendReceive}}, where both peers run
a {{GridNioServer}} with the size-prefix-skipping {{BufferedParser}}, the
protocol mismatch produces the spurious error.
Master has shipped this log noise on every run for ~1.5 years. The assertion
never fires because the test's contract (10 originals received) is satisfied
before the parser blows up on the malformed ack stream.
h2. Why the original RCA is wrong
The original ticket attributed the symptom to:
* {{GridNioServer.close(...)}} calling {{cleanDirectBuffer(ses.readBuffer())}}
*before* cancelling {{ses.key()}}, while a worker is still inside
{{processRead(...)}} on that session.
Two reasons this can't be the cause in this test:
# {{testAsyncSendReceive}} doesn't set {{.directMode(true)}} on the
{{GridNioServer.Builder}}. When {{directMode=false}}, {{createSession}} leaves
{{ses.writeBuffer()}}/{{ses.readBuffer()}} {{null}}. The whole
{{cleanDirectBuffer}} block in {{close(...)}} is gated by null-checks — it
never runs for this test.
# Even with {{directMode=true}}, {{close(...)}} and {{processRead(...)}} cannot
run concurrently for the same session. Every {{close}} path goes through the
owning worker thread:
#* Public {{GridNioServer.close(ses)}} — {{offerStateChange(CLOSE)}} → drained
inside the worker's {{bodyInternal}} loop.
#* {{closeFromWorkerThread(ses)}} — only caller is
{{InboundConnectionHandler.onMessage}}, itself running on the worker.
#* {{processRead}}/{{processWrite}} exception handlers — same worker thread.
#* {{processSelectedKeysOptimized}} catch — same worker thread.
#* {{srvr.stop()}} — {{U.cancel(clientWorkers)}} → {{U.join(...)}}; the
per-worker {{finally}} in {{bodyInternal}} performs the closes.
The worker is single-threaded, and {{processSelectedKeysOptimized}} skips keys
with {{!key.isValid()}} — which becomes {{false}} immediately after
{{key.cancel()}}. The race window described in the original RCA does not exist.
h2. Fix
Stop the ack from polluting the wire in {{testAsyncSendReceive}} by giving it
its own listener that doesn't send acks. The default {{NioListener}} ack stays
where it's meaningful ({{testSendReceive}}, {{testMultiThreadedSendReceive}},
the SSL tests — all of which use {{TestClient}} with matching
{{client.receiveMessage()}} reads).
{code:java}
// In GridNioSelfTest.java
// Make NioListener fields visible to the subclass:
private static class NioListener extends GridNioServerListenerAdapter<byte[]> {
protected final AtomicInteger msgCnt = new AtomicInteger(0);
protected final AtomicBoolean sizeFailed = new AtomicBoolean(false);
protected final CountDownLatch latch;
...
}
// New listener for testAsyncSendReceive:
private static class NoAckNioListener extends NioListener {
NoAckNioListener(CountDownLatch latch) { super(latch); }
@Override public void onMessage(GridNioSession ses, byte[] data) {
int expMsgSize = getExpectedMessageSize();
if (data == null || (expMsgSize != 0 && data.length != expMsgSize))
sizeFailed.set(true);
msgCnt.incrementAndGet();
if (latch != null)
latch.countDown();
}
}
// In testAsyncSendReceive:
NioListener lsnr = new NoAckNioListener(latch);
{code}
h2. Acceptance criteria
* 20 consecutive runs of {{GridNioSelfTest#testAsyncSendReceive}} produce no
{{Invalid message size}} entries in the surefire log (currently 20/20 hit it on
macOS aarch64).
* {{testSendReceive}}, {{testMultiThreadedSendReceive}} and the SSL self-tests
stay green — they keep using the ack-sending {{NioListener}}.
h2. Out of scope
* No production code changes. The original ticket's proposed change to
{{GridNioServer.close(...)}} (move {{closeKey}} before {{cleanDirectBuffer}})
is unnecessary because the race it targets cannot happen — see "Why the
original RCA is wrong".
* No changes to {{BufferedParser}} / {{GridBufferedParser}} /
{{GridNioServerBuffer}} — the protocol mismatch is purely a test-side issue.
was:
h2. Summary
{{GridNioServer.close(GridSelectorNioSessionImpl, IgniteCheckedException,
boolean)}} releases the session's direct {{ByteBuffer}}s via
{{GridUnsafe.cleanDirectBuffer(...)}} *before* it cancels the underlying
{{SelectionKey}}. While that close runs, the corresponding
{{AbstractNioClientWorker}} can still be inside {{processRead(...)}} on the
same session — it reads from a buffer whose native memory was just freed and
now contains the allocator's poison pattern ({{0xEFEFEFEF}} on macOS / glibc /
common debug allocators). The buffered parser then sees a 4-byte "size" of
{{-269488145}} and aborts the session with {{IgniteCheckedException: Invalid
message size}}.
The exception is caught by the filter chain and logged at {{ERROR}} level;
functionally the session is already closing, so no test assertion fires. But
the noise pollutes logs, breaks any "no errors during teardown" CI guards, and
the underlying use-after-free is a latent bug — on a different allocator the
freed memory could be reused by another thread and produce silently corrupted
reads.
h2. Symptom
Reproducible (intermittent) on {{GridNioSelfTest#testAsyncSendReceive}} and
other tests that close NIO sessions with traffic still in flight. Stack trace
from a failing run on the current master:
{noformat}
[ERROR][grid-nio-worker-0-#42%nio-test-grid%][] Closing NIO session because of
unhandled exception.
class org.apache.ignite.IgniteCheckedException: Invalid message size: -269488145
at
org.apache.ignite.internal.util.nio.GridNioServerBuffer.read(GridNioServerBuffer.java:82)
at
org.apache.ignite.internal.util.nio.GridBufferedParser.decode(GridBufferedParser.java:69)
at
org.apache.ignite.internal.util.nio.GridBufferedParser.decode(GridBufferedParser.java:36)
at
org.apache.ignite.internal.util.nio.GridNioCodecFilter.onMessageReceived(GridNioCodecFilter.java:114)
at
org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:110)
at
org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onMessageReceived(GridNioServer.java:3791)
at
org.apache.ignite.internal.util.nio.GridNioFilterChain.onMessageReceived(GridNioFilterChain.java:176)
at
org.apache.ignite.internal.util.nio.GridNioServer$ByteBufferNioClientWorker.processRead(GridNioServer.java:1235)
at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2564)
at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2319)
at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1950)
at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125)
at java.base/java.lang.Thread.run(Thread.java:829)
{noformat}
{{0xEFEFEFEF}} = {{-269488145}} interpreted as a signed int. Java's
{{ByteBuffer.allocateDirect}} returns zero-initialized memory, so this pattern
is *not* coming from the original allocation — it's the allocator's poison
written when the off-heap region was freed.
h2. Root cause
{{GridNioServer.close(...)}} (around lines 2879-2893):
{code:java}
if (ses.setClosed()) {
ses.onClosed();
if (directBuf) {
if (ses.writeBuffer() != null)
GridUnsafe.cleanDirectBuffer(ses.writeBuffer()); // (1) free
off-heap
if (ses.readBuffer() != null)
GridUnsafe.cleanDirectBuffer(ses.readBuffer()); // (1)
}
if (closeSock)
closeKey(ses.key()); // (2) detach
selector
else
ses.key().cancel();
...
}
{code}
The selection key is cancelled at step (2) — that's the only thing that
prevents the worker from selecting this session again. Step (1) runs first, so
there is a window where:
* {{readBuffer}} 's native memory is freed.
* {{key}} is still registered with the selector.
* {{AbstractNioClientWorker}} can still pick up a {{readyOps()}} on this key
from the most recent {{select()}} and call {{processRead(ses)}} →
{{filterChain.onMessageReceived(ses, buf)}} → parser reads the freed bytes.
It's effectively a use-after-free; the {{0xEFEFEFEF}} bytes are the symptom
rather than the cause. With a more aggressive allocator the same window could
surface as silent data corruption rather than a thrown exception, because some
unrelated allocation could land at the same address before the worker reads it.
h2. Why now
The race has been present since {{cleanDirectBuffer}} was added to
{{close(...)}}. It surfaces more often on:
* Apple Silicon / recent macOS — the system allocator hands freed pages back to
the kernel faster, so the poison pattern lands almost immediately.
* JDK 17+ direct-buffer cleaner — runs synchronously on {{cleanDirectBuffer}}
call rather than on GC, narrowing the window for "happens to still hold zeros"
luck.
It is *not* a regression from any recent message-protocol or serializer
refactor (IGNITE-26508, IGNITE-28221, etc.). The same code path predates them.
h2. Reproduction
* Run {{GridNioSelfTest#testAsyncSendReceive}} on macOS arm64 with a recent JDK
11+ build. The JUnit assertion still passes (the latch fires before close), but
the surefire log contains the stack above on a meaningful fraction of runs.
* Synthetic stress: send N messages, immediately {{ses.close()}} from the test
thread while the worker is still draining the receive buffer. {{N=10}} is
enough to hit the race a few times in a hundred.
h2. Proposed fix
Cancel/close the selection key *before* freeing the direct buffers, so the
worker is guaranteed off the key by the time the off-heap memory is released.
Diff:
{code:java}
// GridNioServer.java, inside close(...) after ses.onClosed():
if (closeSock)
closeKey(ses.key());
else
ses.key().cancel();
if (directBuf) {
if (ses.writeBuffer() != null)
GridUnsafe.cleanDirectBuffer(ses.writeBuffer());
if (ses.readBuffer() != null)
GridUnsafe.cleanDirectBuffer(ses.readBuffer());
}
{code}
Note: {{SelectionKey.cancel()}} only marks the key for cancellation; it is
finalized on the next {{select()}}. The worker can still be inside an
in-progress {{processRead}} when we reach the buffer cleanup. To make the fix
airtight we additionally need a happens-before barrier between "no more reads
on this session" and "free the buffer".
Two options for the barrier:
# *Defer the cleanup to GC.* Drop {{cleanDirectBuffer}} from {{close(...)}} and
let the {{Cleaner}} attached to the {{ByteBuffer}} (JDK 9+) reclaim the
off-heap memory when the buffer is unreachable. After
{{ses.removeMeta(BUF_META_KEY)}} and the session reference is dropped, the
buffer becomes garbage like any other. Cost: a slightly later release of native
memory; benefit: no use-after-free is even possible.
# *Synchronize with the owning worker.* Have {{close(...)}} post a
"drain-and-release" task to the worker that owns the session and let the worker
do the {{cleanDirectBuffer}} after its current {{processRead}} finishes.
Heavier, but keeps the eager-release behaviour for users who care about it.
Option 1 is recommended — {{cleanDirectBuffer}} is an optimisation, not a
correctness requirement, and removing the eager free is a one-liner. Acceptance
test below confirms there is no measurable native-memory growth in steady state
because sessions are short-lived and the {{Cleaner}} drains them at GC pace.
h2. Acceptance criteria
* {{GridNioSelfTest#testAsyncSendReceive}} produces no {{Invalid message size}}
entries in the surefire log over 50 consecutive runs (currently 5-15% of runs
hit it).
* No {{IgniteCheckedException}} originating from {{GridNioServerBuffer.read}}
appears in any nio test class on JDK 11/17/21 on Linux x86\_64 and macOS
aarch64.
* JFR / native-memory tracking shows direct-buffer reclaim within one GC cycle
of session close (option 1) — no steady-state growth on a stress test that
opens/closes 100k sessions.
* {{GridCommunicationSelfTest}} and {{IgniteCacheMessageRecoveryAbstractTest}}
suites stay green — they exercise the same close path under load.
h2. Out of scope
* The buffered parser's protocol itself is fine; no need to harden the size
validation. The garbage size value is a symptom, not the bug.
* No changes to {{DirectMessageWriter}} / {{DirectMessageReader}} or to the
message serializer contract — those are separate concerns from session
lifecycle.
> Race in GridNioServer.close: direct buffers are freed before the NIO worker
> stops reading them
> ----------------------------------------------------------------------------------------------
>
> Key: IGNITE-28613
> URL: https://issues.apache.org/jira/browse/IGNITE-28613
> Project: Ignite
> Issue Type: Task
> Reporter: Alex Abashev
> Assignee: Alex Abashev
> Priority: Minor
> Labels: IEP-132, ise
>
> h2. Summary
> {{GridNioSelfTest#testAsyncSendReceive}} logs an {{IgniteCheckedException:
> Invalid message size: -269488145}} (= {{0xEFEFEFEF}}) on every run on master.
> The JUnit assertion still passes (the latch fires before the exception), so
> the test doesn't go red — but the stack trace pollutes the surefire log and
> breaks any "no errors during test" CI guard.
> The original ticket attributed this to a use-after-free race between
> {{GridNioServer.close(...)}} calling {{cleanDirectBuffer}} and a concurrent
> {{processRead}}. After investigation that RCA is incorrect — see "Why the
> original RCA is wrong" below. The actual cause is a test bug introduced by
> IGNITE-23040.
> h2. Symptom
> Reproducible 100% on master on macOS aarch64:
> {noformat}
> [ERROR][grid-nio-worker-0-#42%nio-test-grid%][] Closing NIO session because
> of unhandled exception.
> class org.apache.ignite.IgniteCheckedException: Invalid message size:
> -269488145
> at
> org.apache.ignite.internal.util.nio.GridNioServerBuffer.read(GridNioServerBuffer.java:82)
> at
> org.apache.ignite.internal.util.nio.GridBufferedParser.decode(GridBufferedParser.java:69)
> at
> org.apache.ignite.internal.util.nio.GridBufferedParser.decode(GridBufferedParser.java:36)
> at
> org.apache.ignite.internal.util.nio.GridNioCodecFilter.onMessageReceived(GridNioCodecFilter.java:114)
> ...
> at
> org.apache.ignite.internal.util.nio.GridNioServer$ByteBufferNioClientWorker.processRead(GridNioServer.java:1235)
> ...
> {noformat}
> 20/20 runs hit it on a recent Apple Silicon JDK 11 build.
> h2. Root cause
> The {{NioListener}} shared by both {{GridNioServer}} peers in this test sends
> an ack on every received message:
> {code:java}
> //sending ack of processing message
> ses.send(new byte[] {(byte)0xDEADBEEF});
> {code}
> {{(byte)0xDEADBEEF}} truncates to {{0xEF}} — the ack is a *one-byte* message
> containing the single byte {{0xEF}}.
> The test's {{BufferedParser}} overrides {{encode}} to skip the size prefix:
> {code:java}
> @Override public ByteBuffer encode(GridNioSession ses, Object msg) {
> // IO manager creates array ready to send.
> return msg instanceof byte[] ? ByteBuffer.wrap((byte[])msg) :
> (ByteBuffer)msg;
> }
> {code}
> So the ack hits the wire as a *single raw byte* {{0xEF}} — no 4-byte size
> header.
> The same {{NioListener}} is wired to both {{srvr1}} and {{srvr2}}, so every
> received message produces an ack going back the other way → ping-pong. Once
> four ack bytes accumulate on either side's read buffer, the receiver's
> {{GridBufferedParser.decode}} reads them as the next message's 4-byte size
> header:
> {noformat}
> msgSize = 0xEF << 24 | 0xEF << 16 | 0xEF << 8 | 0xEF
> = 0xEFEFEFEF
> = -269488145 (signed)
> {noformat}
> {{GridNioServerBuffer.read}} then throws {{Invalid message size: -269488145}}.
> The 10 originals (each properly size-prefixed by {{createMessageWithSize()}})
> arrive and count down the latch *before* enough ack bytes accumulate, which
> is why the assertion still passes.
> h2. Why it surfaces now
> Introduced by IGNITE-23040 "Fix GridNioSslSelfTest" ([PR
> #11505|https://github.com/apache/ignite/pull/11505]), commit {{8782b4882ff}},
> 24 Sep 2024. That commit added the ack send to fix the *SSL* self-test (where
> the peer is a {{TestClient}} that explicitly calls
> {{client.receiveMessage()}} to drain the ack), but it modified the *shared*
> {{NioListener}} rather than introducing a dedicated one. For
> {{testAsyncSendReceive}}, where both peers run a {{GridNioServer}} with the
> size-prefix-skipping {{BufferedParser}}, the protocol mismatch produces the
> spurious error.
> Master has shipped this log noise on every run for ~1.5 years. The assertion
> never fires because the test's contract (10 originals received) is satisfied
> before the parser blows up on the malformed ack stream.
> h2. Why the original RCA is wrong
> The original ticket attributed the symptom to:
> * {{GridNioServer.close(...)}} calling
> {{cleanDirectBuffer(ses.readBuffer())}} *before* cancelling {{ses.key()}},
> while a worker is still inside {{processRead(...)}} on that session.
> Two reasons this can't be the cause in this test:
> # {{testAsyncSendReceive}} doesn't set {{.directMode(true)}} on the
> {{GridNioServer.Builder}}. When {{directMode=false}}, {{createSession}}
> leaves {{ses.writeBuffer()}}/{{ses.readBuffer()}} {{null}}. The whole
> {{cleanDirectBuffer}} block in {{close(...)}} is gated by null-checks — it
> never runs for this test.
> # Even with {{directMode=true}}, {{close(...)}} and {{processRead(...)}}
> cannot run concurrently for the same session. Every {{close}} path goes
> through the owning worker thread:
> #* Public {{GridNioServer.close(ses)}} — {{offerStateChange(CLOSE)}} →
> drained inside the worker's {{bodyInternal}} loop.
> #* {{closeFromWorkerThread(ses)}} — only caller is
> {{InboundConnectionHandler.onMessage}}, itself running on the worker.
> #* {{processRead}}/{{processWrite}} exception handlers — same worker thread.
> #* {{processSelectedKeysOptimized}} catch — same worker thread.
> #* {{srvr.stop()}} — {{U.cancel(clientWorkers)}} → {{U.join(...)}}; the
> per-worker {{finally}} in {{bodyInternal}} performs the closes.
> The worker is single-threaded, and {{processSelectedKeysOptimized}} skips
> keys with {{!key.isValid()}} — which becomes {{false}} immediately after
> {{key.cancel()}}. The race window described in the original RCA does not
> exist.
> h2. Fix
> Stop the ack from polluting the wire in {{testAsyncSendReceive}} by giving it
> its own listener that doesn't send acks. The default {{NioListener}} ack
> stays where it's meaningful ({{testSendReceive}},
> {{testMultiThreadedSendReceive}}, the SSL tests — all of which use
> {{TestClient}} with matching {{client.receiveMessage()}} reads).
> {code:java}
> // In GridNioSelfTest.java
> // Make NioListener fields visible to the subclass:
> private static class NioListener extends GridNioServerListenerAdapter<byte[]>
> {
> protected final AtomicInteger msgCnt = new AtomicInteger(0);
> protected final AtomicBoolean sizeFailed = new AtomicBoolean(false);
> protected final CountDownLatch latch;
> ...
> }
> // New listener for testAsyncSendReceive:
> private static class NoAckNioListener extends NioListener {
> NoAckNioListener(CountDownLatch latch) { super(latch); }
> @Override public void onMessage(GridNioSession ses, byte[] data) {
> int expMsgSize = getExpectedMessageSize();
> if (data == null || (expMsgSize != 0 && data.length != expMsgSize))
> sizeFailed.set(true);
> msgCnt.incrementAndGet();
> if (latch != null)
> latch.countDown();
> }
> }
> // In testAsyncSendReceive:
> NioListener lsnr = new NoAckNioListener(latch);
> {code}
> h2. Acceptance criteria
> * 20 consecutive runs of {{GridNioSelfTest#testAsyncSendReceive}} produce no
> {{Invalid message size}} entries in the surefire log (currently 20/20 hit it
> on macOS aarch64).
> * {{testSendReceive}}, {{testMultiThreadedSendReceive}} and the SSL
> self-tests stay green — they keep using the ack-sending {{NioListener}}.
> h2. Out of scope
> * No production code changes. The original ticket's proposed change to
> {{GridNioServer.close(...)}} (move {{closeKey}} before {{cleanDirectBuffer}})
> is unnecessary because the race it targets cannot happen — see "Why the
> original RCA is wrong".
> * No changes to {{BufferedParser}} / {{GridBufferedParser}} /
> {{GridNioServerBuffer}} — the protocol mismatch is purely a test-side issue.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)