[ 
https://issues.apache.org/jira/browse/IGNITE-28613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Abashev updated IGNITE-28613:
----------------------------------
    Summary: GridNioSelfTest#testAsyncSendReceive: spurious "Invalid message 
size" in surefire log from listener ack pingpong  (was: Race in 
GridNioServer.close: direct buffers are freed before the NIO worker stops 
reading them)

> GridNioSelfTest#testAsyncSendReceive: spurious "Invalid message size" in 
> surefire log from listener ack pingpong
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-28613
>                 URL: https://issues.apache.org/jira/browse/IGNITE-28613
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Alex Abashev
>            Assignee: Alex Abashev
>            Priority: Minor
>              Labels: IEP-132, ise
>
> h2. Summary
> {{GridNioSelfTest#testAsyncSendReceive}} logs an {{IgniteCheckedException: 
> Invalid message size: -269488145}} (= {{0xEFEFEFEF}}) on every run on master. 
> The JUnit assertion still passes (the latch fires before the exception), so 
> the test doesn't go red — but the stack trace pollutes the surefire log and 
> breaks any "no errors during test" CI guard.
> The original ticket attributed this to a use-after-free race between 
> {{GridNioServer.close(...)}} calling {{cleanDirectBuffer}} and a concurrent 
> {{processRead}}. After investigation that RCA is incorrect — see "Why the 
> original RCA is wrong" below. The actual cause is a test bug introduced by 
> IGNITE-23040.
> h2. Symptom
> Reproducible 100% on master on macOS aarch64:
> {noformat}
> [ERROR][grid-nio-worker-0-#42%nio-test-grid%][] Closing NIO session because 
> of unhandled exception.
> class org.apache.ignite.IgniteCheckedException: Invalid message size: 
> -269488145
>     at 
> org.apache.ignite.internal.util.nio.GridNioServerBuffer.read(GridNioServerBuffer.java:82)
>     at 
> org.apache.ignite.internal.util.nio.GridBufferedParser.decode(GridBufferedParser.java:69)
>     at 
> org.apache.ignite.internal.util.nio.GridBufferedParser.decode(GridBufferedParser.java:36)
>     at 
> org.apache.ignite.internal.util.nio.GridNioCodecFilter.onMessageReceived(GridNioCodecFilter.java:114)
>     ...
>     at 
> org.apache.ignite.internal.util.nio.GridNioServer$ByteBufferNioClientWorker.processRead(GridNioServer.java:1235)
>     ...
> {noformat}
> 20/20 runs hit it on a recent Apple Silicon JDK 11 build.
> h2. Root cause
> The {{NioListener}} shared by both {{GridNioServer}} peers in this test sends 
> an ack on every received message:
> {code:java}
> //sending ack of processing message
> ses.send(new byte[] {(byte)0xDEADBEEF});
> {code}
> {{(byte)0xDEADBEEF}} truncates to {{0xEF}} — the ack is a *one-byte* message 
> containing the single byte {{0xEF}}.
> The test's {{BufferedParser}} overrides {{encode}} to skip the size prefix:
> {code:java}
> @Override public ByteBuffer encode(GridNioSession ses, Object msg) {
>     // IO manager creates array ready to send.
>     return msg instanceof byte[] ? ByteBuffer.wrap((byte[])msg) : 
> (ByteBuffer)msg;
> }
> {code}
> So the ack hits the wire as a *single raw byte* {{0xEF}} — no 4-byte size 
> header.
> The same {{NioListener}} is wired to both {{srvr1}} and {{srvr2}}, so every 
> received message produces an ack going back the other way → ping-pong. Once 
> four ack bytes accumulate on either side's read buffer, the receiver's 
> {{GridBufferedParser.decode}} reads them as the next message's 4-byte size 
> header:
> {noformat}
> msgSize = 0xEF << 24 | 0xEF << 16 | 0xEF << 8 | 0xEF
>         = 0xEFEFEFEF
>         = -269488145 (signed)
> {noformat}
> {{GridNioServerBuffer.read}} then throws {{Invalid message size: -269488145}}.
> The 10 originals (each properly size-prefixed by {{createMessageWithSize()}}) 
> arrive and count down the latch *before* enough ack bytes accumulate, which 
> is why the assertion still passes.
> h2. Why it surfaces now
> Introduced by IGNITE-23040 "Fix GridNioSslSelfTest" ([PR 
> #11505|https://github.com/apache/ignite/pull/11505]), commit {{8782b4882ff}}, 
> 24 Sep 2024. That commit added the ack send to fix the *SSL* self-test (where 
> the peer is a {{TestClient}} that explicitly calls 
> {{client.receiveMessage()}} to drain the ack), but it modified the *shared* 
> {{NioListener}} rather than introducing a dedicated one. For 
> {{testAsyncSendReceive}}, where both peers run a {{GridNioServer}} with the 
> size-prefix-skipping {{BufferedParser}}, the protocol mismatch produces the 
> spurious error.
> Master has shipped this log noise on every run for ~1.5 years. The assertion 
> never fires because the test's contract (10 originals received) is satisfied 
> before the parser blows up on the malformed ack stream.
> h2. Why the original RCA is wrong
> The original ticket attributed the symptom to:
> * {{GridNioServer.close(...)}} calling 
> {{cleanDirectBuffer(ses.readBuffer())}} *before* cancelling {{ses.key()}}, 
> while a worker is still inside {{processRead(...)}} on that session.
> Two reasons this can't be the cause in this test:
> # {{testAsyncSendReceive}} doesn't set {{.directMode(true)}} on the 
> {{GridNioServer.Builder}}. When {{directMode=false}}, {{createSession}} 
> leaves {{ses.writeBuffer()}}/{{ses.readBuffer()}} {{null}}. The whole 
> {{cleanDirectBuffer}} block in {{close(...)}} is gated by null-checks — it 
> never runs for this test.
> # Even with {{directMode=true}}, {{close(...)}} and {{processRead(...)}} 
> cannot run concurrently for the same session. Every {{close}} path goes 
> through the owning worker thread:
> #* Public {{GridNioServer.close(ses)}} — {{offerStateChange(CLOSE)}} → 
> drained inside the worker's {{bodyInternal}} loop.
> #* {{closeFromWorkerThread(ses)}} — only caller is 
> {{InboundConnectionHandler.onMessage}}, itself running on the worker.
> #* {{processRead}}/{{processWrite}} exception handlers — same worker thread.
> #* {{processSelectedKeysOptimized}} catch — same worker thread.
> #* {{srvr.stop()}} — {{U.cancel(clientWorkers)}} → {{U.join(...)}}; the 
> per-worker {{finally}} in {{bodyInternal}} performs the closes.
> The worker is single-threaded, and {{processSelectedKeysOptimized}} skips 
> keys with {{!key.isValid()}} — which becomes {{false}} immediately after 
> {{key.cancel()}}. The race window described in the original RCA does not 
> exist.
> h2. Fix
> Stop the ack from polluting the wire in {{testAsyncSendReceive}} by giving it 
> its own listener that doesn't send acks. The default {{NioListener}} ack 
> stays where it's meaningful ({{testSendReceive}}, 
> {{testMultiThreadedSendReceive}}, the SSL tests — all of which use 
> {{TestClient}} with matching {{client.receiveMessage()}} reads).
> {code:java}
> // In GridNioSelfTest.java
> // Make NioListener fields visible to the subclass:
> private static class NioListener extends GridNioServerListenerAdapter<byte[]> 
> {
>     protected final AtomicInteger msgCnt = new AtomicInteger(0);
>     protected final AtomicBoolean sizeFailed = new AtomicBoolean(false);
>     protected final CountDownLatch latch;
>     ...
> }
> // New listener for testAsyncSendReceive:
> private static class NoAckNioListener extends NioListener {
>     NoAckNioListener(CountDownLatch latch) { super(latch); }
>     @Override public void onMessage(GridNioSession ses, byte[] data) {
>         int expMsgSize = getExpectedMessageSize();
>         if (data == null || (expMsgSize != 0 && data.length != expMsgSize))
>             sizeFailed.set(true);
>         msgCnt.incrementAndGet();
>         if (latch != null)
>             latch.countDown();
>     }
> }
> // In testAsyncSendReceive:
> NioListener lsnr = new NoAckNioListener(latch);
> {code}
> h2. Acceptance criteria
> * 20 consecutive runs of {{GridNioSelfTest#testAsyncSendReceive}} produce no 
> {{Invalid message size}} entries in the surefire log (currently 20/20 hit it 
> on macOS aarch64).
> * {{testSendReceive}}, {{testMultiThreadedSendReceive}} and the SSL 
> self-tests stay green — they keep using the ack-sending {{NioListener}}.
> h2. Out of scope
> * No production code changes. The original ticket's proposed change to 
> {{GridNioServer.close(...)}} (move {{closeKey}} before {{cleanDirectBuffer}}) 
> is unnecessary because the race it targets cannot happen — see "Why the 
> original RCA is wrong".
> * No changes to {{BufferedParser}} / {{GridBufferedParser}} / 
> {{GridNioServerBuffer}} — the protocol mismatch is purely a test-side issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to