[ 
https://issues.apache.org/jira/browse/IGNITE-28255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Abashev updated IGNITE-28255:
----------------------------------
    Description:     (was: Summary:
MarshallerCacheJobRunNodeRestartTest.testJobRun fails intermittently with a 
timeout

Description:

The test MarshallerCacheJobRunNodeRestartTest.testJobRun hangs and times out 
after 5 minutes (300,000 ms).

TC link: 
https://ci2.ignite.apache.org/test/381112157178694638?currentProjectId=IgniteTests24Java8&branch=%3Cdefault%3E

Failure rate: 2 failures out of 68 runs (~3%), both on aitc-lin15, branch 
refs/heads/master (builds #41053, #41049)

Root cause (from thread dump):

The test runner thread 
test-runner-#83435%cache.MarshallerCacheJobRunNodeRestartTest% is stuck in 
WAITING state inside GridTestUtils.runMultiThreaded() at Thread.join(), waiting 
for worker threads that never finish:

Thread [name="test-runner-#83435%cache.MarshallerCacheJobRunNodeRestartTest%", 
state=WAITING]
  at java.lang.Object.wait(Native Method)
  at java.lang.Thread.join(Thread.java:1304)
  at o.a.i.testframework.GridTestUtils.runMultiThreaded(GridTestUtils.java:1124)
  at 
o.a.i.i.processors.cache.MarshallerCacheJobRunNodeRestartTest.testJobRun(MarshallerCacheJobRunNodeRestartTest.java:65)

The main thread holds multiple ReentrantReadWriteLock instances (13 locked 
synchronizers visible in the dump).

Additionally, a suspicious warning appears in the log just before the hang:

Joining node doesn't have stored group keys 
[node=03e08542-cd7b-4a95-a9fe-bae553f00004]

This suggests a worker thread may be stuck waiting for group key exchange to 
complete during node restart. If that exchange never finishes, the worker never 
returns and the entire runMultiThreaded call hangs indefinitely.
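
To find out which worker is blocked, the unbounded Thread.join() seen in the 
stack trace could be replaced, for debugging only, with a bounded join that 
reports the workers still alive. The following is a minimal sketch using only 
JDK APIs. StuckWorkerReport and joinOrReport are hypothetical names and are 
not part of Ignite or GridTestUtils. The sketch assumes the test can get hold 
of the worker Thread objects, which the dump does not confirm.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical debugging helper, not part of Ignite or GridTestUtils.
final class StuckWorkerReport {
    /**
     * Joins the workers against a shared deadline instead of an unbounded join()
     * and returns one description per worker that is still alive afterwards.
     */
    static List<String> joinOrReport(List<Thread> workers, long timeoutMs) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        List<String> stuck = new ArrayList<>();

        for (Thread t : workers) {
            // join(0) would wait forever, so always pass at least 1 ms.
            long leftMs = Math.max(1L, (deadline - System.nanoTime()) / 1_000_000L);

            try {
                t.join(leftMs);
            }
            catch (InterruptedException e) {
                // Preserve the interrupt flag and stop waiting for the remaining workers.
                Thread.currentThread().interrupt();
                break;
            }

            if (!t.isAlive())
                continue;

            StringBuilder sb = new StringBuilder(t.getName()).append(" state=").append(t.getState());

            // getThreadInfo returns null if the thread terminated in the meantime.
            ThreadInfo info = mx.getThreadInfo(t.getId(), Integer.MAX_VALUE);

            if (info != null) {
                if (info.getLockName() != null)
                    sb.append(" waitingOn=").append(info.getLockName());

                for (StackTraceElement el : info.getStackTrace())
                    sb.append(System.lineSeparator()).append("  at ").append(el);
            }

            stuck.add(sb.toString());
        }

        return stuck;
    }

    public static void main(String[] args) {
        CountDownLatch never = new CountDownLatch(1);

        Thread ok = new Thread(() -> { }, "ok-worker");
        Thread blocked = new Thread(() -> {
            try { never.await(); } catch (InterruptedException ignored) { }
        }, "blocked-worker");

        blocked.setDaemon(true); // Let the JVM exit even if the worker stays blocked.
        ok.start();
        blocked.start();

        joinOrReport(List.of(ok, blocked), 500).forEach(System.out::println);

        never.countDown();
    }
}
```

Each report entry carries the worker's name, its state, the lock it is waiting 
on (when the JVM exposes one) and its stack trace. That may help confirm or 
rule out the group key exchange hypothesis.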

Environment:
- Ignite version: 2.18.0-SNAPSHOT#20260317
- JVM: OpenJDK 17.0.8.1+1 Eclipse Adoptium
- OS: Linux 5.4.0-216-generic amd64
- Agent: aitc-lin15

Steps to investigate:
1. Check why the restarted node doesn't have stored group keys: is the key 
exchange protocol completing correctly during 
MarshallerCacheJobRunNodeRestartTest?
2. Identify which worker thread inside runMultiThreaded is blocked and why it 
never returns.
3. Check for a race condition between node restart and group key propagation in 
the marshaller cache.)

> Fix java.io.NotSerializableException: 
> org.apache.ignite.internal.processors.marshaller.MarshallerMappingItem
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-28255
>                 URL: https://issues.apache.org/jira/browse/IGNITE-28255
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Alex Abashev
>            Assignee: Alex Abashev
>            Priority: Major
>              Labels: IEP-132
>             Fix For: 2.19
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
