[ https://issues.apache.org/jira/browse/IGNITE-20301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Roman Puchkovskiy updated IGNITE-20301:
---------------------------------------
Description:

inMemoryNodeRestartNotLeader and inMemoryNodeFullPartitionRestart fail roughly 50% of the time on my machine.

For both tests, the following is seen in the console when they fail:

[2023-08-29T11:20:35,053][ERROR][%iiimnrt_imnfpr_3344%JRaft-Common-Executor-0][LogThreadPoolExecutor] Uncaught exception in pool: %iiimnrt_imnfpr_3344%JRaft-Common-Executor-, org.apache.ignite.raft.jraft.util.MetricThreadPoolExecutor@76504d30[Running, pool size = 4, active threads = 2, queued tasks = 0, completed tasks = 1601].
java.lang.IllegalArgumentException: null
    at org.apache.ignite.raft.jraft.util.Requires.requireTrue(Requires.java:64) ~[main/:?]
    at org.apache.ignite.raft.jraft.core.Replicator.fillCommonFields(Replicator.java:1553) ~[main/:?]
    at org.apache.ignite.raft.jraft.core.Replicator.sendEntries(Replicator.java:1631) ~[main/:?]
    at org.apache.ignite.raft.jraft.core.Replicator.sendEntries(Replicator.java:1601) ~[main/:?]
    at org.apache.ignite.raft.jraft.core.Replicator.continueSending(Replicator.java:983) ~[main/:?]
    at org.apache.ignite.raft.jraft.core.Replicator.lambda$waitMoreEntries$9(Replicator.java:1583) ~[main/:?]
    at org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.runOnNewLog(LogManagerImpl.java:1197) ~[main/:?]
    at org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.lambda$wakeupAllWaiter$6(LogManagerImpl.java:398) ~[main/:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:834) ~[?:?]

*Update*: as of 2023-11-01, these failures can no longer be reproduced, but others can be seen (at a much lower frequency, approximately 1 in a few dozen runs).

These happen for the following reason. When we recover an in-memory partition, we want to ask the other participants of the group whether they have data for this partition. If the corresponding nodes are not present in our local physical topology, we treat this as if they did not have the data. So if we recover a partition before SWIM has managed to fully deliver information about the other cluster nodes, we skip those nodes, may wrongly conclude that there is no majority, and fail the partition start.

A possible solution is to wait until the nodes appear in the topology, up to some timeout (for example, 3 seconds).
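A minimal sketch of that timeout-bounded wait, written against hypothetical types: PhysicalTopology, waitForPeersInTopology, majorityReachable, and the timeout constants are all illustrative names and do not come from the Ignite code base.

    import java.util.List;
    import java.util.Set;
    import java.util.concurrent.TimeUnit;
    import java.util.stream.Collectors;

    /**
     * Sketch of the proposed fix: before deciding that a peer of an in-memory partition has
     * no data, wait (up to a timeout) for that peer to show up in the local physical topology,
     * so that nodes SWIM simply has not delivered yet are not mistaken for nodes without data.
     * All type and method names here are hypothetical.
     */
    class InMemoryPartitionRecoverySketch {
        /** Hypothetical view of the local physical topology as delivered by SWIM. */
        interface PhysicalTopology {
            Set<String> memberNames();
        }

        private static final long TOPOLOGY_WAIT_TIMEOUT_MS = 3_000; // "some timeout (for example, 3 seconds)"
        private static final long POLL_INTERVAL_MS = 50;

        /**
         * Waits until all given peers are visible in the local physical topology or the timeout
         * expires, and returns the peers that are still missing after the wait.
         */
        static List<String> waitForPeersInTopology(PhysicalTopology topology, List<String> peers)
                throws InterruptedException {
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(TOPOLOGY_WAIT_TIMEOUT_MS);

            while (true) {
                Set<String> visible = topology.memberNames();
                List<String> missing = peers.stream()
                        .filter(peer -> !visible.contains(peer))
                        .collect(Collectors.toList());

                if (missing.isEmpty() || System.nanoTime() - deadline >= 0) {
                    return missing;
                }

                Thread.sleep(POLL_INTERVAL_MS);
            }
        }

        /**
         * Recovery-time check: only peers that are still absent after the wait are treated as
         * having no data; if the remaining peers cannot form a majority, the partition start
         * fails, as described above. This counts bare topology presence and is a simplification.
         */
        static boolean majorityReachable(PhysicalTopology topology, List<String> peers)
                throws InterruptedException {
            List<String> missing = waitForPeersInTopology(topology, peers);
            int reachable = peers.size() - missing.size();
            return reachable >= peers.size() / 2 + 1;
        }
    }

A real implementation would more likely subscribe to topology events instead of polling, and would still have to query the surviving peers for actual data availability; the sketch only illustrates bounding the wait for topology with a timeout so that recovery neither fails spuriously nor blocks forever.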
was:

inMemoryNodeRestartNotLeader and inMemoryNodeFullPartitionRestart fail 50/50 on my machine.

For both tests, the following is seen in the console when they fail:

[2023-08-29T11:20:35,053][ERROR][%iiimnrt_imnfpr_3344%JRaft-Common-Executor-0][LogThreadPoolExecutor] Uncaught exception in pool: %iiimnrt_imnfpr_3344%JRaft-Common-Executor-, org.apache.ignite.raft.jraft.util.MetricThreadPoolExecutor@76504d30[Running, pool size = 4, active threads = 2, queued tasks = 0, completed tasks = 1601].
java.lang.IllegalArgumentException: null
    at org.apache.ignite.raft.jraft.util.Requires.requireTrue(Requires.java:64) ~[main/:?]
    at org.apache.ignite.raft.jraft.core.Replicator.fillCommonFields(Replicator.java:1553) ~[main/:?]
    at org.apache.ignite.raft.jraft.core.Replicator.sendEntries(Replicator.java:1631) ~[main/:?]
    at org.apache.ignite.raft.jraft.core.Replicator.sendEntries(Replicator.java:1601) ~[main/:?]
    at org.apache.ignite.raft.jraft.core.Replicator.continueSending(Replicator.java:983) ~[main/:?]
    at org.apache.ignite.raft.jraft.core.Replicator.lambda$waitMoreEntries$9(Replicator.java:1583) ~[main/:?]
    at org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.runOnNewLog(LogManagerImpl.java:1197) ~[main/:?]
    at org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.lambda$wakeupAllWaiter$6(LogManagerImpl.java:398) ~[main/:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:834) ~[?:?]

*Update*: as of 2023-11-01, these failures do not reproduce, but others can be seen (with much lower frequency, approx 1 in a few dozen runs).

These happen for the following reason. When we recover an in-memory partition, we want to ask other participants of the group whether they have data in this partition or not. If we don't have the corresponding nodes in our local physical topology, we treat this as if they did not have the data. So if we recover a partition before SWIM has managed to fully deliver information about other cluster nodes, we skip those nodes, so we might think there is no majority and fail the partition start.

A possible solution is to wait till the nodes appear in the topology, up to some timeout (like 3 seconds).
> ItIgniteInMemoryNodeRestartTest tests are flaky
> -----------------------------------------------
>
>                 Key: IGNITE-20301
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20301
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Roman Puchkovskiy
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3, tech-debt
>             Fix For: 3.0.0-beta2
>
>
> inMemoryNodeRestartNotLeader and inMemoryNodeFullPartitionRestart fail roughly 50% of the time on my machine.
>
> For both tests, the following is seen in the console when they fail:
>
> [2023-08-29T11:20:35,053][ERROR][%iiimnrt_imnfpr_3344%JRaft-Common-Executor-0][LogThreadPoolExecutor] Uncaught exception in pool: %iiimnrt_imnfpr_3344%JRaft-Common-Executor-, org.apache.ignite.raft.jraft.util.MetricThreadPoolExecutor@76504d30[Running, pool size = 4, active threads = 2, queued tasks = 0, completed tasks = 1601].
> java.lang.IllegalArgumentException: null
>     at org.apache.ignite.raft.jraft.util.Requires.requireTrue(Requires.java:64) ~[main/:?]
>     at org.apache.ignite.raft.jraft.core.Replicator.fillCommonFields(Replicator.java:1553) ~[main/:?]
>     at org.apache.ignite.raft.jraft.core.Replicator.sendEntries(Replicator.java:1631) ~[main/:?]
>     at org.apache.ignite.raft.jraft.core.Replicator.sendEntries(Replicator.java:1601) ~[main/:?]
>     at org.apache.ignite.raft.jraft.core.Replicator.continueSending(Replicator.java:983) ~[main/:?]
>     at org.apache.ignite.raft.jraft.core.Replicator.lambda$waitMoreEntries$9(Replicator.java:1583) ~[main/:?]
>     at org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.runOnNewLog(LogManagerImpl.java:1197) ~[main/:?]
>     at org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.lambda$wakeupAllWaiter$6(LogManagerImpl.java:398) ~[main/:?]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
>     at java.lang.Thread.run(Thread.java:834) ~[?:?]
>
> *Update*: as of 2023-11-01, these failures can no longer be reproduced, but others can be seen (at a much lower frequency, approximately 1 in a few dozen runs).
>
> These happen for the following reason. When we recover an in-memory partition, we want to ask the other participants of the group whether they have data for this partition. If the corresponding nodes are not present in our local physical topology, we treat this as if they did not have the data. So if we recover a partition before SWIM has managed to fully deliver information about the other cluster nodes, we skip those nodes, may wrongly conclude that there is no majority, and fail the partition start.
>
> A possible solution is to wait until the nodes appear in the topology, up to some timeout (for example, 3 seconds).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)