[ https://issues.apache.org/jira/browse/IGNITE-20301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roman Puchkovskiy updated IGNITE-20301:
---------------------------------------
    Description: 
inMemoryNodeRestartNotLeader and inMemoryNodeFullPartitionRestart fail roughly 
50% of the time on my machine.

For both tests, the following is seen in the console when they fail:

[2023-08-29T11:20:35,053][ERROR][%iiimnrt_imnfpr_3344%JRaft-Common-Executor-0][LogThreadPoolExecutor] Uncaught exception in pool: %iiimnrt_imnfpr_3344%JRaft-Common-Executor-, org.apache.ignite.raft.jraft.util.MetricThreadPoolExecutor@76504d30[Running, pool size = 4, active threads = 2, queued tasks = 0, completed tasks = 1601].
java.lang.IllegalArgumentException: null
    at org.apache.ignite.raft.jraft.util.Requires.requireTrue(Requires.java:64) ~[main/:?]
    at org.apache.ignite.raft.jraft.core.Replicator.fillCommonFields(Replicator.java:1553) ~[main/:?]
    at org.apache.ignite.raft.jraft.core.Replicator.sendEntries(Replicator.java:1631) ~[main/:?]
    at org.apache.ignite.raft.jraft.core.Replicator.sendEntries(Replicator.java:1601) ~[main/:?]
    at org.apache.ignite.raft.jraft.core.Replicator.continueSending(Replicator.java:983) ~[main/:?]
    at org.apache.ignite.raft.jraft.core.Replicator.lambda$waitMoreEntries$9(Replicator.java:1583) ~[main/:?]
    at org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.runOnNewLog(LogManagerImpl.java:1197) ~[main/:?]
    at org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.lambda$wakeupAllWaiter$6(LogManagerImpl.java:398) ~[main/:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:834) ~[?:?]

 

{*}Update{*}: as of 2023-11-01, these failures cannot be reproduced anymore, 
but other failures still occur, with a much lower frequency (approximately 1 in 
a few dozen runs).

These happen for the following reason. When we recover an in-memory partition, 
we ask the other participants of the group whether they have data for this 
partition. If the corresponding nodes are not present in our local physical 
topology, we treat them as if they had no data. So if we recover a partition 
before SWIM has fully delivered information about the other cluster nodes, we 
skip those nodes, conclude that there is no majority, and fail the partition 
start (see the sketch below).
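
Below is a minimal Java sketch of that decision logic. The names (DataAnswer, 
majorityHasData, plain Set/Map arguments) are hypothetical, not the actual 
Ignite 3 recovery types; it only illustrates how skipping nodes that SWIM has 
not delivered yet shrinks the apparent majority:

{code:java}
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch (not the actual Ignite 3 recovery code): peers that are
 * not yet visible in the local physical topology are counted as "no data",
 * so a healthy majority can look lost if SWIM has not caught up yet.
 */
final class InMemoryRecoveryDecision {
    /** Answer from a peer: does it hold data for the partition? */
    enum DataAnswer { HAS_DATA, NO_DATA }

    static boolean majorityHasData(Set<String> peers,
                                   Set<String> physicalTopology,
                                   Map<String, DataAnswer> answers) {
        long withData = peers.stream()
                // Peers unknown to the local physical topology are skipped...
                .filter(physicalTopology::contains)
                .filter(p -> answers.getOrDefault(p, DataAnswer.NO_DATA) == DataAnswer.HAS_DATA)
                .count();

        // ...so withData can fall below the majority threshold even though
        // the data actually exists on the not-yet-discovered nodes.
        return withData >= peers.size() / 2 + 1;
    }
}
{code}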

A possible solution is to wait until the nodes appear in the topology, up to 
some timeout (for example, 3 seconds).
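
A minimal sketch of the proposed mitigation, assuming a simple polling loop 
over a topology snapshot (awaitPeersInTopology and the Supplier-based snapshot 
are illustrative, not an existing Ignite API); a real fix could instead 
subscribe to topology events:

{code:java}
import java.util.Collection;
import java.util.Set;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical helper: wait until the expected peers are seen by SWIM, up to a timeout. */
final class PeerTopologyAwait {
    static void awaitPeersInTopology(Collection<String> expectedPeers,
                                     Supplier<Set<String>> physicalTopologySnapshot,
                                     long timeoutMillis) throws InterruptedException, TimeoutException {
        long deadline = System.currentTimeMillis() + timeoutMillis;

        while (!physicalTopologySnapshot.get().containsAll(expectedPeers)) {
            if (System.currentTimeMillis() >= deadline) {
                // After the timeout (e.g. 3 seconds) fall back to the current behaviour:
                // peers still missing from the topology are treated as having no data.
                throw new TimeoutException("Peers did not appear in the physical topology: " + expectedPeers);
            }
            Thread.sleep(50); // simple polling; an event-based wait would avoid busy-waiting
        }
    }
}
{code}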

  was:
inMemoryNodeRestartNotLeader and inMemoryNodeFullPartitionRestart fail 50/50 on 
my machine.

For both tests, the following is seen in the console when they fail:

[2023-08-29T11:20:35,053][ERROR][%iiimnrt_imnfpr_3344%JRaft-Common-Executor-0][LogThreadPoolExecutor] Uncaught exception in pool: %iiimnrt_imnfpr_3344%JRaft-Common-Executor-, org.apache.ignite.raft.jraft.util.MetricThreadPoolExecutor@76504d30[Running, pool size = 4, active threads = 2, queued tasks = 0, completed tasks = 1601].
java.lang.IllegalArgumentException: null
    at org.apache.ignite.raft.jraft.util.Requires.requireTrue(Requires.java:64) ~[main/:?]
    at org.apache.ignite.raft.jraft.core.Replicator.fillCommonFields(Replicator.java:1553) ~[main/:?]
    at org.apache.ignite.raft.jraft.core.Replicator.sendEntries(Replicator.java:1631) ~[main/:?]
    at org.apache.ignite.raft.jraft.core.Replicator.sendEntries(Replicator.java:1601) ~[main/:?]
    at org.apache.ignite.raft.jraft.core.Replicator.continueSending(Replicator.java:983) ~[main/:?]
    at org.apache.ignite.raft.jraft.core.Replicator.lambda$waitMoreEntries$9(Replicator.java:1583) ~[main/:?]
    at org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.runOnNewLog(LogManagerImpl.java:1197) ~[main/:?]
    at org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.lambda$wakeupAllWaiter$6(LogManagerImpl.java:398) ~[main/:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:834) ~[?:?]

 

{*}Update{*}: as of 2023-11-01, these failures do not reproduce, but others can 
be seen (with much lower frequency, approx 1 in a few dozen runs). These happen 
for the following reason. When we recover an in-memory partition, we want to 
ask other participants of the group whether they have data in this partition or 
not. If we don't have the corresponding nodes in our local physical topology, 
we treat this as if they did not have the data. So if we recover a partition 
before SWIM has managed to fully deliver information about other cluster nodes, 
we skip those nodes, so we might think there is no majority and fail the 
partition start.

A possible solution is to wait till the nodes appear in the topology, up to 
some timeout (like 3 seconds).


> ItIgniteInMemoryNodeRestartTest tests are flaky
> -----------------------------------------------
>
>                 Key: IGNITE-20301
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20301
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Roman Puchkovskiy
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3, tech-debt
>             Fix For: 3.0.0-beta2
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
