[ 
https://issues.apache.org/jira/browse/IGNITE-27962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksandr Chesnokov updated IGNITE-27962:
-----------------------------------------
    Attachment: Deadlock_Scenario_Filtered (1).txt

> IgniteLock may hang forever in busy-wait during node stop
> ---------------------------------------------------------
>
>                 Key: IGNITE-27962
>                 URL: https://issues.apache.org/jira/browse/IGNITE-27962
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.17
>            Reporter: Aleksandr Chesnokov
>            Priority: Major
>         Attachments: Deadlock_Scenario_Filtered (1).txt
>
>
> *Summary*
> Ignite 2.17.0: a distributed reentrant lock may hang forever during a Kubernetes 
> rolling upgrade when {{failoverSafe=true}}.
> *Environment*
>  * Apache Ignite 2.17.0
>  * Kubernetes deployment
>  * Backend pods = Ignite server nodes
>  * Frontend pods = thick clients
>  * Rolling upgrade (server nodes restarted one-by-one)
> *Problem*
> During a rolling upgrade, a thick client may hang indefinitely while calling:
>  * {{lock()}}
>  * {{unlock()}}
>  * {{tryLock(timeout)}}
> This happens when a server node is stopped at a specific moment during lock 
> acquisition.
> *Expected behavior*
> When a server node leaves the cluster, the client should either:
>  * recover automatically, or
>  * fail fast with an exception.
> {{tryLock(timeout)}} should respect the timeout.
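For comparison, the sketch below (pure JDK, not Ignite code) shows the fail-fast behavior expected of {{tryLock(timeout)}}: {{java.util.concurrent.locks.ReentrantLock}} gives up within the timeout even when the lock is held indefinitely by another thread, instead of hanging:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TryLockTimeoutDemo {
    // Returns true if tryLock gave up within its timeout instead of hanging,
    // which is the behavior IgniteLock.tryLock(timeout) is expected to mirror.
    public static boolean timesOutWhenHeldElsewhere() throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        Thread holder = new Thread(() -> {
            lock.lock();                    // hold the lock and do not release it
            try {
                Thread.sleep(5_000);
            } catch (InterruptedException ignored) {
                // interrupted by main once the demo is done
            }
            lock.unlock();
        });
        holder.start();
        Thread.sleep(100);                  // let the holder grab the lock first

        long start = System.nanoTime();
        boolean acquired = lock.tryLock(200, TimeUnit.MILLISECONDS);
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

        holder.interrupt();                 // clean up the holder thread
        // Did not acquire, and returned promptly rather than spinning forever.
        return !acquired && elapsedMs < 2_000;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("timed out cleanly: " + timesOutWhenHeldElsewhere());
    }
}
```

In the reported scenario, {{IgniteLock.tryLock(timeout)}} never returns at all, violating this contract.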
> *Actual behavior*
> The client thread enters a busy-wait loop inside {{GridCacheLockImpl}} and 
> never resumes.
> Ignite does not recover from this state.
> All other threads trying to acquire the same lock also become blocked, 
> leading to full system degradation. The client must be restarted.
> *Suspected root cause*
> Lock acquisition flow:
>  # Node A calls {{ignite.reentrantLock(..., failoverSafe=true).lock()}}.
>  # Node A commits a pessimistic transaction to acquire the lock.
>  # Node A enters a busy-wait loop waiting for an “ack” message.
>  # The ack is delivered via a continuous query update (observed as 
> {{TOPIC_CONTINUOUS}}).
>  # If Node B (responsible for sending this update) is stopped via 
> {{Ignition.stop(..., cancel=true)}} before sending the message, the ack is 
> never emitted.
>  # Node A remains in the busy-wait loop forever.
> The same issue may occur during {{unlock()}}.
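The hazard in steps 3–6 can be illustrated with a self-contained sketch (hypothetical names, pure JDK; not the actual {{GridCacheLockImpl}} code). An unbounded wait for an ack that the stopped node never sends spins forever, while a bounded wait returns control to the caller, which can then fail fast or re-attempt acquisition after observing the topology change:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class AckWaitDemo {
    // Unbounded wait, as in step 3 above: spins until the ack arrives.
    // If the ack sender is stopped before sending (step 5), this never
    // returns -- the reported hang.
    static void waitForAckUnbounded(CountDownLatch ack) throws InterruptedException {
        while (ack.getCount() > 0) {
            Thread.sleep(10);   // spin until the "ack" message is observed
        }
    }

    // Bounded variant: gives up after a deadline, so the caller can throw
    // or retry instead of blocking forever.
    static boolean waitForAckBounded(CountDownLatch ack, long timeoutMs)
            throws InterruptedException {
        return ack.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        // Simulate the failure: the node responsible for the ack was
        // stopped before emitting it, so the latch never counts down.
        CountDownLatch neverAcked = new CountDownLatch(1);
        boolean got = waitForAckBounded(neverAcked, 200);
        System.out.println("ack received: " + got);  // times out, no hang
    }
}
```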
> *Additional notes*
>  * {{failoverSafe=true}} does not prevent the issue.
>  * Happens with both {{ShutdownPolicy.IMMEDIATE}} and {{GRACEFUL}}.
>  * Cluster uses Kubernetes headless service for discovery.
> *Impact*
> Critical. Causes indefinite hang and complete degradation of lock-related 
> operations.
> *Source*
> Reported on the Ignite user mailing list, 24 Feb 2026 
> https://lists.apache.org/thread/tyz91fskkt9klmpyn1jn249myvpzt8l0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
