[ 
https://issues.apache.org/jira/browse/IGNITE-18448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653842#comment-17653842
 ] 

Sergey Uttsel commented on IGNITE-18448:
----------------------------------------

I ran the tests from https://github.com/apache/ignite-3/pull/1465 on the main 
branch. I encountered the stack trace from the description, and I see that 
there is no deadlock: in 'DistributionZoneManager#initMetaStorageKeysOnStart' 
the synchronous call 
'metaStorageManager.get(zonesLogicalTopologyVersionKey()).get()' fails with a 
TimeoutException, after which DistributionZoneManager#busyLock is released, and 
so on.
Also, in https://github.com/apache/ignite-3/pull/1426 I removed the synchronous 
'.get()' call on the future from the metastorage, which was the root cause of 
the issue.
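
For illustration only, here is a minimal sketch of the difference between 
waiting synchronously on a future and chaining a continuation onto it. The 
class and method names are hypothetical and do not mirror the Ignite code or 
the actual change in the pull request; the future stands in for the value 
returned by the metastorage.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch, not Ignite code. */
public class MetaStorageReadSketch {
    /**
     * Blocking style: the calling thread waits for the future (here with a bound).
     * Any lock held by the caller stays held for the whole wait.
     */
    public static long readBlocking(CompletableFuture<Long> versionFuture, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        return versionFuture.get(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /**
     * Non-blocking style: register a continuation and return immediately.
     * The caller can release its locks before the value arrives.
     */
    public static CompletableFuture<String> readAsync(CompletableFuture<Long> versionFuture) {
        return versionFuture.thenApply(version -> "initialized at version " + version);
    }
}
```

With the second style, the thread that starts the read never parks on the 
future, so a concurrent stop does not have to wait for a timeout to expire.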

> Deadlock on node stop.
> ----------------------
>
>                 Key: IGNITE-18448
>                 URL: https://issues.apache.org/jira/browse/IGNITE-18448
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Andrey Mashenkov
>            Priority: Major
>              Labels: ignite-3
>
> Two threads fall into a deadlock when trying to remove nodes from a collection.
> See the stack traces below.
> 1. ConcurrentHashMap.compute methods must not use blocking operations.
> 2. IgnitionImpl.doStart() adds a node to the _readyForInitNodes_ collection 
> under a race. 
> The call _nodeToStart.start(cfgContent)_ returns a future that may fail before 
> the _nodeToStart_ object is added to the collection.
> 3. If the future (_nodeToStart.start(cfgContent)_) fails, it is possible that 
> some components have started and hold resources that will never be released. 
> It seems that, in case of failure, _nodeToStart.stop()_ has to be called.
> {noformat}
> "%node1%Raft-Group-Client-12@21385" prio=5 tid=0x127e nid=NA waiting for monitor entry
>   java.lang.Thread.State: BLOCKED
>        waiting for Test worker@1 to release lock on <0x59df> (a java.util.concurrent.ConcurrentHashMap$Node)
>         at java.util.concurrent.ConcurrentHashMap.replaceNode(ConcurrentHashMap.java:1122)
>         at java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:1102)
>         at org.apache.ignite.internal.app.IgnitionImpl.handleStartException(IgnitionImpl.java:235)
> {noformat}
> {noformat}
> "Test worker@1" prio=5 tid=0x1 nid=NA sleeping
>   java.lang.Thread.State: TIMED_WAITING
>        blocks %node1%Raft-Group-Client-12@21385
>         at java.lang.Thread.sleep(Thread.java:-1)
>         at org.apache.ignite.internal.util.IgniteSpinReadWriteLock.writeLock(IgniteSpinReadWriteLock.java:255)
>         at org.apache.ignite.internal.util.IgniteSpinBusyLock.block(IgniteSpinBusyLock.java:68)
>         at org.apache.ignite.internal.distributionzones.DistributionZoneManager.stop(DistributionZoneManager.java:288)
>         at org.apache.ignite.internal.app.LifecycleManager.lambda$stopAllComponents$1(LifecycleManager.java:133)
>         at org.apache.ignite.internal.app.LifecycleManager$$Lambda$3032.1586776480.accept(Unknown Source:-1)
>         at java.util.Iterator.forEachRemaining(Iterator.java:133)
>         at org.apache.ignite.internal.app.LifecycleManager.stopAllComponents(LifecycleManager.java:131)
>         - locked <0x59de> (a org.apache.ignite.internal.app.LifecycleManager)
>         at org.apache.ignite.internal.app.LifecycleManager.stopNode(LifecycleManager.java:115)
>         at org.apache.ignite.internal.app.IgniteImpl.stop(IgniteImpl.java:642)
>         at org.apache.ignite.internal.app.IgnitionImpl.lambda$stop$0(IgnitionImpl.java:145)
>         at org.apache.ignite.internal.app.IgnitionImpl$$Lambda$3001.460355950.apply(Unknown Source:-1)
>         at java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1822)
>         - locked <0x59df> (a java.util.concurrent.ConcurrentHashMap$Node)
>         at org.apache.ignite.internal.app.IgnitionImpl.stop(IgnitionImpl.java:144)
>         at org.apache.ignite.IgnitionManager.stop(IgnitionManager.java:116)
> {noformat}
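
The first point of the description (no blocking work inside 
ConcurrentHashMap.compute methods) can be sketched as follows. This is a 
hypothetical illustration, not the IgnitionImpl code: NodeRegistrySketch, Node 
and the method names are assumptions made for the example. The remapping 
function of computeIfPresent runs while the map holds the lock on the key's 
bin, so a slow stop() inside it blocks every other thread that touches the 
same bin, as the second stack trace shows.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch, not Ignite code. */
public class NodeRegistrySketch {
    /** Stand-in for a node whose stop() may block for a long time. */
    public interface Node {
        void stop();
    }

    private final ConcurrentMap<String, Node> nodes = new ConcurrentHashMap<>();

    public void register(String name, Node node) {
        nodes.put(name, node);
    }

    public boolean contains(String name) {
        return nodes.containsKey(name);
    }

    /** Risky: stop() runs inside the remapping function, under the map's bin lock. */
    public void stopInsideCompute(String name) {
        nodes.computeIfPresent(name, (key, node) -> {
            node.stop();
            return null; // returning null removes the mapping
        });
    }

    /** Safer: remove the entry atomically first, then stop outside any map lock. */
    public void stopOutsideCompute(String name) {
        Node node = nodes.remove(name);
        if (node != null) {
            node.stop();
        }
    }
}
```

The trade-off of the second variant is that other threads may observe the node 
as absent while its stop() is still in progress, so callers that rely on the 
entry being present until the stop completes need additional coordination.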



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
