[ https://issues.apache.org/jira/browse/IGNITE-10511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vyacheslav Koptilin updated IGNITE-10511:
-----------------------------------------
    Description:
See attached thread dump:

disco-event-worker hangs in removeExplicitNodeLocks() on a GridCacheMapEntry lock that is held by GridDistributedTxRemoteAdapter, which acquired it in GridCacheMapEntry.innerSet().

CacheObjectBinaryProcessorImpl is waiting for a metadata message on discovery, which cannot be processed because disco-event-worker is stuck.

Possible fix:
{code:java}
public void onNodeLeft(final ClusterNode node) {
    if (isDone() || !enterBusy())
        return;

    cctx.mvcc().removeExplicitNodeLocks(node.id(), initialVersion());

    try {
        onDiscoveryEvent(new IgniteRunnable() {
            @Override public void run() {
                if (isDone() || !enterBusy())
                    return;

                ...
            }
        });
    }
    finally {
        ...
    }
}
{code}
As we can see, most of the processing is done asynchronously in the IgniteRunnable on the exchange-worker.

We can move
{code:java}
cctx.mvcc().removeExplicitNodeLocks(node.id(), initialVersion());
{code}
inside this Runnable's body.

  was:
See attached thread dump:

disco-event-worker hangs in removeExplicitNodeLocks() on a GridCacheMapEntry lock that is held by GridDistributedTxRemoteAdapter, which acquired it in GridCacheMapEntry.innerSet().

CacheObjectBinaryProcessorImpl is waiting for a metadata message on discovery, which cannot be processed because disco-event-worker is stuck.

Possible fix:
{quote}public void onNodeLeft(final ClusterNode node) {
    if (isDone() || !enterBusy())
        return;

    cctx.mvcc().removeExplicitNodeLocks(node.id(), initialVersion());

    try {
        onDiscoveryEvent(new IgniteRunnable() {
            @Override public void run() {
                if (isDone() || !enterBusy())
                    return;

                ...
{quote}
As we can see, most of the processing is done asynchronously in the IgniteRunnable on the exchange-worker.

We can move
{quote}cctx.mvcc().removeExplicitNodeLocks(node.id(), initialVersion());
{quote}
inside this Runnable's body.

> disco-event-worker can be deadlocked by BinaryContext.metadata running in sys
> striped pool waiting for cache entry lock
> ---------------------------------------------------------------------------
>
>                 Key: IGNITE-10511
>                 URL: https://issues.apache.org/jira/browse/IGNITE-10511
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Pavel Voronkin
>            Priority: Major
>         Attachments: race.txt
>
>
> See attached thread dump:
> disco-event-worker hangs in removeExplicitNodeLocks() on a GridCacheMapEntry
> lock that is held by GridDistributedTxRemoteAdapter, which acquired it in
> GridCacheMapEntry.innerSet().
> CacheObjectBinaryProcessorImpl is waiting for a metadata message on discovery,
> which cannot be processed because disco-event-worker is stuck.
> Possible fix:
> {code:java}
> public void onNodeLeft(final ClusterNode node) {
>     if (isDone() || !enterBusy())
>         return;
>
>     cctx.mvcc().removeExplicitNodeLocks(node.id(), initialVersion());
>
>     try {
>         onDiscoveryEvent(new IgniteRunnable() {
>             @Override public void run() {
>                 if (isDone() || !enterBusy())
>                     return;
>
>                 ...
>             }
>         });
>     }
>     finally {
>         ...
>     }
> }
> {code}
> As we can see, most of the processing is done asynchronously in the
> IgniteRunnable on the exchange-worker.
>
> We can move
> {code:java}
> cctx.mvcc().removeExplicitNodeLocks(node.id(), initialVersion());
> {code}
> inside this Runnable's body.
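
To illustrate the reasoning behind the proposed move, here is a minimal toy model of the hang. It is not Ignite code. The class DeferredLockCleanupSketch, the method cleanupCompletes() and the executors are hypothetical stand-ins: one single-threaded executor plays disco-event-worker, another plays exchange-worker, a ReentrantLock plays the GridCacheMapEntry lock, and a latch plays the metadata message delivered through discovery. The sketch only assumes that the discovery thread handles events one at a time.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Toy model of the reported hang. All names are hypothetical stand-ins, not Ignite APIs. */
class DeferredLockCleanupSketch {
    /**
     * @param deferCleanup true: lock cleanup runs on the "exchange" thread (proposed fix);
     *                     false: it runs inline on the "discovery" thread (reported behaviour).
     * @return true if the cleanup finished within the timeout, false if it stalled.
     */
    static boolean cleanupCompletes(boolean deferCleanup) {
        ExecutorService disco = Executors.newSingleThreadExecutor();    // plays disco-event-worker
        ExecutorService exchange = Executors.newSingleThreadExecutor(); // plays exchange-worker
        ReentrantLock entryLock = new ReentrantLock();                  // plays the cache entry lock
        CountDownLatch lockTaken = new CountDownLatch(1);
        CountDownLatch metadataArrived = new CountDownLatch(1);
        CountDownLatch cleanupDone = new CountDownLatch(1);

        // "Transaction" thread: holds the entry lock while waiting for a message
        // that only the discovery thread can deliver.
        Thread tx = new Thread(() -> {
            entryLock.lock();
            try {
                lockTaken.countDown();
                metadataArrived.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                entryLock.unlock();
            }
        });

        // Stand-in for removeExplicitNodeLocks(): needs the entry lock to proceed.
        Runnable cleanup = () -> {
            try {
                entryLock.lockInterruptibly();
                try {
                    cleanupDone.countDown();
                } finally {
                    entryLock.unlock();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        try {
            tx.start();
            lockTaken.await();

            // Discovery event 1: node left.
            disco.execute(() -> {
                if (deferCleanup)
                    exchange.execute(cleanup); // hand off, discovery thread stays free
                else
                    cleanup.run();             // blocks the discovery thread on the lock
            });

            // Discovery event 2: metadata message, queued behind event 1 on the same thread.
            disco.execute(metadataArrived::countDown);

            return cleanupDone.await(2, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            // Interrupt whatever is still blocked so the JVM can exit.
            disco.shutdownNow();
            exchange.shutdownNow();
            tx.interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("inline cleanup completes:   " + cleanupCompletes(false));
        System.out.println("deferred cleanup completes: " + cleanupCompletes(true));
    }
}
```

In this model the inline variant stalls until the timeout because the thread that would deliver the message is the one waiting for the lock, while the deferred variant lets the message through first. Whether the same reordering is safe in the real onNodeLeft() depends on Ignite internals that the sketch does not model.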