[ https://issues.apache.org/jira/browse/IGNITE-6256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155099#comment-16155099 ]

Andrew Mashenkov commented on IGNITE-6256:
------------------------------------------

It seems this bug was introduced by IGNITE-4779.

> When a node becomes segmented an AssertionError is thrown during 
> GridDhtPartitionTopologyImpl.removeNode
> --------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-6256
>                 URL: https://issues.apache.org/jira/browse/IGNITE-6256
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.8
>            Reporter: Alexandr Fedotov
>            Assignee: Andrew Mashenkov
>             Fix For: 2.3
>
>
> The assert is as follows:
> exception="java.lang.AssertionError: null
>  at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.removeNode(GridDhtPartitionTopologyImpl.java:1422)
>  at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.beforeExchange(GridDhtPartitionTopologyImpl.java:490)
>  at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:769)
>  at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:504)
>  at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:1689)
>  at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>  at java.lang.Thread.run(Thread.java:745)
> Below is the sequence of steps that leads to the assertion error:
> 1) A node becomes SEGMENTED when segmentation is detected by SegmentCheckWorker, after an EVT_NODE_FAILED has been received.
> 2) The node gets visibleRemoteNodes from its TcpDiscoveryNodesRing.
> 3) It clears the TcpDiscoveryNodesRing, leaving only the local node in the list. The ring is used to determine whether a node is alive during DiscoCache creation.
> 4) After that, the node initiates removal of all the nodes read in step 2.
> 5) For each node, it sends an EVT_NODE_FAILED to the corresponding DiscoverySpiListener, providing a topology that contains all the nodes except those already processed.
> 6) The event reaches GridDiscoveryManager.
> 7) The node is removed from the alive nodes of every DiscoCache in discoCacheHist.
> 8) A topology change is detected.
> 9) Creation of a new DiscoCache is attempted. At this point no remote node is available, because the TcpDiscoveryNodesRing has been cleared, resulting in a DiscoCache with empty alives.
> 10) The event with the created DiscoCache and the new topology version is passed to DiscoveryWorker.
> 11) The event is eventually handled by DiscoveryWorker and recorded via DiscoveryWorker#recordEvent.
> 12) The recording is handled by GridEventStorageManager, which notifies every listener for this event type (EVT_NODE_FAILED).
> 13) One of the listeners is GridCachePartitionExchangeManager#discoLsnr. It creates a new GridDhtPartitionsExchangeFuture with the empty DiscoCache received with the event and enqueues it.
> 14) The future is eventually picked up by the exchange worker and initialized.
> 15) updateTopologies is called; for each GridCacheContext it gets the cache's topology (GridDhtPartitionTopology) and calls GridDhtPartitionTopology#updateTopologyVersion.
> 16) The DiscoCache for GridDhtPartitionTopology is assigned from that of the GridDhtPartitionsExchangeFuture. The assigned DiscoCache has empty alives at this moment.
> 17) A distributed exchange is handled (GridDhtPartitionsExchangeFuture#distributedExchange).
> 18) For each GridCacheContext, GridDhtPartitionTopologyImpl#beforeExchange is called on its topology (GridDhtPartitionTopologyImpl).
> 19) It is determined that the node has left, and GridDhtPartitionTopologyImpl#removeNode is called to handle it.
> 20) An attempt is made to get the alive coordinator node by calling DiscoCache#oldestAliveServerNode.
> 21) null is returned, which results in the AssertionError; a minimal sketch of this failure mode follows the list.
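>
> For illustration, a minimal self-contained model of that failure mode (hypothetical names mirroring the internals above; not actual Ignite code — the real check lives in GridDhtPartitionTopologyImpl.removeNode):
>
>     import java.util.Collection;
>     import java.util.Collections;
>     import java.util.Comparator;
>
>     // Run with assertions enabled: java -ea AlivesSketch
>     public class AlivesSketch {
>         // Stand-in for a cluster node; 'order' mimics ClusterNode#order().
>         record Node(long order) {}
>
>         // Stand-in for DiscoCache#oldestAliveServerNode: returns null when
>         // there is no alive server node, exactly what steps 3 and 9 produce.
>         static Node oldestAliveServerNode(Collection<Node> alives) {
>             return alives.stream().min(Comparator.comparingLong(Node::order)).orElse(null);
>         }
>
>         public static void main(String[] args) {
>             Collection<Node> alives = Collections.emptyList(); // empty alives after segmentation
>
>             Node oldest = oldestAliveServerNode(alives);
>
>             // Mirrors the failing check in removeNode(): the coordinator is
>             // assumed to exist, so null here throws an AssertionError with a
>             // null message, matching the trace above.
>             assert oldest != null;
>         }
>     }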
> The fix should probably prevent initiating exchange futures once the local node has become segmented; a sketch of such a guard follows.
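>
> A sketch of the kind of guard that suggestion implies (again with hypothetical names, not the actual Ignite API): once the local node is marked segmented, discovery events stop producing exchange futures.
>
>     import java.util.Queue;
>     import java.util.concurrent.ConcurrentLinkedQueue;
>     import java.util.concurrent.atomic.AtomicBoolean;
>
>     public class SegmentationGuardSketch {
>         record DiscoveryEvent(int type) {}
>         record ExchangeFuture(DiscoveryEvent evt) {}
>
>         // Stand-in for org.apache.ignite.events.EventType#EVT_NODE_FAILED.
>         static final int EVT_NODE_FAILED = 12;
>
>         final AtomicBoolean segmented = new AtomicBoolean();
>         final Queue<ExchangeFuture> exchangeQueue = new ConcurrentLinkedQueue<>();
>
>         // Modeled on GridCachePartitionExchangeManager#discoLsnr.
>         void onDiscoveryEvent(DiscoveryEvent evt) {
>             if (segmented.get()) {
>                 // The local node is leaving the cluster anyway; initiating an
>                 // exchange against a DiscoCache with empty alives would only
>                 // trip the assertion in removeNode().
>                 return;
>             }
>             exchangeQueue.add(new ExchangeFuture(evt));
>         }
>
>         public static void main(String[] args) {
>             SegmentationGuardSketch mgr = new SegmentationGuardSketch();
>             mgr.segmented.set(true); // SegmentCheckWorker detected segmentation (step 1)
>             mgr.onDiscoveryEvent(new DiscoveryEvent(EVT_NODE_FAILED));
>             System.out.println("Queued exchanges after segmentation: " + mgr.exchangeQueue.size()); // prints 0
>         }
>     }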



