[ https://issues.apache.org/jira/browse/IGNITE-6256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156822#comment-16156822 ]
Andrew Mashenkov commented on IGNITE-6256:
------------------------------------------

TC tests look fine.

> When a node becomes segmented an AssertionError is thrown during GridDhtPartitionTopologyImpl.removeNode
> --------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-6256
>                 URL: https://issues.apache.org/jira/browse/IGNITE-6256
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.8
>            Reporter: Alexandr Fedotov
>            Assignee: Andrew Mashenkov
>             Fix For: 2.3
>
>
> The assert is as follows:
> exception="java.lang.AssertionError: null
> at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.removeNode(GridDhtPartitionTopologyImpl.java:1422)
> at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.beforeExchange(GridDhtPartitionTopologyImpl.java:490)
> at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:769)
> at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:504)
> at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:1689)
> at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
> at java.lang.Thread.run(Thread.java:745)
> Below is the sequence of steps that leads to the assertion error:
> 1) A node becomes SEGMENTED when it's determined by SegmentCheckWorker,
> after an EVT_NODE_FAILED has been received.
> 2) It gets visibleRemoteNodes from its TcpDiscoveryNodesRing
> 3) Clears the TcpDiscoveryNodesRing leaving only self on the list.
> The node ring is used to determine if a node is alive during DiscoCache
> creation
> 4) After that, the node initiates removal of all the nodes read in step 2
> 5) For each node, it sends an EVT_NODE_FAILED to the corresponding
> DiscoverySpiListener, providing a topology containing all the nodes except
> those already processed
> 6) This event gets into GridDiscoveryManager
> 7) The node gets removed from alive nodes for every DiscoCache in
> discoCacheHist
> 8) Topology change is detected
> 9) Creation of a new DiscoCache is attempted. At this moment no remote node
> is available because the TcpDiscoveryNodesRing has been cleared, which
> results in a DiscoCache with empty alives
> 10) The event with the created DiscoCache and the new topology version is
> passed to DiscoveryWorker
> 11) The event is eventually handled by DiscoveryWorker and is recorded by
> DiscoveryWorker#recordEvent
> 12) The recording is handled by GridEventStorageManager, which notifies
> every listener for this event type (EVT_NODE_FAILED)
> 13) One of the listeners is GridCachePartitionExchangeManager#discoLsnr.
> It creates a new GridDhtPartitionsExchangeFuture with the empty DiscoCache
> received with the event and enqueues it
> 14) The future eventually gets handled by GridDhtPartitionsExchangeFuture
> and initialized
> 15) updateTopologies is called, which for each GridCacheContext gets its
> topology (GridDhtPartitionTopology) and calls
> GridDhtPartitionTopology#updateTopologyVersion
> 16) DiscoCache for GridDhtPartitionTopology is assigned from that of the
> GridDhtPartitionsExchangeFuture.
> The assigned DiscoCache has empty alives at the moment
> 17) A distributed exchange is handled
> (GridDhtPartitionsExchangeFuture#distributedExchange)
> 18) For each cache context (GridCacheContext), for its topology
> (GridDhtPartitionTopologyImpl),
> GridDhtPartitionTopologyImpl#beforeExchange is called
> 19) The fact that the node has left is determined and
> GridDhtPartitionTopologyImpl#removeNode is called to handle it
> 20) An attempt is made to get the alive coordinator node by calling
> DiscoCache#oldestAliveServerNode
> 21) null is returned, which results in an AssertionError
> The fix should probably prevent initiating exchange futures if a node has
> become segmented.
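
For anyone following along, the failure mode in the quoted steps can be reproduced in isolation with a small model. This is only a sketch under stated assumptions: Node, Snapshot, removeNodeUnguarded and removeNodeGuarded are hypothetical names invented for illustration, not Ignite classes or methods, and the guard shows the general idea suggested in the description rather than the actual patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class SegmentedExchangeSketch {
    /** Hypothetical stand-in for a cluster node; 'order' models its join order. */
    static final class Node {
        final String id;
        final long order;

        Node(String id, long order) {
            this.id = id;
            this.order = order;
        }
    }

    /** Hypothetical stand-in for a discovery snapshot that tracks alive server nodes. */
    static final class Snapshot {
        private final List<Node> alive;

        Snapshot(List<Node> alive) {
            this.alive = new ArrayList<>(alive);
        }

        /** Returns the alive node with the lowest order, or null if no node is alive. */
        Node oldestAlive() {
            return alive.stream()
                .min(Comparator.comparingLong((Node n) -> n.order))
                .orElse(null);
        }
    }

    /**
     * Models the failing path: handling a departed node needs a coordinator.
     * An explicit throw is used instead of the 'assert' keyword so the
     * behaviour does not depend on the JVM being started with -ea.
     */
    static String removeNodeUnguarded(Snapshot snap) {
        Node oldest = snap.oldestAlive();
        if (oldest == null)
            throw new AssertionError("no alive coordinator in snapshot");
        return oldest.id;
    }

    /** Guarded variant (assumption): do no exchange work once the local node is segmented. */
    static Optional<String> removeNodeGuarded(Snapshot snap, boolean segmented) {
        if (segmented)
            return Optional.empty();
        return Optional.of(removeNodeUnguarded(snap));
    }

    public static void main(String[] args) {
        // After the node ring is cleared, the snapshot ends up with no alive nodes.
        Snapshot empty = new Snapshot(Collections.<Node>emptyList());
        System.out.println("guarded, segmented: " + removeNodeGuarded(empty, true));

        // A healthy snapshot still resolves a coordinator.
        Snapshot healthy = new Snapshot(Arrays.asList(new Node("b", 2), new Node("a", 1)));
        System.out.println("guarded, healthy: " + removeNodeGuarded(healthy, false));

        try {
            removeNodeUnguarded(empty);
        } catch (AssertionError e) {
            System.out.println("unguarded, segmented: " + e.getMessage());
        }
    }
}
```

The point of the model is that the null comes from the snapshot being built after the ring was cleared, so checking for segmentation before any exchange work starts avoids the lookup entirely.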