[ https://issues.apache.org/jira/browse/IGNITE-20053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexander Lapin updated IGNITE-20053:
-------------------------------------
    Description: 
There is a meta storage key called DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY. It is refreshed by the topology listener on topology events and stores the logical topology. If the value stored under this key is null, the data nodes engine returns empty data nodes when calculating data nodes for a distribution zone. As a result, empty assignments are calculated for partitions, which leads to the exception described in IGNITE-19466.

Some integration tests, for example ItRebalanceDistributedTest, are flaky because of possible problems with the value of DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY and the empty data nodes calculated by the data nodes engine.

Actually, an empty data nodes collection is a wrong result in this case because the current logical topology is not empty.

h3. UPD #1

*1.* The reason for the empty data nodes assertion is a race between join completion (and thus the firing of logical topology updates) and DZM start. Literally, it's required to put
{code:java}
nodes.stream().forEach(Node::waitWatches);
{code}
before
{code:java}
assertThat(
        allOf(nodes.get(0).cmgManager.onJoinReady(), nodes.get(1).cmgManager.onJoinReady(), nodes.get(2).cmgManager.onJoinReady()),
        willCompleteSuccessfully()
);
{code}
in org.apache.ignite.internal.configuration.storage.ItRebalanceDistributedTest#before.

*2.* However, that's not the whole story. We also faced
{code:java}
Unable to initialize the cluster: null{code}
because CMG init failed with a TimeoutException: we start CMGManager asynchronously, which is incorrect. Moving cmgManager to firstComponents solves the issue.
{code:java}
List<IgniteComponent> firstComponents = List.of(
        vaultManager,
        nodeCfgMgr,
        clusterService,
        raftManager,
        cmgManager
);
{code}

*3.* Still, it's not the whole story.
testTwoQueuedRebalances failed because we didn't retrieve the expected stable assignments after table creation:
{code:java}
await(nodes.get(0).tableManager.createTableAsync(
        "TBL1",
        ZONE_1_NAME,
        tblChanger -> SchemaConfigurationConverter.convert(schTbl1, tblChanger)
));

assertEquals(1, getPartitionClusterNodes(0, 0).size());{code}
The reason is that assignments calculation is an async process, so there is no guarantee that we will retrieve proper assignments right after table creation completes. So we might substitute
{code:java}
assertEquals(1, getPartitionClusterNodes(0, 0).size());{code}
with
{code:java}
assertTrue(waitForCondition(() -> getPartitionClusterNodes(0, 0).size() == 1, 1_000));{code}

  was:
There is a meta storage key called DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY. It is refreshed by the topology listener on topology events and stores the logical topology. If the value stored under this key is null, the data nodes engine returns empty data nodes when calculating data nodes for a distribution zone. As a result, empty assignments are calculated for partitions, which leads to the exception described in IGNITE-19466.

Some integration tests, for example ItRebalanceDistributedTest, are flaky because of possible problems with the value of DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY and the empty data nodes calculated by the data nodes engine.

Actually, an empty data nodes collection is a wrong result in this case because the current logical topology is not empty.

h3. UPD #1

The reason for the empty data nodes assertion is a race between join completion (and thus the firing of logical topology updates) and DZM start.
Literally, it's required to put
{code:java}
nodes.stream().forEach(Node::waitWatches);
{code}
before
{code:java}
assertThat(
        allOf(nodes.get(0).cmgManager.onJoinReady(), nodes.get(1).cmgManager.onJoinReady(), nodes.get(2).cmgManager.onJoinReady()),
        willCompleteSuccessfully()
);
{code}
in org.apache.ignite.internal.configuration.storage.ItRebalanceDistributedTest#before.

However, that's not the whole story. We also faced
{code:java}
Unable to initialize the cluster: null{code}
because CMG init failed with a TimeoutException: we start CMGManager asynchronously, which is incorrect. Moving cmgManager to firstComponents solves the issue.
{code:java}
List<IgniteComponent> firstComponents = List.of(
        vaultManager,
        nodeCfgMgr,
        clusterService,
        raftManager,
        cmgManager
);
{code}


> Empty data nodes are returned by data nodes engine
> --------------------------------------------------
>
>                 Key: IGNITE-20053
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20053
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Chudov
>            Assignee: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>
> There is a meta storage key called DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY.
> It is refreshed by the topology listener on topology events and stores the
> logical topology. If the value stored under this key is null, the data nodes
> engine returns empty data nodes when calculating data nodes for a
> distribution zone. As a result, empty assignments are calculated for
> partitions, which leads to the exception described in IGNITE-19466.
> Some integration tests, for example ItRebalanceDistributedTest, are flaky
> because of possible problems with the value of
> DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY and the empty data nodes calculated
> by the data nodes engine.
> Actually, an empty data nodes collection is a wrong result in this case
> because the current logical topology is not empty.
> h3. UPD #1
> *1.* The reason for the empty data nodes assertion is a race between join
> completion (and thus the firing of logical topology updates) and DZM start.
> Literally, it's required to put
> {code:java}
> nodes.stream().forEach(Node::waitWatches);
> {code}
> before
> {code:java}
> assertThat(
>         allOf(nodes.get(0).cmgManager.onJoinReady(), nodes.get(1).cmgManager.onJoinReady(), nodes.get(2).cmgManager.onJoinReady()),
>         willCompleteSuccessfully()
> );
> {code}
> in
> org.apache.ignite.internal.configuration.storage.ItRebalanceDistributedTest#before.
>
> *2.* However, that's not the whole story. We also faced
> {code:java}
> Unable to initialize the cluster: null{code}
> because CMG init failed with a TimeoutException: we start CMGManager
> asynchronously, which is incorrect. Moving cmgManager to firstComponents
> solves the issue.
> {code:java}
> List<IgniteComponent> firstComponents = List.of(
>         vaultManager,
>         nodeCfgMgr,
>         clusterService,
>         raftManager,
>         cmgManager
> );
> {code}
>
> *3.* Still, it's not the whole story. testTwoQueuedRebalances failed because
> we didn't retrieve the expected stable assignments after table creation:
> {code:java}
> await(nodes.get(0).tableManager.createTableAsync(
>         "TBL1",
>         ZONE_1_NAME,
>         tblChanger -> SchemaConfigurationConverter.convert(schTbl1, tblChanger)
> ));
>
> assertEquals(1, getPartitionClusterNodes(0, 0).size());{code}
> The reason is that assignments calculation is an async process, so there is
> no guarantee that we will retrieve proper assignments right after table
> creation completes. So we might substitute
> {code:java}
> assertEquals(1, getPartitionClusterNodes(0, 0).size());{code}
> with
> {code:java}
> assertTrue(waitForCondition(() -> getPartitionClusterNodes(0, 0).size() == 1, 1_000));{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
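
Note on item 3 of the description: the proposed substitution replaces a one-shot assertion with polling until a timeout. The following is a minimal, self-contained sketch of that idea, not the Ignite test-framework code. The class name, the 10 ms polling interval, and the simulated "assignments" counter are all hypothetical; only the waitForCondition(condition, timeoutMillis) call shape is taken from the description.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for the helper used in the description; not Ignite code. */
class WaitForConditionSketch {
    /**
     * Polls cond until it returns true or timeoutMillis elapses.
     * Returns the outcome instead of throwing, so it can be wrapped in assertTrue.
     */
    static boolean waitForCondition(BooleanSupplier cond, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);

        while (!cond.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // Timed out: the condition never became true.
            }

            try {
                Thread.sleep(10); // Assumed polling interval.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }

        return true;
    }

    public static void main(String[] args) {
        // Simulates asynchronously calculated assignments: the size becomes 1 only after a delay.
        AtomicInteger assignmentsSize = new AtomicInteger(0);

        Thread calculation = new Thread(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            assignmentsSize.set(1);
        });
        calculation.start();

        // A one-shot check here races with the background thread.
        System.out.println("immediate check: " + (assignmentsSize.get() == 1));

        // Polling tolerates the delay as long as it fits into the timeout.
        System.out.println("polled check: " + waitForCondition(() -> assignmentsSize.get() == 1, 1_000));
    }
}
```

The trade-off of this approach is that a genuinely broken assignment calculation is reported only after the full timeout, so the timeout should stay small relative to the test's overall budget.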