[ https://issues.apache.org/jira/browse/IGNITE-20053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17753243#comment-17753243 ]
Kirill Gusakov commented on IGNITE-20053: ----------------------------------------- LGTM > Empty data nodes are returned by data nodes engine > -------------------------------------------------- > > Key: IGNITE-20053 > URL: https://issues.apache.org/jira/browse/IGNITE-20053 > Project: Ignite > Issue Type: Bug > Reporter: Denis Chudov > Assignee: Alexander Lapin > Priority: Major > Labels: ignite-3 > Time Spent: 10m > Remaining Estimate: 0h > > There is a meta storage key called DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY > and it is refreshed by topology listener on topology events and stores > logical topology. If the value stored by this key is null, then empty data > nodes are returned from data nodes engine on data nodes calculation for a > distribution zone. As a result, empty assignments are calculated for > partitions, which leads to exception described in IGNITE-19466. > Some integration tests, for example, ItRebalanceDistributedTest are flaky > because of possible problems with value of > DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY and empty data nodes calculated by > data nodes engine. > Actually, the empty data nodes collection is a wrong result for this case > because the current logical topology is not empty. > h3. UPD #1 > *1.* The reason for empty data nodes assertion is race between join > completion and thus firing logical topology updates and DZM start. Literally, > it's required to put > {code:java} > nodes.stream().forEach(Node::waitWatches); {code} > before > {code:java} > assertThat( > allOf(nodes.get(0).cmgManager.onJoinReady(), > nodes.get(1).cmgManager.onJoinReady(), nodes.get(2).cmgManager.onJoinReady()), > willCompleteSuccessfully() > ); {code} > in > org.apache.ignite.internal.configuration.storage.ItRebalanceDistributedTest#before. > > *2.* However, that's not the whole story. We also faced > {code:java} > Unable to initialize the cluster: null{code} > because cmg init failed with TimeoutException because we start CMGManager > asynchronously, which is incorrect. So if we move cmgManager to > firstComponents that will solve the issue. > {code:java} > List<IgniteComponent> firstComponents = List.of( > vaultManager, > nodeCfgMgr, > clusterService, > raftManager, > cmgManager > ); {code} > > *3.* Still it's not the whole story. testTwoQueuedRebalances failed because > we didn't retrieved an expected stable assignments after table creation > {code:java} > await(nodes.get(0).tableManager.createTableAsync( > "TBL1", > ZONE_1_NAME, > tblChanger -> SchemaConfigurationConverter.convert(schTbl1, > tblChanger) > )); > assertEquals(1, getPartitionClusterNodes(0, 0).size());{code} > The reason for that is that assignments calculation is an async process, so > there are no guarantees that we will retrieve proper assignments right after > table creation completes. So we might substitute > {code:java} > assertEquals(1, getPartitionClusterNodes(0, 0).size());{code} > with > {code:java} > assertTrue(waitForCondition(() -> getPartitionClusterNodes(0, 0).size() == 1, > 10_000));{code} > Please pay attention that there are multiple places where we retrieve > assignments and expect them to be ready. -- This message was sent by Atlassian Jira (v8.20.10#820010)