[ 
https://issues.apache.org/jira/browse/IGNITE-20053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17753243#comment-17753243
 ] 

Kirill Gusakov commented on IGNITE-20053:
-----------------------------------------

LGTM

> Empty data nodes are returned by data nodes engine
> --------------------------------------------------
>
>                 Key: IGNITE-20053
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20053
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Chudov
>            Assignee: Alexander Lapin
>            Priority: Major
>              Labels: ignite-3
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The meta storage key DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY stores the 
> logical topology and is refreshed by the topology listener on topology 
> events. If the value stored under this key is null, the data nodes engine 
> returns empty data nodes when calculating data nodes for a distribution 
> zone. As a result, empty assignments are calculated for partitions, which 
> leads to the exception described in IGNITE-19466.
> Some integration tests, for example ItRebalanceDistributedTest, are flaky 
> because the value of DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY may be missing 
> (null), in which case the data nodes engine calculates empty data nodes.
> An empty data nodes collection is the wrong result in this case 
> because the current logical topology is not empty.
> h3. UPD #1
> *1.* The empty data nodes assertion is caused by a race between join 
> completion, which fires logical topology updates, and DZM start. Concretely, 
> it is necessary to put 
> {code:java}
> nodes.stream().forEach(Node::waitWatches); {code}
> before
> {code:java}
> assertThat(
>         allOf(
>                 nodes.get(0).cmgManager.onJoinReady(),
>                 nodes.get(1).cmgManager.onJoinReady(),
>                 nodes.get(2).cmgManager.onJoinReady()
>         ),
>         willCompleteSuccessfully()
> ); {code}
> in 
> org.apache.ignite.internal.configuration.storage.ItRebalanceDistributedTest#before.
>  
> *2.* However, that's not the whole story. We also faced 
> {code:java}
> Unable to initialize the cluster: null{code}
> because CMG init failed with a TimeoutException: we start CMGManager 
> asynchronously, which is incorrect. Moving cmgManager to firstComponents 
> solves the issue.
> {code:java}
> List<IgniteComponent> firstComponents = List.of(
>         vaultManager,
>         nodeCfgMgr,
>         clusterService,
>         raftManager,
>         cmgManager
> ); {code}
>  
> *3.* That is still not the whole story. testTwoQueuedRebalances failed because 
> we did not retrieve the expected stable assignments after table creation:
> {code:java}
> await(nodes.get(0).tableManager.createTableAsync(
>         "TBL1",
>         ZONE_1_NAME,
>         tblChanger -> SchemaConfigurationConverter.convert(schTbl1, tblChanger)
> ));
> assertEquals(1, getPartitionClusterNodes(0, 0).size());{code}
> The reason is that assignments calculation is asynchronous, so there is no 
> guarantee that the proper assignments are available right after table 
> creation completes. So we can replace
> {code:java}
> assertEquals(1, getPartitionClusterNodes(0, 0).size());{code}
> with
> {code:java}
> assertTrue(waitForCondition(() -> getPartitionClusterNodes(0, 0).size() == 1, 10_000));{code}
> Please note that there are multiple places where we retrieve 
> assignments and expect them to be ready.
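
For readers unfamiliar with the polling pattern suggested in item 3, here is a minimal sketch of what a waitForCondition-style helper might look like. It is an assumption for illustration only: the class name PollingSketch, the 10 ms poll interval, and the interrupt handling are hypothetical and are not taken from the Ignite test utilities, whose actual implementation is not shown in this issue.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for a waitForCondition test helper; not the Ignite implementation. */
public final class PollingSketch {
    /** Assumed poll interval; the real helper may use a different value. */
    private static final long POLL_INTERVAL_MILLIS = 10;

    private PollingSketch() {
    }

    /**
     * Repeatedly evaluates the condition until it holds or the timeout elapses.
     *
     * @return true if the condition became true in time, false on timeout or interruption.
     */
    public static boolean waitForCondition(BooleanSupplier cond, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;

        while (true) {
            if (cond.getAsBoolean()) {
                return true;
            }

            // Overflow-safe comparison of nanoTime values.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }

            try {
                Thread.sleep(POLL_INTERVAL_MILLIS);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

With such a helper, an assertion on an asynchronously computed value waits up to the timeout instead of failing on the first read, which is the behavior item 3 relies on.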



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
