[ 
https://issues.apache.org/jira/browse/IGNITE-20053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-20053:
-------------------------------------
    Description: 
There is a meta storage key called DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY and 
it is refreshed by topology listener on topology events and stores logical 
topology. If the value stored by this key is null, then empty data nodes are 
returned from data nodes engine on data nodes calculation for a distribution 
zone. As a result, empty assignments are calculated for partitions, which leads 
to exception described in IGNITE-19466.

Some integration tests, for example, ItRebalanceDistributedTest are flaky 
because of possible problems with value of 
DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY and empty data nodes calculated by data 
nodes engine.

Actually, the empty data nodes collection is a wrong result for this case 
because the current logical topology is not empty.
h3. UPD #1

*1.* The reason for empty data nodes assertion is race between join completion 
and thus firing logical topology updates and DZM start. Literally, it's 
required to put 
{code:java}
nodes.stream().forEach(Node::waitWatches); {code}
before
{code:java}
assertThat(
        allOf(nodes.get(0).cmgManager.onJoinReady(), 
nodes.get(1).cmgManager.onJoinReady(), nodes.get(2).cmgManager.onJoinReady()),
        willCompleteSuccessfully()
); {code}
in 
org.apache.ignite.internal.configuration.storage.ItRebalanceDistributedTest#before.

 

*2.* However, that's not the whole story. We also faced 
{code:java}
Unable to initialize the cluster: null{code}
because cmg init failed with TimeoutException because we start CMGManager 
asynchronously, which is incorrect. So if we move cmgManager to firstComponents 
that will solve the issue.
{code:java}
List<IgniteComponent> firstComponents = List.of(
        vaultManager,
        nodeCfgMgr,
        clusterService,
        raftManager,
        cmgManager
); {code}
 

*3.* Still it's not the whole story. testTwoQueuedRebalances failed because we 
didn't retrieved an expected stable assignments after table creation
{code:java}
await(nodes.get(0).tableManager.createTableAsync(
        "TBL1",
        ZONE_1_NAME,
        tblChanger -> SchemaConfigurationConverter.convert(schTbl1, tblChanger)
));

assertEquals(1, getPartitionClusterNodes(0, 0).size());{code}
The reason for that is that assignments calculation is an async process, so 
there are no guarantees that we will retrieve proper assignments right after 
table creation completes. So we might substitute
{code:java}
assertEquals(1, getPartitionClusterNodes(0, 0).size());{code}
with
{code:java}
assertTrue(waitForCondition(() -> getPartitionClusterNodes(0, 0).size() == 1, 
1_000));{code}

  was:
There is a meta storage key called DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY and 
it is refreshed by topology listener on topology events and stores logical 
topology. If the value stored by this key is null, then empty data nodes are 
returned from data nodes engine on data nodes calculation for a distribution 
zone. As a result, empty assignments are calculated for partitions, which leads 
to exception described in IGNITE-19466.

Some integration tests, for example, ItRebalanceDistributedTest are flaky 
because of possible problems with value of 
DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY and empty data nodes calculated by data 
nodes engine.

Actually, the empty data nodes collection is a wrong result for this case 
because the current logical topology is not empty.
h3. UPD #1

The reason for empty data nodes assertion is race between join completion and 
thus firing logical topology updates and DZM start. Literally, it's required to 
put 
{code:java}
nodes.stream().forEach(Node::waitWatches); {code}
before
{code:java}
assertThat(
        allOf(nodes.get(0).cmgManager.onJoinReady(), 
nodes.get(1).cmgManager.onJoinReady(), nodes.get(2).cmgManager.onJoinReady()),
        willCompleteSuccessfully()
); {code}
in 
org.apache.ignite.internal.configuration.storage.ItRebalanceDistributedTest#before.

 

However, that's not the whole story. We also faced 
{code:java}
Unable to initialize the cluster: null{code}
because cmg init failed with TimeoutException because we start CMGManager 
asynchronously, which is incorrect. So if we move cmgManager to firstComponents 
that will solve the issue.
{code:java}
List<IgniteComponent> firstComponents = List.of(
        vaultManager,
        nodeCfgMgr,
        clusterService,
        raftManager,
        cmgManager
); {code}


> Empty data nodes are returned by data nodes engine
> --------------------------------------------------
>
>                 Key: IGNITE-20053
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20053
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Chudov
>            Assignee: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>
> There is a meta storage key called DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY 
> and it is refreshed by topology listener on topology events and stores 
> logical topology. If the value stored by this key is null, then empty data 
> nodes are returned from data nodes engine on data nodes calculation for a 
> distribution zone. As a result, empty assignments are calculated for 
> partitions, which leads to exception described in IGNITE-19466.
> Some integration tests, for example, ItRebalanceDistributedTest are flaky 
> because of possible problems with value of 
> DISTRIBUTION_ZONES_LOGICAL_TOPOLOGY_KEY and empty data nodes calculated by 
> data nodes engine.
> Actually, the empty data nodes collection is a wrong result for this case 
> because the current logical topology is not empty.
> h3. UPD #1
> *1.* The reason for empty data nodes assertion is race between join 
> completion and thus firing logical topology updates and DZM start. Literally, 
> it's required to put 
> {code:java}
> nodes.stream().forEach(Node::waitWatches); {code}
> before
> {code:java}
> assertThat(
>         allOf(nodes.get(0).cmgManager.onJoinReady(), 
> nodes.get(1).cmgManager.onJoinReady(), nodes.get(2).cmgManager.onJoinReady()),
>         willCompleteSuccessfully()
> ); {code}
> in 
> org.apache.ignite.internal.configuration.storage.ItRebalanceDistributedTest#before.
>  
> *2.* However, that's not the whole story. We also faced 
> {code:java}
> Unable to initialize the cluster: null{code}
> because cmg init failed with TimeoutException because we start CMGManager 
> asynchronously, which is incorrect. So if we move cmgManager to 
> firstComponents that will solve the issue.
> {code:java}
> List<IgniteComponent> firstComponents = List.of(
>         vaultManager,
>         nodeCfgMgr,
>         clusterService,
>         raftManager,
>         cmgManager
> ); {code}
>  
> *3.* Still it's not the whole story. testTwoQueuedRebalances failed because 
> we didn't retrieved an expected stable assignments after table creation
> {code:java}
> await(nodes.get(0).tableManager.createTableAsync(
>         "TBL1",
>         ZONE_1_NAME,
>         tblChanger -> SchemaConfigurationConverter.convert(schTbl1, 
> tblChanger)
> ));
> assertEquals(1, getPartitionClusterNodes(0, 0).size());{code}
> The reason for that is that assignments calculation is an async process, so 
> there are no guarantees that we will retrieve proper assignments right after 
> table creation completes. So we might substitute
> {code:java}
> assertEquals(1, getPartitionClusterNodes(0, 0).size());{code}
> with
> {code:java}
> assertTrue(waitForCondition(() -> getPartitionClusterNodes(0, 0).size() == 1, 
> 1_000));{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to