Ignite TC Bot created IGNITE-28684:
--------------------------------------

             Summary: MultiDataCenterRingTest.testRing can fail when random 
cluster has one surviving data center
                 Key: IGNITE-28684
                 URL: https://issues.apache.org/jira/browse/IGNITE-28684
             Project: Ignite
          Issue Type: Bug
            Reporter: Ignite TC Bot


h3. Failure

TeamCity SPI (Discovery) failure on master-like code path:
* Test: {code}org.apache.ignite.spi.discovery.tcp.MultiDataCenterRingTest.testRing{code}
* Suite: {code}org.apache.ignite.testsuites.IgniteSpiDiscoverySelfTestSuite{code}
* Build: https://ci2.ignite.apache.org/viewLog.html?buildId=9060064
* Error: {code}java.lang.AssertionError: expected:<2> but was:<0>{code}

h3. Likely root cause

{code}MultiDataCenterRingTest.generateRandomDcOrderCluster(int cnt){code} 
assigns every node to {code}DC0{code} or {code}DC1{code} using 
{code}ThreadLocalRandom{code}. The test then stops node {code}cnt - 1{code} and 
node {code}0{code}, and expects {code}checkHops(2){code} to remain true.

This is not guaranteed. If every server node that survives those stops is 
assigned to the same data center, the ring that 
{code}TcpDiscoveryNodesRing.nextNode(){code} walks, ordered by 
{code}MdcAwareNodesComparator{code}, has no cross-DC boundary, so 
{code}checkHops(2){code} counts {code}0{code} hops. That matches the TeamCity 
error {code}expected:<2> but was:<0>{code}.
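
To illustrate the reasoning above, here is a minimal standalone sketch of how the hop count could collapse to zero. It does not use Ignite classes: {code}CrossDcHopsSketch{code} and {code}crossDcHops{code} are hypothetical names, and a plain sort by DC id stands in for {code}MdcAwareNodesComparator{code}, on the assumption that the comparator groups nodes of the same DC together and that a "hop" is a ring edge between two different DCs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Ignite.
class CrossDcHopsSketch {
    /** Counts ring edges whose two endpoints belong to different DCs. */
    static int crossDcHops(List<String> dcIds) {
        if (dcIds.size() < 2)
            return 0;

        // Assumption: DC-aware ordering places nodes of one DC next to each other.
        List<String> ring = new ArrayList<>(dcIds);
        Collections.sort(ring);

        int hops = 0;

        for (int i = 0; i < ring.size(); i++) {
            String cur = ring.get(i);
            String next = ring.get((i + 1) % ring.size()); // Wrap around to close the ring.

            if (!cur.equals(next))
                hops++;
        }

        return hops;
    }

    public static void main(String[] args) {
        // Both DCs survive: one boundary DC0 -> DC1 and one DC1 -> DC0.
        System.out.println(crossDcHops(Arrays.asList("DC0", "DC1", "DC0", "DC1")));

        // Only one DC survives: no boundary at all.
        System.out.println(crossDcHops(Arrays.asList("DC1", "DC1", "DC1")));
    }
}
```

Under these assumptions, a two-DC ring always yields 2 and a single-DC ring yields 0. If each node's DC is an independent fair coin flip (also an assumption), k surviving nodes all share one DC with probability 2^(1-k), so small clusters would hit this case fairly often.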

h3. Minimal fix

Make the test topology deterministic enough to preserve both DCs among nodes 
that survive the explicit stops. For example, force two surviving node indexes, 
such as {code}1{code} and {code}2{code}, into different DCs, and keep the 
existing random assignment for the remaining nodes:

{code:java}
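String dcId;

// Pin two nodes that survive the explicit stops to different DCs.
if (i == 1)
    dcId = DC_ID_0;
else if (i == 2)
    dcId = DC_ID_1;
else
    // Keep the random assignment for every other node.
    dcId = rnd.nextBoolean() ? DC_ID_0 : DC_ID_1;

System.setProperty(IgniteSystemProperties.IGNITE_DATA_CENTER_ID, dcId);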
{code}

This keeps the random join order coverage but removes the invalid all-one-DC 
outcome for the second assertion.
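
The effect of pinning indexes 1 and 2 can be checked in isolation with a small sketch. The class and method names below ({code}PinnedDcAssignmentSketch{code}, {code}dcFor{code}, {code}survivingDcs{code}) are hypothetical, the constant values are assumed to be the {code}DC0{code} and {code}DC1{code} ids mentioned above, and a seeded {code}java.util.Random{code} replaces {code}ThreadLocalRandom{code} only to make runs repeatable.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical helper, not part of Ignite.
class PinnedDcAssignmentSketch {
    static final String DC_ID_0 = "DC0";
    static final String DC_ID_1 = "DC1";

    /** Same branching as the proposed fix: indexes 1 and 2 are pinned, the rest are random. */
    static String dcFor(int i, Random rnd) {
        if (i == 1)
            return DC_ID_0;

        if (i == 2)
            return DC_ID_1;

        return rnd.nextBoolean() ? DC_ID_0 : DC_ID_1;
    }

    /** Returns the DC ids still present after node 0 and node cnt - 1 are stopped. */
    static Set<String> survivingDcs(int cnt, long seed) {
        Random rnd = new Random(seed);
        Set<String> dcs = new HashSet<>();

        for (int i = 0; i < cnt; i++) {
            String dc = dcFor(i, rnd);

            if (i != 0 && i != cnt - 1)
                dcs.add(dc);
        }

        return dcs;
    }

    public static void main(String[] args) {
        for (long seed = 0; seed < 1000; seed++) {
            if (survivingDcs(4, seed).size() != 2)
                throw new AssertionError("Single-DC topology for seed " + seed);
        }

        System.out.println("Both DCs survive for cnt = 4 across 1000 seeds");
    }
}
```

One caveat the sketch makes visible: the pinning only helps when both pinned indexes survive, which requires {code}cnt >= 4{code}. With {code}cnt = 3{code}, node {code}cnt - 1{code} is node {code}2{code}, so only node {code}1{code} remains and a single DC is left again.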

h3. Files to inspect

* {code}modules/core/src/test/java/org/apache/ignite/spi/discovery/tcp/MultiDataCenterRingTest.java{code}
* {code}modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/internal/TcpDiscoveryNodesRing.java{code}
* {code}modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/internal/MdcAwareNodesComparator.java{code}

h3. Retry

Retry is justified as a short-term mitigation because the failure depends on 
random DC assignment. It does not fix the test defect.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
