[ 
https://issues.apache.org/jira/browse/IGNITE-28684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Pavlov updated IGNITE-28684:
-----------------------------------
    Labels: MakeTeamcityGreenAgain ise  (was: MakeTeamcityGreenAgain)

> MultiDataCenterRingTest.testRing can fail when random cluster has one 
> surviving data center
> -------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-28684
>                 URL: https://issues.apache.org/jira/browse/IGNITE-28684
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Ignite TC Bot
>            Priority: Major
>              Labels: MakeTeamcityGreenAgain, ise
>
> h3. Failure
> TeamCity SPI (Discovery) failure on master-like code path:
> * Test: 
> {code}org.apache.ignite.spi.discovery.tcp.MultiDataCenterRingTest.testRing{code}
> * Suite: 
> {code}org.apache.ignite.testsuites.IgniteSpiDiscoverySelfTestSuite{code}
> * Build: https://ci2.ignite.apache.org/viewLog.html?buildId=9060064
> * Error: {code}java.lang.AssertionError: expected:<2> but was:<0>{code}
> h3. Likely root cause
> {code}MultiDataCenterRingTest.generateRandomDcOrderCluster(int cnt){code} 
> assigns every node to {code}DC0{code} or {code}DC1{code} using 
> {code}ThreadLocalRandom{code}. The test then stops node {code}cnt - 1{code} 
> and node {code}0{code}, and expects {code}checkHops(2){code} to still pass.
> This is not guaranteed. If all server nodes that survive those stops are 
> assigned to the same data center, the ring traversed by 
> {code}TcpDiscoveryNodesRing.nextNode(){code} and ordered by 
> {code}MdcAwareNodesComparator{code} has no cross-DC boundary, so 
> {code}checkHops(2){code} counts {code}0{code} hops. That exactly matches the 
> TeamCity error {code}expected:<2> but was:<0>{code}.
> h3. Minimal fix
> Make the test topology deterministic enough to preserve both DCs among nodes 
> that survive the explicit stops. For example, force two surviving node 
> indexes, such as {code}1{code} and {code}2{code}, into different DCs, and 
> keep the existing random assignment for the remaining nodes:
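> {code:java}
> // Pin nodes 1 and 2 to different DCs; neither is node 0 or node cnt - 1
> // (for cnt >= 4), so both DCs remain in the ring after the stops.
> String dcId;
> if (i == 1)
>     dcId = DC_ID_0;
> else if (i == 2)
>     dcId = DC_ID_1;
> else
>     dcId = rnd.nextBoolean() ? DC_ID_0 : DC_ID_1; // Other nodes stay random.
> System.setProperty(IgniteSystemProperties.IGNITE_DATA_CENTER_ID, dcId);
> {code}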
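> This keeps the random join order coverage but removes the invalid all-one-DC 
> outcome for the second assertion. It assumes {code}cnt{code} is at least 4; 
> otherwise index {code}2{code} equals {code}cnt - 1{code} and is stopped too.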
> h3. Files to inspect
> * 
> {code}modules/core/src/test/java/org/apache/ignite/spi/discovery/tcp/MultiDataCenterRingTest.java{code}
> * 
> {code}modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/internal/TcpDiscoveryNodesRing.java{code}
> * 
> {code}modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/internal/MdcAwareNodesComparator.java{code}
> h3. Retry
> Retry is justified as a short-term mitigation because the failure depends on 
> random DC assignment. It does not fix the test defect.
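
The root cause and fix quoted above can be illustrated with a small
stand-alone model. This is only a sketch under stated assumptions: the class
DcRingSketch and its methods are hypothetical and are not Ignite code, plain
string sorting stands in for the DC-aware ordering, and a "hop" is assumed to
mean a ring edge whose two endpoints sit in different data centers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical model of the failure; none of these names exist in Ignite. */
class DcRingSketch {
    /** Counts ring edges that cross a DC boundary once nodes are grouped by DC. */
    static int countCrossDcHops(List<String> dcIds) {
        List<String> ring = new ArrayList<>(dcIds);

        // Stand-in for DC-aware ordering: nodes of the same DC become neighbors.
        Collections.sort(ring);

        int hops = 0;

        for (int i = 0; i < ring.size(); i++) {
            String cur = ring.get(i);
            String next = ring.get((i + 1) % ring.size()); // The ring wraps around.

            if (!cur.equals(next))
                hops++;
        }

        return hops;
    }

    /** DC assignments left after stopping node 0 and node cnt - 1. */
    static List<String> survivors(List<String> dcIds) {
        return dcIds.subList(1, dcIds.size() - 1);
    }

    public static void main(String[] args) {
        // An unlucky random draw: only the two stopped nodes are in DC1.
        List<String> unlucky = List.of("DC1", "DC0", "DC0", "DC0", "DC1");

        System.out.println(countCrossDcHops(unlucky));            // 2
        System.out.println(countCrossDcHops(survivors(unlucky))); // 0

        // Same draw, but with index 1 pinned to DC0 and index 2 pinned to DC1.
        List<String> pinned = List.of("DC1", "DC0", "DC1", "DC0", "DC1");

        System.out.println(countCrossDcHops(survivors(pinned)));  // 2
    }
}
```

In this model, if each surviving node is assigned independently by a fair
coin flip, as the rnd.nextBoolean() call suggests, the chance that all n
survivors share one DC is 2 * (1/2)^n, for example 25% for n = 3. Pinning two
surviving indexes to different DCs brings that chance to zero.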



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
