[
https://issues.apache.org/jira/browse/IGNITE-28684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081280#comment-18081280
]
Ignite TC Bot commented on IGNITE-28684:
----------------------------------------
Prepared a minimal draft fix and pushed it to the bot fork.
Branch: https://github.com/ignitetcbot/ignite/tree/ignite-28684-draft
Commit: f948d73cd6 (IGNITE-28684 Fix random MDC ring test failure)
The change keeps the randomized assignment for most nodes, but pins two
surviving node indexes to different data centers. This preserves the intended
MDC ring coverage while removing the invalid all-one-DC outcome that makes
checkHops(2) fail with expected:<2> but was:<0>.
Local notes: git diff --check is clean apart from the usual Windows LF->CRLF
warning. A targeted Maven run reached test startup, but the local JDK 17
invocation lacked the full Ignite --add-opens/--add-exports test JVM
configuration, so it stopped on module-access setup before reaching the
original assertion. The fix has therefore not yet been confirmed by a local
test run.
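
To illustrate why pinning two surviving indexes is sufficient, here is a
minimal standalone sketch. It is not the code from the commit: the class and
method names are hypothetical, the cluster size of 5 is an assumption, and it
models only the DC assignment (no Ignite nodes are started). It assumes the
test stops node 0 and node cnt - 1, as the issue describes, and that cnt >= 4
so that indexes 1 and 2 both survive.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

final class PinnedDcAssignmentSketch {
    static final String DC_ID_0 = "DC0";
    static final String DC_ID_1 = "DC1";

    /** Assigns a DC to each of cnt nodes; indexes 1 and 2 are pinned when pin is true. */
    static String[] assign(int cnt, Random rnd, boolean pin) {
        String[] dcs = new String[cnt];

        for (int i = 0; i < cnt; i++) {
            if (pin && i == 1)
                dcs[i] = DC_ID_0;
            else if (pin && i == 2)
                dcs[i] = DC_ID_1;
            else
                dcs[i] = rnd.nextBoolean() ? DC_ID_0 : DC_ID_1;
        }

        return dcs;
    }

    /** Returns the distinct DCs among nodes left after stopping node 0 and node cnt - 1. */
    static Set<String> survivingDcs(String[] dcs) {
        Set<String> res = new HashSet<>();

        for (int i = 1; i < dcs.length - 1; i++)
            res.add(dcs[i]);

        return res;
    }

    /** Counts trials in which every surviving node ends up in the same DC. */
    static int singleDcTrials(int cnt, int trials, long seed, boolean pin) {
        Random rnd = new Random(seed);
        int bad = 0;

        for (int t = 0; t < trials; t++) {
            if (survivingDcs(assign(cnt, rnd, pin)).size() < 2)
                bad++;
        }

        return bad;
    }

    public static void main(String[] args) {
        int cnt = 5; // Hypothetical cluster size, not taken from the test.

        System.out.println("unpinned single-DC trials: " + singleDcTrials(cnt, 10_000, 42L, false));
        System.out.println("pinned single-DC trials:   " + singleDcTrials(cnt, 10_000, 42L, true));
    }
}
```

With pinning enabled the single-DC count is always zero, because nodes 1 and
2 survive both stops and sit in different DCs. Without pinning, the three
survivors of a five-node cluster share a DC in roughly a quarter of the
trials.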
> MultiDataCenterRingTest.testRing can fail when random cluster has one
> surviving data center
> -------------------------------------------------------------------------------------------
>
> Key: IGNITE-28684
> URL: https://issues.apache.org/jira/browse/IGNITE-28684
> Project: Ignite
> Issue Type: Bug
> Reporter: Ignite TC Bot
> Priority: Major
> Labels: MakeTeamcityGreenAgain, ise
>
> h3. Failure
> TeamCity SPI (Discovery) failure on master-like code path:
> * Test:
> {code}org.apache.ignite.spi.discovery.tcp.MultiDataCenterRingTest.testRing{code}
> * Suite:
> {code}org.apache.ignite.testsuites.IgniteSpiDiscoverySelfTestSuite{code}
> * Build: https://ci2.ignite.apache.org/viewLog.html?buildId=9060064
> * Error: {code}java.lang.AssertionError: expected:<2> but was:<0>{code}
> h3. Likely root cause
> {code}MultiDataCenterRingTest.generateRandomDcOrderCluster(int cnt){code}
> assigns every node to {code}DC0{code} or {code}DC1{code} using
> {code}ThreadLocalRandom{code}. The test then stops node {code}cnt - 1{code}
> and node {code}0{code}, and expects {code}checkHops(2){code} to remain true.
> This is not guaranteed. If all server nodes that survive those stops are
> assigned to the same data center, the ring traversed by
> {code}TcpDiscoveryNodesRing.nextNode(){code} and ordered by
> {code}MdcAwareNodesComparator{code} has no cross-DC boundary, so
> {code}checkHops(2){code} counts {code}0{code} hops. That exactly matches the
> TeamCity failure.
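
The hop count described above can be illustrated with a simplified model.
This sketch is an assumption about what is being counted, not the actual
checkHops or MdcAwareNodesComparator code: it groups nodes by DC id with a
plain string sort and counts adjacent ring pairs, including the wrap-around
pair, whose DC ids differ. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class CrossDcHopsSketch {
    /** Counts adjacent ring pairs (with wrap-around) whose DC ids differ, after grouping nodes by DC. */
    static int crossDcHops(List<String> dcIds) {
        List<String> ring = new ArrayList<>(dcIds);

        // Stand-in for a DC-aware ordering: nodes of the same DC become neighbors.
        Collections.sort(ring);

        int hops = 0;

        for (int i = 0; i < ring.size(); i++) {
            String cur = ring.get(i);
            String next = ring.get((i + 1) % ring.size());

            if (!cur.equals(next))
                hops++;
        }

        return hops;
    }

    public static void main(String[] args) {
        System.out.println(crossDcHops(List.of("DC0", "DC1", "DC0"))); // prints 2
        System.out.println(crossDcHops(List.of("DC1", "DC1", "DC1"))); // prints 0
    }
}
```

In this model a ring with two DCs always has exactly two boundaries, while a
ring whose survivors all share one DC has none, which corresponds to the
expected:<2> but was:<0> assertion.
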
> h3. Minimal fix
> Make the test topology deterministic enough to preserve both DCs among nodes
> that survive the explicit stops. For example, force two surviving node
> indexes, such as {code}1{code} and {code}2{code}, into different DCs, and
> keep the existing random assignment for the remaining nodes:
> {code:java}
> String dcId;
>
> if (i == 1)
>     dcId = DC_ID_0;
> else if (i == 2)
>     dcId = DC_ID_1;
> else
>     dcId = rnd.nextBoolean() ? DC_ID_0 : DC_ID_1;
>
> System.setProperty(IgniteSystemProperties.IGNITE_DATA_CENTER_ID, dcId);
> {code}
> This keeps the random join order coverage but removes the invalid all-one-DC
> outcome for the second assertion.
> h3. Files to inspect
> *
> {code}modules/core/src/test/java/org/apache/ignite/spi/discovery/tcp/MultiDataCenterRingTest.java{code}
> *
> {code}modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/internal/TcpDiscoveryNodesRing.java{code}
> *
> {code}modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/internal/MdcAwareNodesComparator.java{code}
> h3. Retry
> Retry is justified as a short-term mitigation because the failure depends on
> random DC assignment. It does not fix the test defect.