Hi all,

I recently worked on speeding up the IoTDB CI pipeline. The main bottleneck
was that the DataNode unit tests ran serially, so wall-clock time was
dominated by JVM cold-start cost. Two PRs address this by enabling parallel
surefire forks, and the savings are large enough that I'd like to share
them with the community.

Background

iotdb-core/datanode/pom.xml already declares a per-fork working directory:

<workingDirectory>${project.build.directory}/fork_${surefire.forkNumber}</workingDirectory>

${surefire.forkNumber} only makes a difference when forkCount > 1; with a
single fork, every test class runs in the same directory. In other words,
the project was already plumbed for filesystem isolation between parallel
forks but never enabled them. Combined with surefire's default forkCount=1,
this meant only one test class ran at any moment on a 4-vCPU runner.

Note that reuseForks=false still preserves per-class isolation: each test
class runs in a fresh JVM, so static singletons reset per class as before.
Parallelism only happens across forks, so the existing test isolation
semantics are unchanged.

PR 17697: Codecov CI Speedup (Merged)

Link: https://github.com/apache/iotdb/pull/17697

Modified .github/workflows/sonar-codecov.yml to pass -DforkCount=4 to the
maven invocation.

Measured impact:

┌────────────────┬──────────┬─────────────┬─────────────┐
│     Phase      │ Baseline │ forkCount=2 │ forkCount=4 │
├────────────────┼──────────┼─────────────┼─────────────┤
│ DataNode UTs   │ 51 min   │ 34 min      │ 32 min      │
├────────────────┼──────────┼─────────────┼─────────────┤
│ ConfigNode UTs │ 6 min    │ 4 min       │ 3 min       │
├────────────────┼──────────┼─────────────┼─────────────┤
│ Consensus UTs  │ 4 min    │ 2 min       │ 2 min       │
├────────────────┼──────────┼─────────────┼─────────────┤
│ Job total      │ 66 min   │ 45 min      │ 41 min      │
└────────────────┴──────────┴─────────────┴─────────────┘

Net: 66 → 41 min (–38%). forkCount=4 saturates the 4-vCPU runner; the floor
is now set by long-tail tests (the slowest single class takes ~127 s).
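
To illustrate why the long tail sets the floor, here is a minimal sketch
that models forks as workers pulling test classes from a shared queue. The
class durations are hypothetical (only the 127 s figure comes from the
measurements above), and the queue model is a simplification, not
surefire's actual scheduler.

```python
import heapq


def simulate_wall_clock(durations, fork_count):
    """Wall-clock seconds when each free fork takes the next class in order."""
    # Each heap entry is the time at which one fork becomes free again.
    free_at = [0.0] * fork_count
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available fork
        heapq.heappush(free_at, start + duration)
    return max(free_at)


def lower_bound(durations, fork_count):
    """No schedule can beat perfect division of work or the longest class."""
    return max(sum(durations) / fork_count, max(durations))


# Hypothetical workload: 40 short classes plus one 127 s long-tail class.
durations = [10.0] * 40 + [127.0]

for forks in (1, 2, 4, 8):
    wall = simulate_wall_clock(durations, forks)
    bound = lower_bound(durations, forks)
    print(f"forkCount={forks}: simulated {wall:.0f} s, lower bound {bound:.1f} s")
```

In this toy workload the lower bound stops improving once total work
divided by the fork count falls below 127 s, which is the same flattening
visible between the forkCount=2 and forkCount=4 columns above.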

PR 17698: Unit-Test Pipeline Speedup (Pending review)

Link: https://github.com/apache/iotdb/pull/17698

Modified .github/workflows/unit-test.yml to pass -DforkCount=3 (later
bumped to 4) for the datanode UT invocation.

Baseline:

┌──────────────────────────────────────────┬────────────┐
│                   Job                    │ Wall clock │
├──────────────────────────────────────────┼────────────┤
│ unit-test (17, windows-latest, datanode) │ ~56 min    │
├──────────────────────────────────────────┼────────────┤
│ unit-test (17, ubuntu-latest, datanode)  │ ~38 min    │
└──────────────────────────────────────────┴────────────┘

After forkCount=3: Windows dropped to ~33 min (–41%) and Ubuntu to ~22 min
(about –42%). Because the whole Unit-Test pipeline is gated by the Windows
datanode job, its total wall clock is expected to drop from ~56 min to
~22-28 min.

Safety Audit

Cross-fork conflict risk has been audited:

- Datanode UTs do not bind sockets (no ServerSocket / DatagramSocket /
.bind() / TServer.serve()).
- No test uses java.io.tmpdir or fixed absolute paths; relative paths are
isolated by the per-fork workingDirectory.
- reuseForks=false preserves per-class isolation: static singletons still
reset per class.
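
For anyone who wants to re-run a similar check, here is a minimal sketch of
a textual scan for the patterns listed above. It is not the script used for
the audit; the pattern table and the function name scan_tests are
illustrative, and matching is purely textual, so hits in comments or
strings still need manual review.

```python
import re
from pathlib import Path

# Patterns mirror the audit list above. Plain text matching, like grep:
# "ServerSocket" also matches names that merely contain it.
RISKY_PATTERNS = {
    "ServerSocket": re.compile(r"ServerSocket"),
    "DatagramSocket": re.compile(r"DatagramSocket"),
    ".bind()": re.compile(r"\.bind\s*\("),
    ".serve()": re.compile(r"\.serve\s*\("),
    "java.io.tmpdir": re.compile(r"java\.io\.tmpdir"),
}


def scan_tests(root):
    """Return (file, line number, pattern name) for every textual hit."""
    hits = []
    for path in sorted(Path(root).rglob("*.java")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for name, pattern in RISKY_PATTERNS.items():
                if pattern.search(line):
                    hits.append((str(path), lineno, name))
    return hits
```

Pointing scan_tests at the datanode test sources (for example
iotdb-core/datanode/src/test/java, assuming the standard Maven layout) and
getting an empty list would be consistent with the audit result; any hit is
only a candidate for review, not proof of a conflict.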

Resource Budget

On 16 GB GitHub-hosted runners: 4 × -Xmx1024m plus per-JVM overhead ≈ 5 GB,
which leaves comfortable memory headroom.
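
The budget arithmetic can be sketched as follows. The 256 MB per-JVM
overhead is an assumption chosen so that four forks total the ~5 GB quoted
above; it is not a measured value, and the estimate ignores the Maven
parent JVM and the OS.

```python
RUNNER_GB = 16  # GitHub-hosted runner memory, as stated above


def fork_memory_gb(fork_count, heap_mb=1024, overhead_mb=256):
    """Rough memory footprint of the forked test JVMs, in GB.

    overhead_mb is an assumed non-heap cost per JVM (metaspace, thread
    stacks, etc.), not a measurement.
    """
    return fork_count * (heap_mb + overhead_mb) / 1024


for forks in (1, 2, 3, 4):
    used = fork_memory_gb(forks)
    print(f"forkCount={forks}: ~{used:.2f} GB used, ~{RUNNER_GB - used:.2f} GB left")
```

Under these assumptions forkCount=4 uses about 5 GB, so on a runner of this
size CPU count rather than memory is the limiting factor.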

Why workflow-only

Setting the flag in the workflow rather than in the pom keeps local mvn
test behavior unchanged, so contributors' laptops aren't saturated by 3-4
parallel JVMs. It's also easy to revert if anything regresses.

---
Reviews on PR 17698 are very welcome. I'm also happy to discuss more
aggressive fork strategies (e.g., a per-runner forkCount, or enabling
parallel forks in other workflows) if anyone has ideas.

Best regards,

Yuan Tian
