Hi all,

Following up on the AINode CI speedup shared earlier this month, I have
another round of CI improvements ready in PR #17692:
https://github.com/apache/iotdb/pull/17692

---
Goal

Two PR-check pipelines, Cluster IT - 1C1D and Table Cluster IT - 1C1D,
were dragged down by their Windows runners, which are 67–77% slower than
Ubuntu for the same workload. Pipeline wall clock is max(Ubuntu, Windows),
so even though Ubuntu was already fast (~49 / ~39 min respectively),
Windows pulled the totals up to ~87 / ~65 min.
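
As a quick sanity check of the arithmetic above, here is a minimal Python
sketch. It only uses the rounded minute figures quoted in this mail; the
function names are mine and not part of the PR.

```python
# Approximate durations in minutes, as quoted in this mail.
ubuntu = {"Cluster IT - 1C1D": 49, "Table Cluster IT - 1C1D": 39}
windows = {"Cluster IT - 1C1D": 87, "Table Cluster IT - 1C1D": 65}


def wall_clock(ubuntu_min, windows_min):
    """The two OS jobs run in parallel, so the slower one sets the total."""
    return max(ubuntu_min, windows_min)


def slowdown_pct(ubuntu_min, windows_min):
    """How much slower Windows is than Ubuntu, in percent."""
    return (windows_min / ubuntu_min - 1) * 100


for name in ubuntu:
    total = wall_clock(ubuntu[name], windows[name])
    slower = slowdown_pct(ubuntu[name], windows[name])
    print(f"{name}: wall clock ~{total} min, Windows ~{slower:.0f}% slower")
```

With the rounded inputs this gives roughly 78% and 67%, close to the
67–77% range quoted above.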

---
Approach

Split each Windows job into 3 parallel matrix shards. Each shard runs ~1/3
of the IT classes selected by category annotation (LocalStandaloneIT /
TableLocalStandaloneIT), distributed by hash-mod. Ubuntu stays as a single
job: it wasn't the bottleneck, so sharding it would just add matrix
scheduling overhead.
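
To illustrate the hash-mod idea, here is a minimal Python sketch. It is
not the code from the PR: the choice of CRC32 as the hash and the class
names in the example are assumptions made only for illustration.

```python
import zlib


def shard_of(class_name: str, total_shards: int) -> int:
    """Map a class name to a shard index in [0, total_shards)."""
    # CRC32 is stable across runs and machines. Python's built-in hash()
    # for str is randomized per process, so it would not be suitable here.
    return zlib.crc32(class_name.encode("utf-8")) % total_shards


def select_shard(class_names, shard_index: int, total_shards: int):
    """Return the classes that belong to the given shard, sorted."""
    return sorted(
        name for name in class_names
        if shard_of(name, total_shards) == shard_index
    )


# Hypothetical class names, for illustration only.
classes = ["ExampleInsertIT", "ExampleDeleteIT", "ExampleQueryIT",
           "ExampleSchemaIT", "ExampleAlignIT", "ExampleLastIT"]

for index in range(3):
    print(index, select_shard(classes, index, 3))
```

Because every shard applies the same deterministic function to the same
list, each class lands in exactly one shard without any coordination
between the jobs. The split is only roughly even, which is why each shard
gets about 1/3 of the classes rather than exactly 1/3.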

The shard list is generated at runtime into $RUNNER_TEMP/it-shard.txt and
consumed via Maven's -Dfailsafe.includesFile. This avoids Windows'
command-line length limit and stays robust as the test suite grows.
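
A minimal Python sketch of the file hand-off, under stated assumptions:
the PR itself does this in the workflow script, the fallback to the system
temp directory is my addition for local runs, and the **/ClassName.java
pattern form is assumed rather than taken from the PR.

```python
import os
import tempfile
from pathlib import Path


def write_shard_file(class_names, directory=None) -> Path:
    """Write one include pattern per line and return the file path."""
    # RUNNER_TEMP is set by GitHub Actions. Falling back to the system
    # temp directory is an assumption so the sketch also runs locally.
    base = Path(directory or os.environ.get("RUNNER_TEMP")
                or tempfile.gettempdir())
    path = base / "it-shard.txt"
    # Assumed pattern form: one "**/ClassName.java" entry per line.
    lines = [f"**/{name}.java" for name in class_names]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


def maven_arg(path: Path) -> str:
    """Build the single, short argument that points Maven at the file."""
    return f"-Dfailsafe.includesFile={path}"


# Hypothetical class names, for illustration only.
shard_file = write_shard_file(["ExampleInsertIT", "ExampleQueryIT"])
print(maven_arg(shard_file))
```

The command line stays the same length no matter how many classes a shard
contains, which is the point of passing a file instead of a long
comma-separated list.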

---
Results

┌─────────────────────────┬─────────┬─────────┬───────┐
│        Pipeline         │ Before  │  After  │ Saved │
├─────────────────────────┼─────────┼─────────┼───────┤
│ Cluster IT - 1C1D       │ ~87 min │ ~48 min │ 45%   │
├─────────────────────────┼─────────┼─────────┼───────┤
│ Table Cluster IT - 1C1D │ ~65 min │ ~40 min │ 38%   │
└─────────────────────────┴─────────┴─────────┴───────┘

Both pipelines are now capped by Ubuntu: the Windows shards finish 10–16
min ahead of it. 3-way sharding is the sweet spot; going to 4 or 5 shards
would only add matrix scheduling cost without reducing wall clock, unless
the Ubuntu side is optimized next.
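
For reference, the savings in the table follow directly from the before
and after times. A small Python check, using only the rounded minutes
quoted above (the function name is mine):

```python
def saved_pct(before_min, after_min):
    """Share of wall-clock time saved, rounded to a whole percent."""
    return round((before_min - after_min) / before_min * 100)


print(saved_pct(87, 48))  # Cluster IT - 1C1D
print(saved_pct(65, 40))  # Table Cluster IT - 1C1D
```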

---
Pitfalls worth sharing

For anyone applying a similar approach elsewhere, two non-obvious bugs came
up during this work:

1. find ... | xargs -0 grep -l exits 123 on Windows Git Bash. Windows has a
much smaller ARG_MAX than Linux, so xargs splits the file list into several
batches and runs grep once per batch. Any batch with zero matches makes
grep return 1, xargs then returns 123, and set -o pipefail fails the step.
Fix: use a single grep -rl --include=... call instead.

2. Apache RAT flags any generated file inside the repo. Our first attempt,
integration-test/it-shard.txt, got reported as "unapproved license".
target/ is RAT-excluded, but mvn clean would wipe the file before it is
read. Fix: write to $RUNNER_TEMP/it-shard.txt, the runner's temp directory,
which lives outside the repo entirely.
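
The exit-code chain in pitfall 1 can be modelled in a few lines of Python.
This is a toy model of the documented exit-status rules only; it does not
invoke grep or xargs, and the batch boundaries are made up for
illustration.

```python
def grep_exit(batch):
    """Exit status of one grep -l call: 0 if any file matches, else 1."""
    return 0 if any(batch) else 1


def xargs_exit(batches):
    """xargs returns 123 if any invocation exited with status 1 to 125."""
    statuses = [grep_exit(batch) for batch in batches]
    return 123 if any(1 <= status <= 125 for status in statuses) else 0


# True means "this file contains the annotation we are looking for".
files = [True, False, True, False, False, False]

one_batch = [files]                  # large ARG_MAX: one grep call
two_batches = [files[:3], files[3:]]  # small ARG_MAX: the list is split

print(xargs_exit(one_batch))    # 0
print(xargs_exit(two_batches))  # 123
```

The same files produce different exit codes purely because of where the
batch boundary falls, which is why the failure only showed up on Windows.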

---
What's next

A preview of where I'd like to look in future rounds:

- Unit-Test on Windows (~47 min, ~23% slower than Ubuntu): the gap is
smaller than for the IT pipelines, but if anyone knows why Surefire runs
slower on Windows, happy to chat.
- Multi-Cluster IT: not yet profiled; I suspect cluster startup overhead.
- AINode cold build still costs ~7 min when the PyInstaller cache misses;
worth raising the cache hit rate across PRs/branches.

Reviews on the PR are very welcome. Happy to walk through any of the
details online if useful.

Best regards,

Yuan Tian
