Hi all,
I'd like to share a recent optimization to the Cluster IT - 1C1D1A (AINode)
pipeline that brought its runtime down from ~52 minutes to ~27 minutes (a
48% reduction).[1]
---
Goal
The AINode CI was the bottleneck of the overall PR check pipeline, often
running 50+ minutes while other jobs finished in 30 minutes or less.
Log analysis showed that cluster startup took about 74% of the IT phase
(startup plus test execution, roughly 25 of 34 minutes). Measured against
the whole ~52-minute run, PyInstaller packaging of the AINode Python binary
took about 22% and actual test execution only about 17%.
---
Approach
Two targeted changes (PR #17687):
1. Test consolidation: shared cluster across test classes
Previously, each of the 7 AINode IT test classes started its own 1C+1D+1A
cluster, for 8 cluster startups per run in total (AINodeClusterConfigIT
started one per @Test method because it used @Before/@After). I merged the 5
compatible classes (DeviceManage, ModelManage, CallInference, Forecast,
InstanceManagement) into a single AINodeSharedClusterIT that uses
@BeforeClass/@AfterClass, so all 15 test methods share one cluster.
AINodeClusterConfigIT was also converted to a class-level lifecycle.
AINodeConcurrentForecastIT stays separate because it needs a different data
setup and puts heavy concurrent load on the cluster.
Cluster startups: 8 → 3.
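The lifecycle difference behind this change can be sketched in a few lines. The real tests are Java/JUnit (@Before/@After vs. @BeforeClass/@AfterClass); the block below is only a Python unittest analogue, and FakeCluster, the test class names, and count_startups are hypothetical names introduced for illustration, not code from the PR.

```python
import unittest


class FakeCluster:
    """Hypothetical stand-in for a 1C+1D+1A cluster; it only counts startups."""

    startups = 0

    @classmethod
    def start(cls):
        cls.startups += 1
        return cls()

    def stop(self):
        pass


class PerMethodClusterTest(unittest.TestCase):
    """Analogue of JUnit @Before/@After: one cluster per test method."""

    def setUp(self):
        self.cluster = FakeCluster.start()

    def tearDown(self):
        self.cluster.stop()

    def test_a(self):
        self.assertIsNotNone(self.cluster)

    def test_b(self):
        self.assertIsNotNone(self.cluster)


class SharedClusterTest(unittest.TestCase):
    """Analogue of JUnit @BeforeClass/@AfterClass: one cluster per class."""

    @classmethod
    def setUpClass(cls):
        cls.cluster = FakeCluster.start()

    @classmethod
    def tearDownClass(cls):
        cls.cluster.stop()

    def test_a(self):
        self.assertIsNotNone(self.cluster)

    def test_b(self):
        self.assertIsNotNone(self.cluster)


def count_startups(test_case):
    """Run one TestCase class and return how many clusters it started."""
    FakeCluster.startups = 0
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case)
    result = unittest.TestResult()
    suite.run(result)
    assert result.wasSuccessful()
    return FakeCluster.startups
```

With two test methods each, the per-method class starts two clusters and the shared class starts one. The trade-off is isolation: tests that share a cluster must not depend on a clean initial state, which is presumably why only the compatible classes were merged.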
2. PyInstaller dist caching: skip the rebuild when sources are unchanged
PyInstaller's Analysis phase scans thousands of hidden imports from
torch/transformers/numpy and takes ~10 minutes per run, even though AINode
source rarely changes between PRs. build_binary.py now computes a SHA256
hash over all relevant inputs (Python sources, .spec, pyproject.toml,
poetry.lock, the copied client-py sources, plus the Python interpreter
version) and caches the dist/ output at
~/.cache/iotdb-ainode-build/dist-cache/. On a cache hit, the dist is
restored directly and PyInstaller is skipped entirely.
This works because the pipeline runs on a self-hosted runner (ci-182) where
/root/.cache/ persists across runs, so the cache can be reused between CI
invocations.
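A minimal sketch of this kind of content-addressed caching, assuming a layout of one cached directory per hash. The function names (compute_cache_key, restore_or_build), their parameters, and the directory layout are assumptions made for illustration; the actual build_binary.py in the PR may be structured differently.

```python
import hashlib
import shutil
from pathlib import Path


def compute_cache_key(root, relative_paths, python_version):
    """Return a SHA256 hex digest over the interpreter version and file contents.

    Paths are hashed relative to `root` and in sorted order, so the key does
    not depend on the checkout location or on the order files were listed in.
    """
    digest = hashlib.sha256()
    digest.update(python_version.encode("utf-8"))
    for rel in sorted(relative_paths):
        # NUL separators keep path/content boundaries unambiguous.
        digest.update(b"\0" + rel.encode("utf-8") + b"\0")
        digest.update((Path(root) / rel).read_bytes())
    return digest.hexdigest()


def restore_or_build(cache_root, key, dist_dir, build):
    """Restore dist_dir from the cache on a hit, else run build() and cache it.

    Returns True on a cache hit and False on a miss.
    """
    cached = Path(cache_root) / key
    dist_dir = Path(dist_dir)
    if cached.is_dir():
        if dist_dir.exists():
            shutil.rmtree(dist_dir)
        shutil.copytree(cached, dist_dir)
        return True

    build()  # hypothetical callback that runs the slow packaging step

    # Copy under a temporary name, then rename, so an interrupted copy is
    # never mistaken for a complete cache entry.
    tmp = cached.with_name(cached.name + ".tmp")
    if tmp.exists():
        shutil.rmtree(tmp)
    tmp.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(dist_dir, tmp)
    tmp.rename(cached)
    return False
```

Passing the interpreter version into the hash is what makes an interpreter upgrade produce a new key. A cache like this grows by one entry per distinct key, so some eviction policy would be needed on a long-lived runner.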
---
Results
┌──────────────────────────┬──────────┬────────────────────┐
│ Phase                    │ Original │ Optimized          │
├──────────────────────────┼──────────┼────────────────────┤
│ Cluster startups (total) │ 25 min   │ 11 min             │
├──────────────────────────┼──────────┼────────────────────┤
│ PyInstaller packaging    │ 11 min   │ <1 min (cache hit) │
├──────────────────────────┼──────────┼────────────────────┤
│ Actual test execution    │ 9 min    │ 7 min              │
├──────────────────────────┼──────────┼────────────────────┤
│ Maven build / overhead   │ 7 min    │ 8 min              │
├──────────────────────────┼──────────┼────────────────────┤
│ Total                    │ ~52 min  │ ~27 min            │
└──────────────────────────┴──────────┴────────────────────┘
When AINode source changes (cache miss), the run still benefits from test
consolidation alone and lands around 37 min (-29%).
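The headline percentages can be re-derived from the per-phase minutes in the results table. The sketch below does that arithmetic; it assumes "<1 min" can be rounded up to 1, and it reads the cache-miss figure as the optimized run with the original PyInstaller time put back, which is my interpretation rather than something stated above.

```python
def reduction_percent(before_min, after_min):
    """Percentage reduction from before_min to after_min, rounded to an integer."""
    return round(100 * (before_min - after_min) / before_min)


# Per-phase minutes from the results table.
original = {"cluster_startups": 25, "pyinstaller": 11, "tests": 9, "maven_overhead": 7}
# "<1 min" on a cache hit is rounded up to 1 here (assumption).
optimized = {"cluster_startups": 11, "pyinstaller": 1, "tests": 7, "maven_overhead": 8}

total_before = sum(original.values())
total_cache_hit = sum(optimized.values())
# Cache miss: same optimized run, but PyInstaller takes its original time.
total_cache_miss = total_cache_hit - optimized["pyinstaller"] + original["pyinstaller"]

print(total_before, total_cache_hit, total_cache_miss)
print(reduction_percent(total_before, total_cache_hit))
print(reduction_percent(total_before, total_cache_miss))
```

Under these assumptions the totals come out to 52, 27, and 37 minutes, with reductions of 48% and 29%.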
Notes:
- No test was deleted; only the class organization changed. All 18 original
test methods still run.
- The cache key includes the Python interpreter version, so interpreter
upgrades invalidate the cache automatically.
PR: https://github.com/apache/iotdb/pull/17687
---
What's Next
With 1C1D1A optimized, the remaining CI bottlenecks (based on recent runs) are:
┌─────────────────────────┬───────────────┬─────────────────────────────────────────────────────────┐
│ Workflow                │ Avg. Duration │ Main Bottleneck                                         │
├─────────────────────────┼───────────────┼─────────────────────────────────────────────────────────┤
│ Cluster IT - 1C1D       │ ~89 min       │ Windows job (106 min) is 2× slower than Ubuntu (49 min) │
├─────────────────────────┼───────────────┼─────────────────────────────────────────────────────────┤
│ Multi-Cluster IT        │ ~69 min       │ dual-table-manual-basic/enhanced jobs (~64 min each)    │
├─────────────────────────┼───────────────┼─────────────────────────────────────────────────────────┤
│ Table Cluster IT - 1C1D │ ~63 min       │ Same Windows slowness as 1C1D                           │
├─────────────────────────┼───────────────┼─────────────────────────────────────────────────────────┤
│ Sonar-Codecov           │ ~62 min       │ codecov job 64 min vs sonar only 8 min                  │
├─────────────────────────┼───────────────┼─────────────────────────────────────────────────────────┤
│ Unit-Test               │ ~51 min       │ Windows datanode job (53 min)                           │
└─────────────────────────┴───────────────┴─────────────────────────────────────────────────────────┘
Possible directions I'm considering:
- Sharding the Windows IT jobs via a matrix to parallelize the slowest
Windows runs across multiple GitHub-hosted VMs.
- Splitting the codecov job or enabling incremental coverage to bring
Sonar-Codecov down to ~10 min.
- Further consolidation in Multi-Cluster IT to shorten the two long-running
dual-table jobs.
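For the sharding idea, one possible way for each matrix job to pick its share of test classes is a stable hash of the class name. This is a hypothetical sketch: shard_of and select_for_shard are names introduced here, nothing like this exists in the pipeline yet, and the shard index and count are assumed to be passed in from the matrix.

```python
import hashlib


def shard_of(test_class, total_shards):
    """Map a test class name to a shard index in [0, total_shards).

    SHA256 is used instead of Python's built-in hash(), which is salted per
    process and would assign classes differently on every VM.
    """
    digest = hashlib.sha256(test_class.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % total_shards


def select_for_shard(test_classes, shard_index, total_shards):
    """Return the sorted subset of test classes that belongs to one shard."""
    if not 0 <= shard_index < total_shards:
        raise ValueError("shard_index must be in [0, total_shards)")
    return sorted(c for c in test_classes if shard_of(c, total_shards) == shard_index)
```

Every class lands in exactly one shard without any coordination between jobs. Hash-based assignment does not balance by duration, though, so a few slow classes can still end up together; assigning by recorded runtimes would balance better at the cost of maintaining that data.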
Suggestions, edge cases, and counter-arguments are all very welcome.
[1] https://github.com/apache/iotdb/pull/17687
Best regards,
Yuan Tian