DanielLeens commented on PR #10667:
URL: https://github.com/apache/seatunnel/pull/10667#issuecomment-4178090521
<!-- review-agent -->
# PR 解决了什么问题
- `用户痛点`: In MySQL-CDC pipelines, users often need to trace each change back
to **binlog coordinates / GTID / source commit time** for auditing, debugging,
idempotency, or downstream ordering. Today, SeaTunnel’s `Metadata` transform
can surface basic metadata (e.g., `EventTime`, `Delay`), but cannot directly
expose `source.ts_ms`, `binlog file/pos/row`, or `gtid` as regular columns.
- `修复方式`: This PR adds new metadata keys (`SourceTimestamp`, `BinlogFile`,
`BinlogPos`, `BinlogRow`, `Gtid`) to the SeaTunnel API, declares them as
`MetadataSchema` columns for Debezium-based CDC sources, and populates them
during `SourceRecord -> SeaTunnelRow` deserialization (via
`SeaTunnelRowDebeziumDeserializeSchema`) by writing into
`SeaTunnelRow.getOptions()`. Docs (EN/ZH) and unit tests for extraction helpers
are added/updated.
- `一句话`: Expose Debezium CDC source metadata (`source.ts_ms` + MySQL
binlog/GTID) as SeaTunnel metadata fields so `Metadata` transform can turn them
into real columns.
# 一、代码变更审查
## 1.1 核心逻辑分析
#### 变更内容精确描述
Core changes are concentrated in:
- **API layer**
- Add new metadata keys in
`seatunnel-api/src/main/java/org/apache/seatunnel/api/table/type/CommonOptions.java`
(e.g., `BINLOG_FILE`, `GTID`, `SOURCE_TIMESTAMP`).
- Add setters in
`seatunnel-api/src/main/java/org/apache/seatunnel/api/table/type/MetadataUtil.java`
to write these keys into `SeaTunnelRow.options`.
- **CDC connector base (Debezium-based)**
- Declare the new metadata columns in produced `MetadataSchema` in
`seatunnel-connectors-v2/connector-cdc/connector-cdc-base/src/main/java/org/apache/seatunnel/connectors/cdc/base/source/IncrementalSource.java`
(`getMetadataColumns()`).
- Extract MySQL binlog/gtid fields from Debezium `SourceRecord` in
`seatunnel-connectors-v2/connector-cdc/connector-cdc-base/src/main/java/org/apache/seatunnel/connectors/cdc/base/utils/SourceRecordUtils.java`.
- Populate row options in
`seatunnel-connectors-v2/connector-cdc/connector-cdc-base/src/main/java/org/apache/seatunnel/connectors/cdc/debezium/row/SeaTunnelRowDebeziumDeserializeSchema.java`
(`deserializeDataChangeRecord()`).
- **Docs & tests**
- Update `docs/en/transforms/metadata.md` and
`docs/zh/transforms/metadata.md`.
- Add `SourceRecordUtilsTest` under `connector-cdc-base` tests.
#### 修改前代码片段
1) Metadata schema for CDC only had `EventTime` and `Delay`:
```java
// origin/dev
//
seatunnel-connectors-v2/connector-cdc/connector-cdc-base/src/main/java/org/apache/seatunnel/connectors/cdc/base/source/IncrementalSource.java:153
private MetadataSchema getMetadataColumns() {
List<Column> metadata = new ArrayList<>();
metadata.add(MetadataColumn.of(CommonOptions.EVENT_TIME.getName(),
BasicType.LONG_TYPE, (Long) null, true, null, null));
metadata.add(MetadataColumn.of(CommonOptions.DELAY.getName(),
BasicType.LONG_TYPE, (Long) null, true, null, null));
return MetadataSchema.builder().columns(metadata).build();
}
```
2) Debezium rows only got `Delay` and `EventTime` written into options:
```java
// origin/dev
//
seatunnel-connectors-v2/connector-cdc/connector-cdc-base/src/main/java/org/apache/seatunnel/connectors/cdc/debezium/row/SeaTunnelRowDebeziumDeserializeSchema.java:221
MetadataUtil.setDelay(insert, delay);
MetadataUtil.setEventTime(insert, fetchTimestamp);
collector.collect(insert);
```
#### 修改后代码片段
1) CDC produced metadata schema now declares 5 extra fields:
```java
//
seatunnel-connectors-v2/connector-cdc/connector-cdc-base/src/main/java/org/apache/seatunnel/connectors/cdc/base/source/IncrementalSource.java:153
metadata.add(MetadataColumn.of(CommonOptions.BINLOG_FILE.getName(),
BasicType.STRING_TYPE, (Long) null, true, null, null));
metadata.add(MetadataColumn.of(CommonOptions.BINLOG_POS.getName(),
BasicType.LONG_TYPE, (Long) null, true, null, null));
metadata.add(MetadataColumn.of(CommonOptions.BINLOG_ROW.getName(),
BasicType.INT_TYPE, (Long) null, true, null, null));
metadata.add(MetadataColumn.of(CommonOptions.GTID.getName(),
BasicType.STRING_TYPE, (Long) null, true, null, null));
metadata.add(MetadataColumn.of(CommonOptions.SOURCE_TIMESTAMP.getName(),
BasicType.LONG_TYPE, (Long) null, true, null, null));
```
2) Debezium deserialization now extracts and sets metadata:
```java
//
seatunnel-connectors-v2/connector-cdc/connector-cdc-base/src/main/java/org/apache/seatunnel/connectors/cdc/debezium/row/SeaTunnelRowDebeziumDeserializeSchema.java:227
String binlogFile = SourceRecordUtils.getBinlogFile(record);
Long binlogPos = SourceRecordUtils.getBinlogPos(record);
Integer binlogRow = SourceRecordUtils.getBinlogRow(record);
String gtid = SourceRecordUtils.getGtid(record);
// ... later per emitted row
// SeaTunnelRowDebeziumDeserializeSchema.java:238
MetadataUtil.setSourceTimestamp(insert, messageTimestamp);
MetadataUtil.setBinlogFile(insert, binlogFile);
MetadataUtil.setBinlogPos(insert, binlogPos);
MetadataUtil.setBinlogRow(insert, binlogRow);
MetadataUtil.setGtid(insert, gtid);
```
#### 关键发现
- **Main path hit (DEFAULT format)**: For Debezium-based CDC connectors
using `format = DEFAULT`, normal change events hit
`SeaTunnelRowDebeziumDeserializeSchema.deserializeDataChangeRecord()` and will
populate the new metadata into `SeaTunnelRow.options`
(`SeaTunnelRowDebeziumDeserializeSchema.java:204-279`).
- **Main path NOT hit (COMPATIBLE_DEBEZIUM_JSON)**: If the CDC connector
runs with `format = COMPATIBLE_DEBEZIUM_JSON`, this PR’s new metadata fields
are **not** populated (it uses `DebeziumJsonDeserializeSchema`, and
`IncrementalSource` skips `CatalogTable.withMetadata`)
(`IncrementalSource.java:139-149`, `DebeziumJsonDeserializeSchema.java:62-79`).
- **MySQL-specific extraction**: `Binlog*` / `Gtid` are extracted from
Debezium MySQL `source` struct fields (`SourceRecordUtils.java:235-302`).
Non-MySQL Debezium connectors will typically see `null`.
- **Backward-compatibility risk**: New `CommonOptions` enum constants are
inserted **in the middle**, shifting ordinals for existing constants
(`CommonOptions.java:59-98`). This can break binary compatibility for
third-party plugins compiled against older versions.
#### 逻辑正确性深度分析
- The Debezium CDC runtime flow for DEFAULT format is:
- `IncrementalSourceRecordEmitter.emitElement()` forwards each
`SourceRecord` into the Debezium deserializer
(`IncrementalSourceRecordEmitter.java:199-202`).
- `SeaTunnelRowDebeziumDeserializeSchema.deserialize()` routes only
**data-change** records into `deserializeDataChangeRecord()`
(`SeaTunnelRowDebeziumDeserializeSchema.java:99-122`,
`SeaTunnelRowDebeziumDeserializeSchema.java:204-283`).
- `deserializeDataChangeRecord()` computes:
- `fetchTimestamp = payload.ts_ms`
(`SourceRecordUtils.getFetchTimestamp`, `SourceRecordUtils.java:93-100`)
- `messageTimestamp = payload.source.ts_ms`
(`SourceRecordUtils.getMessageTimestamp`, `SourceRecordUtils.java:74-87`)
- `delay = fetchTimestamp - messageTimestamp` (stored as `Delay`)
- MySQL-only metadata via `getBinlogFile/Pos/Row/getGtid`
(`SourceRecordUtils.java:235-302`)
- Then it sets these values into each emitted `SeaTunnelRow`’s `options`
map via `MetadataUtil.*` (`MetadataUtil.java:49-79`).
- When will it **not** take effect?
- Heartbeat records are not emitted by this deserializer (it only checks
`isDataChangeRecord` and schema change events)
(`SeaTunnelRowDebeziumDeserializeSchema.java:116-122`).
- `COMPATIBLE_DEBEZIUM_JSON` format uses `DebeziumJsonDeserializeSchema`
which does not populate row options for these keys
(`DebeziumJsonDeserializeSchema.java:62-69`).
- How does the user actually surface these fields?
- `MetadataTransform` reads values (for non-special keys) directly from
`inputRow.getOptions().get(<MetadataKey>)`
(`seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/metadata/MetadataTransform.java:81-103`).
- It also validates that the input `CatalogTable.metadataSchema` contains
the metadata column (`MetadataTransform.java:129-139`). This is why
`IncrementalSource.getMetadataColumns()` needs to declare the new columns.
#### 完整的 CDC metadata 链路
```text
Debezium emits SourceRecord
-> IncrementalSourceRecordEmitter.emitElement()
[connector-cdc-base/.../IncrementalSourceRecordEmitter.java:199]
-> DebeziumDeserializationSchema.deserialize(record, collector)
-> SeaTunnelRowDebeziumDeserializeSchema.deserialize()
[connector-cdc-base/.../SeaTunnelRowDebeziumDeserializeSchema.java:99]
-> deserializeDataChangeRecord()
[SeaTunnelRowDebeziumDeserializeSchema.java:204]
-> SourceRecordUtils.getFetchTimestamp()
[SourceRecordUtils.java:93]
-> SourceRecordUtils.getMessageTimestamp()
[SourceRecordUtils.java:74]
-> SourceRecordUtils.getBinlogFile/Pos/Row/getGtid()
[SourceRecordUtils.java:235]
-> MetadataUtil.set*() ->
SeaTunnelRow.getOptions().put(...) [MetadataUtil.java:49; SeaTunnelRow.java:77]
User config wants to output metadata as columns
-> MetadataTransform.getOutputColumns() checks
metadataSchema.contains(...) [MetadataTransform.java:129]
-> MetadataTransform.getOutputFieldValues() reads
inputRow.getOptions().get("BinlogPos"/...) [MetadataTransform.java:97-100]
```
#### 问题 1:`CommonOptions` 中间插入 enum 常量导致潜在二进制兼容性破坏
- 位置:
`seatunnel-api/src/main/java/org/apache/seatunnel/api/table/type/CommonOptions.java:59`
到
`seatunnel-api/src/main/java/org/apache/seatunnel/api/table/type/CommonOptions.java:98`
- 问题描述: This PR inserts new enum constants (`BINLOG_*`, `GTID`,
`SOURCE_TIMESTAMP`) **between** `DELAY` and `IS_COMPLETE`. Java enum `switch`
bytecode relies on ordinals via a synthetic mapping array. If any third-party
plugin (transform/connector) compiled against an older `seatunnel-api` uses
`switch(CommonOptions)`, the shifted ordinals can cause runtime failures (e.g.,
`ArrayIndexOutOfBoundsException`) when running with the newer `seatunnel-api`.
- 潜在风险: Hard runtime crash for external plugins (SPI ecosystem), violating
the “backward compatibility” constraint for public API types.
- 最佳改进建议:
- 方案 A(推荐): Append new constants **at the end** of `CommonOptions` to
preserve existing ordinals.
- 方案 B: Stop exposing these as enum constants for extension points;
introduce a separate class of stable `String` constants for metadata keys, and
keep `CommonOptions` ordering stable.
- 严重程度: 高
#### 问题 2:快照行(`startup.mode = initial`)的 BinlogPos/BinlogRow “应为
null”契约未被代码保证,且与文档表述不一致
- 位置:
- Extractors:
`seatunnel-connectors-v2/connector-cdc/connector-cdc-base/src/main/java/org/apache/seatunnel/connectors/cdc/base/utils/SourceRecordUtils.java:255`
到
`seatunnel-connectors-v2/connector-cdc/connector-cdc-base/src/main/java/org/apache/seatunnel/connectors/cdc/base/utils/SourceRecordUtils.java:281`
- Population:
`seatunnel-connectors-v2/connector-cdc/connector-cdc-base/src/main/java/org/apache/seatunnel/connectors/cdc/debezium/row/SeaTunnelRowDebeziumDeserializeSchema.java:227`
到
`seatunnel-connectors-v2/connector-cdc/connector-cdc-base/src/main/java/org/apache/seatunnel/connectors/cdc/debezium/row/SeaTunnelRowDebeziumDeserializeSchema.java:279`
- Docs: `docs/en/transforms/metadata.md:30` 到
`docs/en/transforms/metadata.md:41` and `docs/zh/transforms/metadata.md:29` 到
`docs/zh/transforms/metadata.md:41`
- Test assumption:
`seatunnel-connectors-v2/connector-cdc/connector-cdc-base/src/test/java/org/apache/seatunnel/connectors/cdc/base/utils/SourceRecordUtilsTest.java:110`
到
`seatunnel-connectors-v2/connector-cdc/connector-cdc-base/src/test/java/org/apache/seatunnel/connectors/cdc/base/utils/SourceRecordUtilsTest.java:123`
- 问题描述:
- Runtime flow: Snapshot rows enter `deserializeDataChangeRecord()` as
`Envelope.Operation.READ`
(`SeaTunnelRowDebeziumDeserializeSchema.java:232-244`), so they will also have
binlog metadata extracted and written into options.
- Contract mismatch: Docs claim snapshot rows return `null` for **all
four** fields (`BinlogFile`, `BinlogPos`, `BinlogRow`, `Gtid`)
(`docs/en/transforms/metadata.md:41`). Code only normalizes empty string for
`file`/`gtid` (`SourceRecordUtils.java:244-248`,
`SourceRecordUtils.java:297-301`). `pos` and `row` are returned as-is when the
schema field exists (`SourceRecordUtils.java:255-265`,
`SourceRecordUtils.java:271-281`), meaning snapshot rows can surface `0` or
non-null offsets depending on Debezium behavior.
- 潜在风险:
- Users may filter snapshot rows by `null` expectation; receiving
`0`/non-null breaks correctness.
- Downstream ordering / idempotency logic may treat snapshot offsets as
valid binlog coordinates.
- Documentation becomes unreliable (support burden).
- 最佳改进建议:
- 方案 A(推荐): Detect snapshot explicitly (e.g., check
`source.schema().field("snapshot")` and its value) and force **all**
binlog/gtid fields to `null` when snapshot; alternatively, if `file` is
empty/null then also null out `pos` and `row`.
- 方案 B: If Debezium snapshot events legitimately carry `file/pos`, update
docs to reflect actual behavior and add a separate metadata key such as
`IsSnapshot` to let users branch reliably.
- 严重程度: 高
#### 问题 3:`SourceTimestamp`/`Delay` 的“CDC connectors”范围表述可能过宽(尤其非 Debezium
CDC / 非 DEFAULT format)
- 位置:
- Javadoc:
`seatunnel-api/src/main/java/org/apache/seatunnel/api/table/type/CommonOptions.java:83`
到
`seatunnel-api/src/main/java/org/apache/seatunnel/api/table/type/CommonOptions.java:88`
- Docs: `docs/en/transforms/metadata.md:27` 到
`docs/en/transforms/metadata.md:41` and `docs/zh/transforms/metadata.md:27` 到
`docs/zh/transforms/metadata.md:41`
- Metadata schema is only attached for DEFAULT format:
`seatunnel-connectors-v2/connector-cdc/connector-cdc-base/src/main/java/org/apache/seatunnel/connectors/cdc/base/source/IncrementalSource.java:143`
到
`seatunnel-connectors-v2/connector-cdc/connector-cdc-base/src/main/java/org/apache/seatunnel/connectors/cdc/base/source/IncrementalSource.java:148`
- 问题描述:
- Runtime flow: `MetadataTransform` first checks key validity via
`MetadataUtil.isMetadataField()` (global `CommonOptions` list), then requires
`inputCatalogTable.getMetadataSchema().contains(key)`
(`MetadataTransform.java:61-63`, `MetadataTransform.java:129-139`).
- Docs/Javadoc say `SourceTimestamp` is available for “CDC connectors” /
“all CDC connectors”, but implementation is currently in Debezium-based CDC
default deserialization only. Other CDC implementations (e.g., TiDB CDC) and
Debezium JSON format may not populate or even declare this metadata, leading to
runtime failures or always-null outputs.
- 潜在风险: Misconfiguration leading to transform init failure (schema contains
check), inconsistent user experience across CDC families.
- 最佳改进建议:
- Clarify docs/Javadoc scope to “Debezium-based CDC connectors (DEFAULT
format)”.
- Or implement equivalent fields for other CDC connectors to match docs
(larger scope).
- 严重程度: 中
#### 问题 4:测试未覆盖“快照 null 契约(pos/row)”与端到端写入 `SeaTunnelRow.options` 的链路
- 位置:
- Snapshot test only asserts file is null:
`seatunnel-connectors-v2/connector-cdc/connector-cdc-base/src/test/java/org/apache/seatunnel/connectors/cdc/base/utils/SourceRecordUtilsTest.java:110`
到
`seatunnel-connectors-v2/connector-cdc/connector-cdc-base/src/test/java/org/apache/seatunnel/connectors/cdc/base/utils/SourceRecordUtilsTest.java:123`
- No test for
`SeaTunnelRowDebeziumDeserializeSchema.deserializeDataChangeRecord()` wiring
(no dedicated test file).
- 问题描述:
- Runtime flow requires: extractor -> deserializer -> MetadataUtil ->
`SeaTunnelRow.options` -> MetadataTransform. Current unit tests only validate
helper extraction under synthetic schemas and do not lock down the user-facing
behavior described in docs.
- 潜在风险: Regressions slip through (e.g., metadata not set, schema not
declared, snapshot behavior drifting), docs contract not enforced.
- 最佳改进建议:
- Add tests asserting `BinlogPos`/`BinlogRow` behavior for snapshot rows
(either `null` or documented alternative).
- Add a unit test that builds a minimal Debezium envelope and verifies
`SeaTunnelRowDebeziumDeserializeSchema` writes `SourceTimestamp/Binlog*/Gtid`
into options for CREATE/UPDATE/READ.
- 严重程度: 中
#### 问题 5:不必要的 per-row 开销(对非 MySQL CDC 仍写入一堆 null metadata)
- 位置:
- Extraction + set:
`seatunnel-connectors-v2/connector-cdc/connector-cdc-base/src/main/java/org/apache/seatunnel/connectors/cdc/debezium/row/SeaTunnelRowDebeziumDeserializeSchema.java:227`
到
`seatunnel-connectors-v2/connector-cdc/connector-cdc-base/src/main/java/org/apache/seatunnel/connectors/cdc/debezium/row/SeaTunnelRowDebeziumDeserializeSchema.java:279`
- Always-put setters:
`seatunnel-api/src/main/java/org/apache/seatunnel/api/table/type/MetadataUtil.java:61`
到
`seatunnel-api/src/main/java/org/apache/seatunnel/api/table/type/MetadataUtil.java:79`
- 问题描述:
- Runtime flow: every emitted row gets `put(key, value)` even when `value
== null`. For Postgres/Oracle CDC, binlog/gtid fields are expected to be always
null, but the options map still grows and receives extra puts per row.
- 潜在风险: Extra CPU + memory + GC overhead on high-throughput CDC jobs;
unnecessary payload growth if options are serialized in some execution modes.
- 最佳改进建议:
- Only call `set*` when value is non-null (missing key and `null` are
equivalent for `Map#get`).
- Consider moving binlog/gtid extraction behind MySQL-specific code path.
- 严重程度: 低
## 1.2 兼容性影响
- **结论**: `部分不兼容`(主要是二进制兼容性风险,见问题 1);其余变更为“新增能力”,对配置项/默认值无破坏性变更。
- API / SPI:
- New metadata keys are additive, but inserting enum constants in the
middle can break binary compatibility for external plugins
(`CommonOptions.java:59-98`) — **must treat as a compatibility risk**.
- 配置与协议:
- No config keys renamed/removed; metadata keys are new and optional.
- Checkpoint / state / serialization:
- No explicit protocol/serialization format changes are introduced by this
PR; metadata is carried in `SeaTunnelRow.options` via in-memory map.
## 1.3 性能 / 副作用分析
- Per-row overhead increases: extra extractor calls + extra `Map.put`
operations (`SeaTunnelRowDebeziumDeserializeSchema.java:227-278`), including
null values for non-MySQL CDC (问题 5).
- If a job never uses these metadata fields, it still pays cost today (no
opt-in gating).
- Recommendation: at minimum skip putting null keys; consider MySQL-only
gating if performance is a priority.
## 1.4 错误处理与日志
- Positive: `getMessageTimestamp` now avoids a potential NPE by checking
`source == null` (`SourceRecordUtils.java:81-84`), improving robustness.
- No sensitive data is logged by this PR.
- No additional error-handling paths were introduced; extractor methods
fail-safe to `null` when schema fields are absent
(`SourceRecordUtils.java:235-302`).
# 二、代码质量评估
## 2.1 代码规范
- Overall style matches existing code (Spotless formatting looks consistent
in diffs).
- New helper methods are small and readable.
- Minor clarity: `MetadataUtil.setEventTime(SeaTunnelRow row, Long delay)`
naming is confusing but pre-existing (`MetadataUtil.java:57-59`).
## 2.2 测试覆盖
- Added unit test file `SourceRecordUtilsTest` verifies binlog
file/pos/row/gtid extractors under synthetic schemas.
- Missing coverage for snapshot contract on `pos/row` and missing end-to-end
test of deserializer wiring (问题 4).
#### 本地验证结果
| Command | Result | Notes |
|---|---|---|
| `git fetch origin pull/10667/head:pr-10667` | PASS | PR fetched to local
branch `pr-10667` |
| `git switch pr-10667` | PASS | Switched working tree to PR branch |
| `git merge-tree $(git merge-base origin/dev pr-10667) origin/dev pr-10667`
| PASS | No conflict markers found; no merge conflicts detected vs `origin/dev`
|
| `./mvnw spotless:apply` | FAIL | `java.io.IOException: Operation not
permitted` (also fails creating Jansi temp lock under `/var/tmp/...`) |
| `./mvnw -q -DskipTests verify` | FAIL | `maven-remote-resources-plugin`
cannot create `target/.plxarc: Operation not permitted` |
| `./mvnw test` | FAIL | Same `target/.plxarc: Operation not permitted`
filesystem restriction |
> Important: The sandbox environment prevents creating files under `target/`
and temp dirs, so Maven verification cannot be completed here. This is **not**
proof the PR fails CI; it is a limitation of the local execution environment.
## 2.3 文档更新
- Good: EN/ZH docs updated consistently with the same new keys
(`docs/en/transforms/metadata.md:29-34`,
`docs/zh/transforms/metadata.md:29-33`).
- Risk: Some statements are likely too strong / not enforced by code
(snapshot “null for all four fields”, and “CDC connectors” scope) — see 问题 2 /
问题 3.
# 三、架构合理性
## 3.1 解决方案的优雅性
- This is a **precise, incremental extension** of the existing metadata
mechanism (SeaTunnelRow options + MetadataTransform).
- However, it mixes MySQL-specific metadata into a Debezium CDC base path
(schema declares them broadly; values become connector-dependent).
## 3.2 可维护性
- Centralizing metadata keys in `CommonOptions` and setters in
`MetadataUtil` improves consistency.
- Extractors in `SourceRecordUtils` are simple, but the snapshot semantics
are not explicitly modeled (问题 2), which may become a maintenance hotspot.
## 3.3 扩展性
- This pattern can be extended to other connectors’ offsets (e.g., Postgres
LSN, Oracle SCN), but current naming is MySQL-centric.
- Consider a future design where connector-specific metadata is
declared/published per connector rather than in a shared base (avoid “advertise
but always-null” behavior).
## 3.4 历史版本兼容性
- **Not fully proven compatible** due to enum ordinal change risk (问题 1).
- For running jobs upgrading with external plugins: must avoid breaking
binary compatibility at the API level.
# 四、问题汇总
| 序号 | 问题 | 位置 | 严重程度 |
|---:|---|---|---|
| 1 | Enum 常量中间插入导致潜在二进制兼容性破坏 | `seatunnel-api/.../CommonOptions.java:59` |
高 |
| 2 | 快照行 BinlogPos/BinlogRow “应为 null”契约未被代码保证且文档不一致 |
`SourceRecordUtils.java:255`, `docs/en/transforms/metadata.md:41` | 高 |
| 3 | 文档/Javadoc 对“CDC connectors”范围表述过宽(非 Debezium CDC / 非 DEFAULT format)
| `CommonOptions.java:83`, `IncrementalSource.java:143`,
`docs/en/transforms/metadata.md:39` | 中 |
| 4 | 测试未覆盖快照 pos/row 与端到端 options 写入链路 | `SourceRecordUtilsTest.java:110` |
中 |
| 5 | 对非 MySQL CDC 仍写入大量 null metadata 带来不必要开销 |
`SeaTunnelRowDebeziumDeserializeSchema.java:227` | 低 |
# 五、是否可以 Merge 的结论
### 结论:修复后可以 merge
阻塞项(必须修复)
- 问题 1(高): Enum 常量中间插入导致二进制兼容性风险
- 潜在风险: Third-party plugins compiled against older `seatunnel-api` may
crash at runtime due to enum ordinal shift.
- 最佳改进建议:
- 方案 A(推荐): Move new constants to the end of `CommonOptions` to preserve
existing ordinals.
- 方案 B: Introduce stable string constants for metadata keys and keep the
enum ordering immutable.
- 问题 2(高): Snapshot rows contract mismatch for `BinlogPos/BinlogRow` (and
potentially `BinlogFile/Gtid`)
- 潜在风险: Users relying on `null` to identify snapshot rows may get
`0`/non-null values; correctness issues and docs unreliability.
- 最佳改进建议:
- 方案 A(推荐): Implement explicit snapshot detection (`source.snapshot`)
and enforce null for all four fields on snapshot rows.
- 方案 B: If Debezium snapshot rows legitimately have offsets, update docs
+ add `IsSnapshot` metadata key and clearly document expected behavior.
建议修复(非阻塞)
- 问题 3(中): Tighten docs/Javadoc scope to “Debezium-based CDC (DEFAULT
format)” to prevent misconfiguration for other CDC connectors.
- 问题 4(中): Add tests covering snapshot pos/row behavior and deserializer
wiring.
- 问题 5(低): Skip writing null metadata keys to reduce per-row overhead;
optionally gate MySQL-only fields.
整体评价
- Direction is correct and aligns with SeaTunnel’s existing metadata
architecture (options + Metadata transform). The implementation is
straightforward and readable.
- Merge risk is mainly **compatibility (enum ordinals)** and **user-facing
contract accuracy (snapshot behavior)**. Once those are addressed and CI
passes, this should be a solid, user-visible enhancement rather than a
temporary workaround.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]