JNSimba opened a new pull request, #64850:
URL: https://github.com/apache/doris/pull/64850
## Proposed changes
### Problem
For the PostgreSQL streaming job (from-to / at-least-once path), schema-change
(ADD/DROP column) detection was done **per DML record**: every change row's
Kafka Connect "after" schema was diffed against the stored schema, and on any
divergence a JDBC round-trip fetched the fresh PG schema to obtain accurate
column types.
This has two drawbacks:
- a name diff runs on the hot path for every DML record;
- accurate column types require an out-of-band JDBC fetch on every detected
  change.
### What this PR does
Switch PG schema-change detection to be **event-driven**, sourced from pgoutput
Relation messages (surfaced as schema-change records on the stream). On each
Relation event, the full post-change table schema it carries is diffed against
the stored baseline (the Doris table's current schema, loaded from FE) to derive
ADD/DROP column DDL. The accurate column type, nullability, and default come
from the Relation-carried schema, so both the per-record diff and the JDBC fetch
are removed.
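
As a rough illustration of the diff described above, the following sketch compares a baseline column map with the column map carried by a Relation event and derives ADD/DROP statements. It is not the PR's code: the class `RelationSchemaDiffSketch`, the `schema` helper, the name-to-type map representation, and the exact DDL text are all assumptions made for the example, and the guards listed under "Behavior preserved" are deliberately left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not the PR's code: diff a Relation-carried schema against a baseline. */
final class RelationSchemaDiffSketch {

    /** Builds an ordered column-name -> type map from alternating name/type arguments. */
    static Map<String, String> schema(String... nameTypePairs) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (int i = 0; i + 1 < nameTypePairs.length; i += 2) {
            columns.put(nameTypePairs[i], nameTypePairs[i + 1]);
        }
        return columns;
    }

    /** Returns illustrative ADD/DROP statements; the DDL text shape is an assumption. */
    static List<String> diff(String table,
                             Map<String, String> baseline,
                             Map<String, String> relation) {
        List<String> ddl = new ArrayList<>();
        // Columns carried by the Relation event but missing from the baseline were added.
        for (Map.Entry<String, String> e : relation.entrySet()) {
            if (!baseline.containsKey(e.getKey())) {
                ddl.add("ALTER TABLE " + table + " ADD COLUMN " + e.getKey() + " " + e.getValue());
            }
        }
        // Columns in the baseline but absent from the Relation event were dropped.
        for (String name : baseline.keySet()) {
            if (!relation.containsKey(name)) {
                ddl.add("ALTER TABLE " + table + " DROP COLUMN " + name);
            }
        }
        return ddl;
    }
}
```

Because the types come from the event itself in this sketch, no second lookup is needed to build the ADD statement, which is the point of the change.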
Behavior preserved:
- **Baseline** is established by the table's first Relation event (covers
  streams that start directly from an offset without a snapshot); no DDL is
  emitted for it.
- **Rename guard**: a simultaneous ADD+DROP is treated as a possible RENAME and
  no DDL is emitted, to avoid data loss; the column must be renamed manually in
  Doris.
- **Excluded columns** are skipped for both ADD and DROP.
- **NOT NULL without a usable default** is added as NULLABLE (incoming DML still
  carries the real values).
- The DDL is applied only on the from-to write path (unchanged; TVF mode does
  not apply schema-change records).
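
The guards in the list above can be sketched as a small decision function over column names. This is a hypothetical illustration only: `SchemaChangeGuardsSketch`, `Plan`, and the `reason` strings are invented for the example, and applying the excluded-column filter before the rename guard is an assumption the PR description does not confirm.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the guards; names and ordering are assumptions, not the PR's code. */
final class SchemaChangeGuardsSketch {

    /** Outcome of one Relation event: which columns to add/drop, and why. */
    static final class Plan {
        final List<String> adds;
        final List<String> drops;
        final String reason;

        Plan(List<String> adds, List<String> drops, String reason) {
            this.adds = adds;
            this.drops = drops;
            this.reason = reason;
        }

        static Plan none(String reason) {
            return new Plan(Collections.<String>emptyList(), Collections.<String>emptyList(), reason);
        }
    }

    /** baseline == null models "no Relation event seen yet for this table". */
    static Plan plan(Set<String> baseline, Set<String> relation, Set<String> excluded) {
        if (baseline == null) {
            return Plan.none("baseline-established"); // first event: record schema, emit no DDL
        }
        List<String> adds = new ArrayList<>();
        for (String c : relation) {
            if (!baseline.contains(c) && !excluded.contains(c)) adds.add(c);
        }
        List<String> drops = new ArrayList<>();
        for (String c : baseline) {
            if (!relation.contains(c) && !excluded.contains(c)) drops.add(c);
        }
        if (!adds.isEmpty() && !drops.isEmpty()) {
            return Plan.none("possible-rename"); // rename guard: emit nothing
        }
        return new Plan(adds, drops, adds.isEmpty() && drops.isEmpty() ? "no-op" : "apply");
    }

    /** A NOT NULL column without a usable default is added as nullable. */
    static boolean addAsNullable(boolean pgNotNull, String usableDefault) {
        return !pgNotNull || usableDefault == null;
    }
}
```

With this shape, a repeated Relation event that matches the baseline yields `no-op`, which corresponds to the idempotency case mentioned in the tests.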
Default-value handling (`stripPgDefault`) is best-effort: string / numeric /
boolean literals and `now()/current_timestamp/localtimestamp` are mapped; any
other expression degrades to no static default rather than emitting a wrong
DEFAULT clause.
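
To make the best-effort behaviour concrete, here is a minimal sketch of such a mapping. It is not the PR's `stripPgDefault`: the class `PgDefaultSketch`, the method `toStaticDefault`, the cast-suffix pattern, and the choice to map the three time keywords to `CURRENT_TIMESTAMP` are assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical sketch of best-effort PG default mapping; not the PR's stripPgDefault. */
final class PgDefaultSketch {
    private static final Pattern NUMERIC = Pattern.compile("-?\\d+(\\.\\d+)?");
    // A plain cast suffix such as ::text, ::character varying(20), ::numeric(10,2).
    private static final Pattern CAST = Pattern.compile("::[A-Za-z_][\\w ]*(\\(\\d+(,\\s*\\d+)?\\))?");

    /** Returns a static default, or empty when the expression is not safely representable. */
    static Optional<String> toStaticDefault(String expr) {
        if (expr == null) return Optional.empty();
        String s = stripOuterParens(expr.trim());
        if (s.isEmpty()) return Optional.empty();
        if (s.charAt(0) == '\'') {
            int end = closingQuote(s);
            if (end < 0) return Optional.empty();
            String rest = s.substring(end + 1).trim();
            // Accept only a bare literal or a literal followed by a single cast.
            boolean literalOnly = rest.isEmpty() || CAST.matcher(rest).matches();
            return literalOnly ? Optional.of(s.substring(0, end + 1)) : Optional.empty();
        }
        int cast = s.indexOf("::");
        if (cast >= 0) s = s.substring(0, cast).trim();
        if (NUMERIC.matcher(s).matches()) return Optional.of(s);
        String lower = s.toLowerCase(Locale.ROOT);
        if (lower.equals("true") || lower.equals("false")) return Optional.of(lower);
        if (lower.equals("now()") || lower.equals("current_timestamp") || lower.equals("localtimestamp")) {
            return Optional.of("CURRENT_TIMESTAMP"); // assumed target spelling
        }
        return Optional.empty(); // any other expression degrades to "no static default"
    }

    /** Removes parentheses that wrap the whole expression, e.g. ('a'::text) or (-1). */
    private static String stripOuterParens(String s) {
        while (s.length() >= 2 && s.charAt(0) == '(' && matchingParen(s) == s.length() - 1) {
            s = s.substring(1, s.length() - 1).trim();
        }
        return s;
    }

    /** Index of the ')' closing the '(' at index 0, ignoring parens inside quotes; -1 if none. */
    private static int matchingParen(String s) {
        int depth = 0;
        boolean inQuote = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\'') {
                inQuote = !inQuote; // a doubled '' toggles twice, leaving the state unchanged
            } else if (!inQuote) {
                if (c == '(') depth++;
                else if (c == ')' && --depth == 0) return i;
            }
        }
        return -1;
    }

    /** Index of the quote closing a literal that starts at index 0; '' is an escaped quote. */
    private static int closingQuote(String s) {
        int i = 1;
        while (i < s.length()) {
            if (s.charAt(i) == '\'') {
                if (i + 1 < s.length() && s.charAt(i + 1) == '\'') i += 2;
                else return i;
            } else {
                i++;
            }
        }
        return -1;
    }
}
```

In this sketch a sequence default such as `nextval('t_id_seq'::regclass)` falls through every branch and yields an empty result, so no DEFAULT clause would be emitted for it.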
Changes:
- `PostgresDebeziumJsonDeserializer`: add the event-driven
  `handleSchemaChangeEvent`, replacing the per-DML diff and the JDBC schema
  refresher.
- `JdbcIncrementalSourceReader`: pass schema-change records through to the
  deserializer without advancing the offset.
- `PostgresSourceReader`: enable schema-change records on the source; drop the
  now-unused JDBC schema-refresher injection.
- `SchemaChangeHelper`: remove the now-unused name-only diff helper.
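
The pass-through described for `JdbcIncrementalSourceReader` can be pictured with the sketch below. It is purely illustrative: `ReaderLoopSketch`, `Rec`, and the field names are hypothetical and do not reflect the real reader's types or offset representation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: schema-change records reach the deserializer but never move the offset. */
final class ReaderLoopSketch {

    /** Minimal stand-in for a stream record. */
    static final class Rec {
        final boolean schemaChange;
        final long offset;
        final String payload;

        Rec(boolean schemaChange, long offset, String payload) {
            this.schemaChange = schemaChange;
            this.offset = offset;
            this.payload = payload;
        }
    }

    long currentOffset = -1;                              // last offset recorded for a data record
    final List<String> rows = new ArrayList<>();          // data records handed downstream
    final List<String> schemaEvents = new ArrayList<>();  // stand-in for the deserializer's handler

    void process(Rec r) {
        if (r.schemaChange) {
            // Forward to the schema-change handler and leave currentOffset untouched.
            schemaEvents.add(r.payload);
            return;
        }
        rows.add(r.payload);
        currentOffset = r.offset; // only data records advance the offset
    }
}
```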
### Tests
- Unit tests cover: baseline establishment, no-op idempotency, ADD, DROP, the
  ADD+DROP rename guard, excluded-column ADD/DROP skipping, and default-value
  parsing (parenthesised/`::`-containing string literals, unrecognized-keyword
  downgrade).
- End-to-end ADD/DROP/DEFAULT/NOT NULL regression suites for the PG streaming
  job.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]