leonardBang commented on PR #4233: URL: https://github.com/apache/flink-cdc/pull/4233#issuecomment-4053526825
> I have reviewd this PR(by @linjianchang )and #4259 by @zml1206 . Thanks for both hard work! > > I have also done some research on the Debezium codebase and PostgreSQL replication, and I have another design idea that prefers to leverage PostgreSQL’s native capabilities more fully. So I’d like to discuss it with you. > > ### PostgreSQL Pgoutput's relation message in logical replication > As document of PostgreSQL 16 says: [Section 55.5.3 - Logical Replication Protocol Message Flow](https://www.postgresql.org/docs/16/protocol-logical-replication.html): > > > "Every DML message contains a relation OID, identifying the publisher's relation that was acted on. Before the first DML message for a given relation OID, a Relation message will be sent, describing the schema of that relation. Subsequently, a new Relation message will be sent if the relation's definition has changed since the last Relation message was sent for it." > > When DDL changes are executed in PostgreSQL, no corresponding logs are generated. However, if the pgoutput sender is about to send the first DML message for a new schema, it will first send a Relation message. > > The relation message will include the schema of table. > > ### how Debezium use it? > Depending on The decoder plug-in, schema updates take two completely different paths: ● pgoutput: Sends a correlation message before each DML event,You can use the `applySchemaChangesForTable` to actively update schema in advance. `shouldSchemaBeSynchronized()` returns false, so `synchronizeTableSchema() `is an empty operation for DML events. ● decoderbufs: If no RELATION message is sent, `shouldSchemaBeSynchronized()` returns true (the default value). The schema is synchronized by comparing the message column with the in-memory schema in a reactive manner. Thus, if it is pgoutput, we can send schema change events on demand without comparing each message? > > ### What cdc need to do? > Therefore, my personal opinion: > > 1. Extend the PostgresSchema and pass in a dispatcher. When a correlation message is received,Put the schema in the event queue as a special message > 2. When Postgres RecordEmitter receives the schema event, it: > 3. Update schemas in split,For persistence (in the current master branch, since pg cdc does not have a schema change event, the information in schemas will never be updated) > 4. Compare table changes and send scheme ddl in yaml. > > <img alt="image" width="2000" height="1254" src="https://private-user-images.githubusercontent.com/125648852/562848516-89ed012e-867f-46d1-be43-2ce54d464220.png?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzMzODk4MjAsIm5iZiI6MTc3MzM4OTUyMCwicGF0aCI6Ii8xMjU2NDg4NTIvNTYyODQ4NTE2LTg5ZWQwMTJlLTg2N2YtNDZkMS1iZTQzLTJjZTU0ZDQ2NDIyMC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzEzJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMxM1QwODEyMDBaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT03N2QxZWJkYTVkMDM2NWZmNGU3ZjViZjdhMTMzNjk3YTJiOGYwYWJkMGVjYjA3MDAwMjU3NTZmN2VkMjllZmFlJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.fgGE6KtPORW-7mBf85X5InwsuHuBH-u7Xv424azBpxE"> > In this way, we can avoid intrusive changes to PostgreSQL through triggers, and there is also no need to compare schemas for every message. More importantly, it updates the schemas stored in the split, so schema consistency can still be guaranteed after state recovery or restart. > > Though only useful for pgoutput, but I think it's enough. Because, the pgoutput is the only official replication plugin. > > @linjianchang @zml1206 @leonardBang , WDYT? > > here is my poc code: https://github.com/loserwang1024/flink-cdc-connectors/tree/poc-schema-change-event Thanks @loserwang1024 for the detailed investigation!Here are my opinions: (1) Querying the PostgreSQL database table for every CDC event is not feasible — this approach will directly overwhelm and crash the business database once the data volume grows even slightly. (2) The trigger-based solution is also unacceptable, as it directly intrudes into the business write workflow on the database. Downstream CDC subscribers may not have the permission to implement this, and requiring all business teams to accept such triggers is not realistic. The approach of subscribing to the latest table schema based on OID in real-time is conceptually sound. It is non-intrusive to the business database, avoids frequent polling of the upstream DB as traffic increases, and ensures the ordering between schema events and data events. Debezium also uses this mechanism — +1 to implementing our solution based on this approach. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
