leonardBang commented on PR #4233:
URL: https://github.com/apache/flink-cdc/pull/4233#issuecomment-4053526825

   > I have reviewd this PR(by @linjianchang )and #4259 by @zml1206 . Thanks 
for both hard work!
   > 
   > I have also done some research on the Debezium codebase and PostgreSQL 
replication, and I have another design idea that prefers to leverage 
PostgreSQL’s native capabilities more fully. So I’d like to discuss it with you.
   > 
   > ### PostgreSQL Pgoutput's relation message in logical replication
   > As document of PostgreSQL 16 says: [Section 55.5.3 - Logical Replication 
Protocol Message 
Flow](https://www.postgresql.org/docs/16/protocol-logical-replication.html):
   > 
   > > "Every DML message contains a relation OID, identifying the publisher's 
relation that was acted on. Before the first DML message for a given relation 
OID, a Relation message will be sent, describing the schema of that relation. 
Subsequently, a new Relation message will be sent if the relation's definition 
has changed since the last Relation message was sent for it."
   > 
   > When DDL changes are executed in PostgreSQL, no corresponding logs are 
generated. However, if the pgoutput sender is about to send the first DML 
message for a new schema, it will first send a Relation message.
   > 
   > The relation message will include the schema of table.
   > 
   > ### how Debezium use it?
   > Depending on The decoder plug-in, schema updates take two completely 
different paths: ● pgoutput: Sends a correlation message before each DML 
event,You can use the `applySchemaChangesForTable` to actively update schema in 
advance. `shouldSchemaBeSynchronized()` returns false, so 
`synchronizeTableSchema() `is an empty operation for DML events. ● decoderbufs: 
If no RELATION message is sent, `shouldSchemaBeSynchronized()` returns true 
(the default value). The schema is synchronized by comparing the message column 
with the in-memory schema in a reactive manner. Thus, if it is pgoutput, we can 
send schema change events on demand without comparing each message?
   > 
   > ### What cdc need to do?
   > Therefore, my personal opinion:
   > 
   > 1. Extend the PostgresSchema and pass in a dispatcher. When a correlation 
message is received,Put the schema in the event queue as a special message
   > 2. When Postgres RecordEmitter receives the schema event, it:
   > 3. Update schemas in split,For persistence (in the current master branch, 
since pg cdc does not have a schema change event, the information in schemas 
will never be updated)
   > 4. Compare table changes and send scheme ddl in yaml.
   > 
   > <img alt="image" width="2000" height="1254" 
src="https://private-user-images.githubusercontent.com/125648852/562848516-89ed012e-867f-46d1-be43-2ce54d464220.png?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzMzODk4MjAsIm5iZiI6MTc3MzM4OTUyMCwicGF0aCI6Ii8xMjU2NDg4NTIvNTYyODQ4NTE2LTg5ZWQwMTJlLTg2N2YtNDZkMS1iZTQzLTJjZTU0ZDQ2NDIyMC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzEzJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMxM1QwODEyMDBaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT03N2QxZWJkYTVkMDM2NWZmNGU3ZjViZjdhMTMzNjk3YTJiOGYwYWJkMGVjYjA3MDAwMjU3NTZmN2VkMjllZmFlJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.fgGE6KtPORW-7mBf85X5InwsuHuBH-u7Xv424azBpxE";>
   > In this way, we can avoid intrusive changes to PostgreSQL through 
triggers, and there is also no need to compare schemas for every message. More 
importantly, it updates the schemas stored in the split, so schema consistency 
can still be guaranteed after state recovery or restart.
   > 
   > Though only useful for pgoutput, but I think it's enough. Because, the 
pgoutput is the only official replication plugin.
   > 
   > @linjianchang @zml1206 @leonardBang , WDYT?
   > 
   > here is my poc code: 
https://github.com/loserwang1024/flink-cdc-connectors/tree/poc-schema-change-event
   
   Thanks @loserwang1024 for the detailed investigation!Here are my opinions: 
   
   (1) Querying the PostgreSQL database table for every CDC event is not 
feasible — this approach will directly overwhelm and crash the business 
database once the data volume grows even slightly.
   
   (2) The trigger-based solution is also unacceptable, as it directly intrudes 
into the business write workflow on the database. Downstream CDC subscribers 
may not have the permission to implement this, and requiring all business teams 
to accept such triggers is not realistic.
   
   The approach of subscribing to the latest table schema based on OID in 
real-time is conceptually sound. It is non-intrusive to the business database, 
avoids frequent polling of the upstream DB as traffic increases, and ensures 
the ordering between schema events and data events. Debezium also uses this 
mechanism — +1 to implementing our solution based on this approach.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to