Copilot commented on code in PR #4247:
URL: https://github.com/apache/flink-cdc/pull/4247#discussion_r2851806173
##########
docs/content/docs/connectors/flink-sources/mongodb-cdc.md:
##########
@@ -512,6 +512,63 @@ Applications can use change streams to subscribe to all data changes on a single
 By the way, Debezium's MongoDB change streams exploration mentioned by [DBZ-435](https://issues.redhat.com/browse/DBZ-435) is on roadmap.<br>
 If it's done, we can consider integrating two kinds of source connector for users to choose.
+### Scan Newly Added Tables
+
+**Note:** This feature is available since Flink CDC 3.1.0.
+
+The Scan Newly Added Tables feature enables you to add new collections to monitor for existing running pipeline. The newly added collections will read their snapshot data firstly and then read their change stream automatically.
+
+Imagine this scenario: At the beginning, a Flink job monitors collections `[product, user, address]`, but after some days we would like the job can also monitor collections `[order, custom]` which contain history data, and we need the job can still reuse existing state of the job. This feature can resolve this case gracefully.
+
+The following operations show how to enable this feature to resolve above scenario. An existing Flink job which uses MongoDB CDC Source like:
+
+```java
+    MongoDBSource<String> mongoSource = MongoDBSource.<String>builder()
+        .hosts("yourHostname:27017")
+        .databaseList("db") // set captured database
+        .collectionList("db.product", "db.user", "db.address") // set captured collections
+        .username("yourUsername")
+        .password("yourPassword")
+        .scanNewlyAddedTableEnabled(true) // enable scan the newly added tables feature
+        .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to JSON String
+        .build();
+    // your business code
+```
+
+If we would like to add new collections `[order, custom]` to an existing Flink job, we just need to update the `collectionList()` value of the job to include `[order, custom]` and restore the job from previous savepoint.
Review Comment:
   This example says new collections are `[order, custom]`, but the code uses fully-qualified names in `collectionList()` (e.g., `db.order`, `db.custom`). To avoid ambiguity, please update the text to use the same fully-qualified collection names.

##########
docs/content/docs/connectors/flink-sources/postgres-cdc.md:
##########
@@ -511,6 +511,71 @@ The config option `scan.startup.mode` specifies the startup mode for PostgreSQL
 - `committed-offset`: Skip snapshot phase and start reading events from a `confirmed_flush_lsn` offset of replication slot.
 - `snapshot`: Only the snapshot phase is performed and exits after the snapshot phase reading is completed.
+### Scan Newly Added Tables
+
+**Note:** This feature is available since Flink CDC 3.1.0.
+
+Scan Newly Added Tables feature enables you to add new tables to monitor for existing running pipeline. The newly added tables will read their snapshot data firstly and then read their WAL (Write-Ahead Log) or replication slot changes automatically.
+
+Imagine this scenario: At the beginning, a Flink job monitors tables `[product, user, address]`, but after some days we would like the job can also monitor tables `[order, custom]` which contain history data, and we need the job can still reuse existing state of the job. This feature can resolve this case gracefully.

Review Comment:
   This sentence is ungrammatical ("we would like the job can also monitor"). Please rephrase to something like "we would like the job to also monitor ..." to improve readability.
   ```suggestion
   Imagine this scenario: At the beginning, a Flink job monitors tables `[product, user, address]`, but after some days we would like the job to also monitor tables `[order, custom]` which contain historical data, and we need the job to still reuse the existing state of the job. This feature can resolve this case gracefully.
   ```

##########
docs/content/docs/connectors/flink-sources/mongodb-cdc.md:
##########
@@ -512,6 +512,63 @@ Applications can use change streams to subscribe to all data changes on a single
 By the way, Debezium's MongoDB change streams exploration mentioned by [DBZ-435](https://issues.redhat.com/browse/DBZ-435) is on roadmap.<br>
 If it's done, we can consider integrating two kinds of source connector for users to choose.
+### Scan Newly Added Tables

Review Comment:
   The section title says "Tables", but MongoDB terminology throughout this section is "collections". Consider renaming the heading (or using "Tables/Collections") to avoid confusing MongoDB users.
   ```suggestion
   ### Scan Newly Added Collections
   ```

##########
docs/content/docs/connectors/flink-sources/oracle-cdc.md:
##########
@@ -559,6 +559,67 @@ _Note: the mechanism of `scan.startup.mode` option relying on Debezium's `snapsh
 The Oracle CDC source can't work in parallel reading, because there is only one task can receive change events.
+### Scan Newly Added Tables
+
+**Note:** This feature is available since Flink CDC 3.1.0.
+
+Scan Newly Added Tables feature enables you to add new tables to monitor for an existing running pipeline. The newly added tables will read their snapshot data first and then read their redo log automatically.
+
+Imagine this scenario: At the beginning, a Flink job monitors tables `[product, user, address]`, but after some days we would like the job can also monitor tables `[order, custom]` which contain history data, and we need the job can still reuse existing state of the job. This feature can resolve this case gracefully.

Review Comment:
   This sentence is ungrammatical ("we would like the job can also monitor"). Please rephrase to something like "we would like the job to also monitor ...".
   ```suggestion
   Imagine this scenario: At the beginning, a Flink job monitors tables `[product, user, address]`, but after some days we would like the job to also monitor tables `[order, custom]` which contain historical data, and we need the job to still reuse the existing state. This feature can resolve this case gracefully.
   ```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
