YousifS7 opened a new issue, #11662:
URL: https://github.com/apache/hudi/issues/11662

   Dear Hudi Team,
   
   We have a data infrastructure in which SQL Server CDC-enabled tables are ingested by Debezium into compacted Kafka topics.
   We recently started experimenting with `org.apache.hudi.utilities.streamer.HoodieStreamer` and were able to build a pipeline that uses `HoodieStreamer` to ingest the CDC data from Kafka and write it incrementally into Apache Hudi tables.
   
   We discovered that there is currently no `DebeziumSource` or `DebeziumAvroPayload` implementation available for SQL Server-based Debezium sources. While we recognize that it is possible to develop custom implementations of these classes, we have instead explored an alternative approach using existing classes and functionality.
   We would like to confirm whether this alternative approach is valid and whether it will effectively maintain data integrity and consistency.
   
   **Setup Details:**
   
   As a substitute for a SQL Server-based `DebeziumSource` and `DebeziumAvroPayload`, we used the SQL file-based transformer (`org.apache.hudi.utilities.transform.SqlFileBasedTransformer`).
   
   > Spark-Submit
   ```
   spark-submit \
     --packages org.apache.hudi:hudi-spark3.5-bundle_2.12:0.15.0,org.apache.hudi:hudi-aws-bundle:0.15.0 \
     --conf spark.streaming.kafka.allowNonConsecutiveOffsets=true \
     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
     --class org.apache.hudi.utilities.streamer.HoodieStreamer \
     s3://some_bucket/libs/hudi-utilities-slim-bundle.jar \
     --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
     --props s3://some_bucket/configs/some_table_name.properties \
     --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
     --enable-sync \
     --source-ordering-field ts_ms \
     --target-base-path s3://some_bucket/some_folder/some_table_name \
     --target-table some_table_name \
     --table-type COPY_ON_WRITE \
     --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool \
     --op UPSERT \
     --transformer-class org.apache.hudi.utilities.transform.SqlFileBasedTransformer
   ```
   
   > Properties File
   ```
   hoodie.streamer.transformer.sql.file=s3://some_bucket/configs/some_table_name.sql
   hoodie.streamer.schemaprovider.registry.url=http://localhost:8081/subjects/test.dbo.some_table_name-value/versions/latest
   hoodie.streamer.schemaprovider.registry.targetUrl=http://localhost:8081/subjects/sometargettable/versions/latest
   schema.registry.url=http://localhost:8081
   hoodie.streamer.source.kafka.topic=test.dbo.some_table_name
   bootstrap.servers=localhost:9092
   auto.offset.reset=earliest
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
   hoodie.datasource.write.partitionpath.field=partition_path
   hoodie.datasource.write.recordkey.field=hudi_key
   hoodie.datasource.write.precombine.field=ts_ms
   hoodie.datasource.hive_sync.database=test_db
   ```
   
   > SQL Transformer File
   ```
   CACHE TABLE dbz_filtered AS
   SELECT ts_ms, op, before, after FROM <SRC> WHERE op IN ('d', 'u', 'c', 'r');

   CACHE TABLE dbz_events AS
   SELECT ts_ms,
          CASE WHEN op = 'd' THEN before ELSE after END AS source_fields,
          CASE WHEN op = 'd' THEN true ELSE false END AS is_deleted
   FROM dbz_filtered;

   CACHE TABLE dbz_fields AS
   SELECT ts_ms, source_fields.* FROM dbz_events;

   SELECT s.*,
          CONCAT(s.col1, s.col2) AS hudi_key,
          YEAR(FROM_UNIXTIME(s.col2 / 1000)) AS partition_path
   FROM dbz_fields s;
   ```
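
   If we eventually want the Debezium `'d'` events to result in deletes on the Hudi table, the `is_deleted` flag computed in `dbz_events` would need to reach the writer, since it is not carried into the final projection above. A possible variant of the last two statements (a sketch only, assuming Hudi's `_hoodie_is_deleted` convention applies here and that the target schema registered under `sometargettable` contains that boolean column) could be:
   ```
   -- Sketch only: surfaces the Debezium delete flag under Hudi's
   -- _hoodie_is_deleted convention; verify against your Hudi version and
   -- make sure the target schema includes this boolean field.
   CACHE TABLE dbz_fields AS
   SELECT ts_ms, is_deleted AS _hoodie_is_deleted, source_fields.* FROM dbz_events;

   SELECT s.*,
          CONCAT(s.col1, s.col2) AS hudi_key,
          YEAR(FROM_UNIXTIME(s.col2 / 1000)) AS partition_path
   FROM dbz_fields s;
   ```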
   
   **Environment Description**
   
   * Hudi version : 0.15
   
   * Spark version : 3.5.0
   
   * Hive version : N/A
   
   * Hadoop version : N/A
   
   * Storage : S3
   
   * Running on Docker? : No
   
   Based on the provided information, the source class being utilized is `org.apache.hudi.utilities.sources.AvroKafkaSource`. By default, the payload class associated with this configuration is `org.apache.hudi.common.model.OverwriteWithLatestAvroPayload`.
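
   If relying on that default is not advisable, we assume the payload class could also be pinned explicitly in the properties file (sketch only; the key below is our understanding of the standard Hudi write config and should be verified against 0.15):
   ```
   # Sketch only: pin the payload class explicitly instead of relying on the default.
   hoodie.datasource.write.payload.class=org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
   ```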
   
   With the current configuration, the job appears to be executing as expected. However, we would like confirmation that this implementation uses `HoodieStreamer` effectively for the intended use case: ingesting CDC (Change Data Capture) changes from SQL Server via Kafka into Apache Hudi tables while maintaining data integrity and consistency.
   
   Thank you
   
   

