BalaMahesh opened a new issue, #7733:
URL: https://github.com/apache/hudi/issues/7733

   **Describe the problem you faced**
   
   We have Postgres data arriving from the Debezium connector via Kafka. We are 
running Hudi in upsert mode on this dataset and have found around 12 records 
that each have two versions of the data for the same id, instead of the record 
being updated to the latest values and the old version being cleaned up.
   
   It is not yet clear to us how this happened, since the affected data comes 
from older commits.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Start the Postgres Debezium Kafka connector and publish data to Kafka.
   2. Run Hudi in upsert mode.
   3. We are not sure whether any crashes happened during those commits.
   4. Use the configurations below.
   
   
   hoodie.compaction.payload.class=org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload
   hoodie.table.type=MERGE_ON_READ
   hoodie.table.metadata.partitions=
   hoodie.table.precombine.field=_event_lsn
   hoodie.table.partition.fields=
   hoodie.archivelog.folder=archived
   hoodie.timeline.layout.version=1
   hoodie.table.checksum=4134192528
   hoodie.datasource.write.drop.partition.columns=false
   hoodie.table.recordkey.fields=id
   hoodie.partition.metafile.use.base.format=false
   hoodie.populate.meta.fields=true
   
   hoodie.table.keygenerator.class=org.apache.hudi.keygen.NonpartitionedAvroKeyGenerator
   hoodie.table.base.file.format=PARQUET
   hoodie.table.version=5
   
   
   **Expected behavior**
   
   We expect only one version of each record to be available in the latest 
queried data.
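
The expectation can be written down as a small model: for each `id`, only the row with the largest `_event_lsn` survives. This is a plain-Python sketch of that rule, not Hudi code. `latest_per_key` is a hypothetical helper, the tie-breaking choice (an equal LSN keeps the row seen first) is an assumption, and the two sample rows are copied from the query output in this report.

```python
from typing import Dict, Iterable, List


def latest_per_key(rows: Iterable[dict], key: str = "id",
                   ordering: str = "_event_lsn") -> List[dict]:
    """Keep one row per key: the one with the largest ordering value.

    Assumption: on a tie, the row seen first is kept.
    """
    winners: Dict[str, dict] = {}
    for row in rows:
        current = winners.get(row[key])
        if current is None or row[ordering] > current[ordering]:
            winners[row[key]] = row
    return list(winners.values())


# Two versions of the same id, as seen in the reported query output.
rows = [
    {"id": "Aa5udG", "updated": 1667972649, "_event_lsn": 5028051185232},
    {"id": "Aa5udG", "updated": 1667998354, "_event_lsn": 5037873812216},
]

snapshot = latest_per_key(rows)
print(snapshot)
```

Under this model the query should return a single `Aa5udG` row, the one with `_event_lsn` 5037873812216.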
   
   **Environment Description**
   
   * Hudi version : 0.12.1
   
   * Spark version : 3.2.1
   
   * Hive version : 2.3.5
   
   * Hadoop version : 2.7.7
   
   * Storage (HDFS/S3/GCS..) : GCS
   
   * Running on Docker? (yes/no) : yes.
   
   
   **Additional context**
   
   hoodie.datasource.write.recordkey.field=id
   hoodie.datasource.write.partitionpath.field=
   
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedAvroKeyGenerator
   hoodie.cleaner.policy=KEEP_LATEST_COMMITS
   hoodie.clean.automatic=true
   hoodie.clean.async=true
   hoodie.cleaner.commits.retained=5
   hoodie.keep.min.commits=10
   #compaction config
   hoodie.datasource.compaction.async.enable=true
   hoodie.parquet.small.file.limit=104857600
   hoodie.compaction.target.io=50
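
One quick triage step is to confirm that the writer-side key settings agree with what the table has recorded. The sketch below is plain Python with a hypothetical `mismatches` helper; the pairing of table-level and writer-level keys is an assumption made for this check, and only properties that appear in this report are compared. The writer-side list above does not show a precombine field or payload class, so those are left out.

```python
def parse_properties(text: str) -> dict:
    """Parse key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# (table-level key, writer-level key) pairs assumed to carry the same value.
PAIRS = [
    ("hoodie.table.recordkey.fields",
     "hoodie.datasource.write.recordkey.field"),
    ("hoodie.table.partition.fields",
     "hoodie.datasource.write.partitionpath.field"),
    ("hoodie.table.keygenerator.class",
     "hoodie.datasource.write.keygenerator.class"),
]


def mismatches(table: dict, writer: dict) -> list:
    """Return the pairs whose values differ (missing keys count as empty)."""
    return [(t, w) for t, w in PAIRS if table.get(t, "") != writer.get(w, "")]


table_props = parse_properties("""
hoodie.table.precombine.field=_event_lsn
hoodie.table.partition.fields=
hoodie.table.recordkey.fields=id
hoodie.table.keygenerator.class=org.apache.hudi.keygen.NonpartitionedAvroKeyGenerator
""")

writer_props = parse_properties("""
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.partitionpath.field=
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedAvroKeyGenerator
#compaction config
hoodie.datasource.compaction.async.enable=true
""")

problems = mismatches(table_props, writer_props)
print(problems)
```

For the values posted here the list comes back empty, so the record key, partition path, and key generator are consistent between the two configurations.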
   
   **Stacktrace**
   
   ```
   Id           updated         _hoodie_commit_time     _event_lsn
   Aa5udG       1667998354      20221109125316627       5037873812216
   Aa5udG       1667972649      20221109055102633       5028051185232
   Aa61Gb       1667998400      20221109125802500       5037878072632
   Aa61Gb       1667972837      20221109055102633       5028239838008
   Aa7hZx       1667998411      20221109125802500       5037879344768
   Aa7hZx       1667973014      20221109055102633       5028334998944
   Aa81Sq       1667998439      20221109125802500       5037897355680
   Aa81Sq       1667973345      20221109055825061       5028484902408
   AbB9sW       1668051396      20221110034051271       5045161427664
   AbB9sW       1667974610      20221109061740419       5029141615480
   OiYzUz       1672662739      20230112125716390       6287523270024
   OiYzUz       1672662739      20230112125716390       6287523270024
   XxNzFk       1667982183      20221109082337760       5031758334520
   XxNzFk       1667981380      20221109081024733       5031516715520
   YxNzFk       1667982167      20221109082337760       5031758226096
   YxNzFk       1667981376      20221109081024733       5031516565840
   YbB9sW       1668051393      20221110034051271       5045160856976
   YbB9sW       1667974609      20221109061740419       5029141513960
   ZxNzFk       1667982174      20221109082337760       5031755205544
   ZxNzFk       1667981375      20221109081024733       5031516243272
   ZanXvJ       1668051273      20221110033657677       5045153106408
   ZanXvJ       1667967621      20221109042439193       5025825527744
   ZbB9sW       1668051391      20221110034051271       5045160222496
   ZbB9sW       1667974609      20221109061740419       5029141376128
   ```
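
Grouping the rows above by `Id` helps separate two cases: keys whose two rows differ (an older version that was not replaced) and keys whose two rows are identical. The following plain-Python sketch does that grouping on the rows exactly as listed; the helper names are hypothetical and nothing here calls Hudi.

```python
from collections import defaultdict

# Data rows copied from the query output above (header omitted).
QUERY_OUTPUT = """
Aa5udG 1667998354 20221109125316627 5037873812216
Aa5udG 1667972649 20221109055102633 5028051185232
Aa61Gb 1667998400 20221109125802500 5037878072632
Aa61Gb 1667972837 20221109055102633 5028239838008
Aa7hZx 1667998411 20221109125802500 5037879344768
Aa7hZx 1667973014 20221109055102633 5028334998944
Aa81Sq 1667998439 20221109125802500 5037897355680
Aa81Sq 1667973345 20221109055825061 5028484902408
AbB9sW 1668051396 20221110034051271 5045161427664
AbB9sW 1667974610 20221109061740419 5029141615480
OiYzUz 1672662739 20230112125716390 6287523270024
OiYzUz 1672662739 20230112125716390 6287523270024
XxNzFk 1667982183 20221109082337760 5031758334520
XxNzFk 1667981380 20221109081024733 5031516715520
YxNzFk 1667982167 20221109082337760 5031758226096
YxNzFk 1667981376 20221109081024733 5031516565840
YbB9sW 1668051393 20221110034051271 5045160856976
YbB9sW 1667974609 20221109061740419 5029141513960
ZxNzFk 1667982174 20221109082337760 5031755205544
ZxNzFk 1667981375 20221109081024733 5031516243272
ZanXvJ 1668051273 20221110033657677 5045153106408
ZanXvJ 1667967621 20221109042439193 5025825527744
ZbB9sW 1668051391 20221110034051271 5045160222496
ZbB9sW 1667974609 20221109061740419 5029141376128
"""


def parse_rows(text):
    """Turn whitespace-separated lines into dicts; skip malformed lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue
        rows.append({
            "id": parts[0],
            "updated": int(parts[1]),
            "commit_time": parts[2],  # fixed-width string, sorts correctly
            "event_lsn": int(parts[3]),
        })
    return rows


def group_by_id(rows):
    groups = defaultdict(list)
    for row in rows:
        groups[row["id"]].append(row)
    return dict(groups)


rows = parse_rows(QUERY_OUTPUT)
groups = group_by_id(rows)

# Keys whose copies are identical in every listed column.
exact_duplicates = sorted(
    k for k, v in groups.items() if len(v) > 1 and all(r == v[0] for r in v)
)
# Keys whose copies differ, i.e. an older version is still visible.
stale_versions = sorted(
    k for k, v in groups.items() if len(v) > 1 and k not in exact_duplicates
)
# For differing pairs: does the highest LSN also have the latest commit time?
lsn_matches_commit_order = all(
    max(groups[k], key=lambda r: r["event_lsn"])
    == max(groups[k], key=lambda r: r["commit_time"])
    for k in stale_versions
)

print(len(groups), exact_duplicates, len(stale_versions), lsn_matches_commit_order)
```

On the listed rows this gives 12 keys with two rows each. Eleven of them differ in `_event_lsn` and commit time, with the higher LSN always on the later commit, while `OiYzUz` appears twice with identical values and so may be a different kind of duplicate from the rest.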
   
   

