d4r3topk opened a new issue, #8644: URL: https://github.com/apache/hudi/issues/8644
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
- Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org.
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

We have shifted to Hudi 0.13.0 on Spark 3.3, using it as an external library JAR for ingesting datasets with AWS Glue streaming jobs. We are noticing at least 10-13% data loss in every partition (partitioned by `region=x/year=xxxx/month=xx/day=xx/hour=xx`). The job that ingests this dataset and exhibits the data loss splits an incoming dynamic frame on a field and then creates/updates a different Hudi table per field value (at most 3 tables via 1 job). We have another job that does not do this splitting and ingests the incoming dynamic frame into a single table; that job has no data loss.

The jobs run with the following parameters:

```python
init_load_config = {"hoodie.datasource.write.operation": "bulk_insert"}

partition_data_config = {
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.datasource.write.partitionpath.field": "region,year,month,day,hour",
    "hoodie.datasource.hive_sync.partition_fields": "region,year,month,day,hour",
}

init_common_config = {
    "className": "org.apache.hudi",
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.write.reconcile.schema": "true",
    "write.parquet.max.file.size": 268435456,  # 256 MB = 268435456 Bytes
    "hoodie.parquet.small.file.limit": 209715200,  # 200 MB = 209715200 Bytes
    "hoodie.datasource.write.hive_style_partitioning": "true",
}

metadata_config = {
    "hoodie.archive.async": "true",
    "hoodie.archive.merge.enable": "true",
    "hoodie.datasource.hive_sync.support_timestamp": "true",
    "clustering.plan.partition.filter.mode": "RECENT_DAYS",
    "clustering.plan.strategy.daybased.lookback.partitions": 2,
}

plug_config = {
    "hoodie.datasource.write.precombine.field": "ApproximateArrivalTimestamp",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.hive_sync.database": "db_name",
    "hoodie.table.name": "table_name",
    "hoodie.datasource.hive_sync.table": "table_name",
    "hoodie.copyonwrite.record.size.estimate": 2000,
    "hoodie.clustering.async": "false",
    "hoodie.metadata.enable": "false",
}
```

When we first noticed data loss, clustering and the metadata table were enabled. However, after reading about the regression of the metadata table and timeline server for streaming jobs (with COW), we disabled them and created a completely new job with a new destination and new table names to validate that there is no data loss, but we are still losing data.

More context:

* No error messages in the Spark logs.
* This job splits a data stream on a field into 3 Glue dynamic frames and then runs a for loop to ingest each of them. To keep the names distinct, we suffix the `hoodie.table.name` and `hoodie.datasource.hive_sync.table` properties with the field value (for example, `table_name_dataset1`, where `dataset1` is the field value). I suspect the 3 table updates are somehow dropping records.
* Ran `commits show` in the Hudi CLI and inspected a few commits; it shows 0 write errors for the ones I've checked so far. (If there is an automated way to check all commits for errors at scale, I would be happy to learn about it.)

**To Reproduce**

Steps to reproduce the behavior:

1. Write a Glue dynamic frame with 5 fields
2. Split the dynamic frame on any one field
3. Iterate through the resulting dynamic frames and ingest each into a different Hudi table

**Expected behavior**

There should be no data loss, and counts should match exactly. In case any data loss occurs, the logs should indicate the loss and its potential reason.
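To make the per-table loop concrete, here is a minimal sketch (hypothetical helper names, not the actual job code) of how the per-dataset options could be assembled: the shared config dicts are merged, and the two table-name properties are suffixed with the split-field value.

```python
# Hypothetical sketch: merge the shared Hudi config dicts and derive
# per-dataset table names by suffixing the split-field value.
def build_hudi_options(base_table_name: str, dataset: str, *configs: dict) -> dict:
    """Merge shared config dicts; later dicts override earlier ones."""
    options = {}
    for cfg in configs:
        options.update(cfg)
    # Each split gets its own table, e.g. table_name_dataset1
    options["hoodie.table.name"] = f"{base_table_name}_{dataset}"
    options["hoodie.datasource.hive_sync.table"] = f"{base_table_name}_{dataset}"
    return options

# Abbreviated versions of the configs above, for illustration
partition_data_config = {
    "hoodie.datasource.write.partitionpath.field": "region,year,month,day,hour",
}
plug_config = {
    "hoodie.table.name": "table_name",
    "hoodie.datasource.hive_sync.table": "table_name",
}

opts = build_hudi_options("table_name", "dataset1", partition_data_config, plug_config)
# The suffixed names override the shared base name
assert opts["hoodie.table.name"] == "table_name_dataset1"
```

Since all three tables are written inside one micro-batch, it is worth confirming that each iteration really gets a disjoint slice of the frame and that no table shares a base path with another.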
**Environment Description**

* Hudi version : 0.13.0 / 0.12.1
* Spark version : 3.3
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No. AWS Glue streaming job.

**Additional context**

Here are the `pk_count` values for multiple partitions, with different parameters and hudi/non-hudi:

| day | hour | hudi-metadata-and-clustering-enabled | hudi-no-metadata-no-clustering-enabled | non-hudi |
|-----|------|--------------------------------------|----------------------------------------|----------|
| 04 | 00 | 362 | 366 | 366 |
| 04 | 01 | 262 | 367 | 367 |
| 04 | 02 | 252 | 353 | 365 |
| 04 | 03 | 262 | 360 | 366 |
| 04 | 04 | 248 | 354 | 366 |
| 04 | 05 | 252 | 366 | 366 |
| 04 | 06 | 251 | 365 | 365 |
| 04 | 07 | 256 | 368 | 368 |
| 04 | 08 | 251 | 360 | 366 |
| 04 | 09 | 253 | 359 | 365 |
| 04 | 10 | 258 | 366 | 366 |
| 04 | 11 | 258 | 366 | 366 |
| 04 | 12 | 260 | 360 | 366 |
| 04 | 13 | 252 | 364 | 364 |
| 04 | 14 | 253 | 366 | 366 |
| 04 | 15 | 254 | 361 | 367 |
| 04 | 16 | 254 | 365 | 365 |
| 04 | 17 | 250 | 314 | 366 |
| 04 | 18 | 250 | 360 | 367 |
| 04 | 19 | 253 | 318 | 365 |
| 04 | 20 | 257 | 364 | 364 |
| 04 | 21 | 252 | 364 | 364 |
| 04 | 22 | 266 | 364 | 371 |
| 04 | 23 | 256 | 368 | 368 |

**Stacktrace**

No stacktrace as such.
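As a quick sanity check on the counts above, the loss relative to the non-hudi baseline can be computed directly. A small sketch over a sample of the rows (hours 00, 01, 02, and 17):

```python
# Percentage loss vs. the non-hudi baseline, from a sample of the
# counts reported above: hour -> (metadata+clustering, no-metadata, non-hudi)
counts = {
    "00": (362, 366, 366),
    "01": (262, 367, 367),
    "02": (252, 353, 365),
    "17": (250, 314, 366),
}

baseline = sum(c[2] for c in counts.values())       # non-hudi total
with_metadata = sum(c[0] for c in counts.values())  # metadata+clustering total
no_metadata = sum(c[1] for c in counts.values())    # no-metadata total

def loss_pct(n: int) -> float:
    return round(100 * (baseline - n) / baseline, 1)

print(loss_pct(with_metadata))  # 23.1 (% lost with metadata+clustering, this sample)
print(loss_pct(no_metadata))    # 4.4 (% lost without, this sample)
```

On this sample the metadata+clustering run loses far more than the no-metadata run, which is consistent with the two configurations behaving differently rather than a single shared cause.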
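Regarding the question above about auditing write errors across all commits at scale: instead of the CLI, the commit metadata files can be parsed directly. This is a hedged sketch, assuming Hudi's `HoodieCommitMetadata` JSON layout for COW `.commit` files (`partitionToWriteStats` mapping each partition to a list of write stats, each carrying a `totalWriteErrors` field) — verify the field names against one of your actual commit files before relying on it.

```python
# Sketch: sum totalWriteErrors across every commit file in a table's
# .hoodie folder (local or mounted path; for S3 use an s3fs-style mount
# or list/download the files first).
import glob
import json
import os

def total_write_errors(commit_metadata: dict) -> int:
    """Sum totalWriteErrors over every write stat in one commit's metadata."""
    total = 0
    for stats in commit_metadata.get("partitionToWriteStats", {}).values():
        for stat in stats:
            total += stat.get("totalWriteErrors", 0)
    return total

def audit_table(base_path: str) -> dict:
    """Map each *.commit file under <base_path>/.hoodie to its error count."""
    results = {}
    for path in sorted(glob.glob(os.path.join(base_path, ".hoodie", "*.commit"))):
        with open(path) as f:
            results[os.path.basename(path)] = total_write_errors(json.load(f))
    return results
```

Note that `totalWriteErrors` only counts records that failed during the write; it will not surface records that were deduplicated by the precombine field or never reached the writer, so a source-vs-sink count comparison is still needed alongside it.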