Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]
noahtaite commented on issue #10172: URL: https://github.com/apache/hudi/issues/10172#issuecomment-2115213114

Apologies again for the delay; we shelved clustering after this experiment and re-generated our lake with proper file sizing. Since this issue could affect others, I'll share the configs that got us there:

Upsert config:

```json
{
  "hoodie.datasource.hive_sync.database": "db",
  "hoodie.global.simple.index.parallelism": "1920",
  "hoodie.datasource.hive_sync.mode": "hms",
  "hoodie.datasource.hive_sync.support_timestamp": "true",
  "hoodie.schema.on.read.enable": "false",
  "path": "s3://bucket/table.all_hudi",
  "hoodie.datasource.write.precombine.field": "CaptureDate",
  "hoodie.datasource.hive_sync.partition_fields": "datasource,year,month",
  "hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload",
  "hoodie.datasource.hive_sync.use_jdbc": "false",
  "hoodie.meta.sync.metadata_file_listing": "true",
  "hoodie.cleaner.parallelism": "1920",
  "hoodie.datasource.meta.sync.enable": "true",
  "hoodie.datasource.hive_sync.skip_ro_suffix": "true",
  "hoodie.metadata.enable": "true",
  "hoodie.datasource.hive_sync.table": "table_all",
  "hoodie.datasource.meta_sync.condition.sync": "true",
  "hoodie.index.type": "GLOBAL_BLOOM",
  "hoodie.clean.automatic": "true",
  "hoodie.datasource.write.operation": "upsert",
  "hoodie.datasource.hive_sync.enable": "true",
  "hoodie.datasource.write.recordkey.field": "uuid",
  "hoodie.table.name": "table_all",
  "hoodie.write.lock.dynamodb.billing_mode": "PAY_PER_REQUEST",
  "hoodie.datasource.write.table.type": "MERGE_ON_READ",
  "hoodie.datasource.write.hive_style_partitioning": "true",
  "hoodie.write.lock.dynamodb.endpoint_url": "*(redacted)",
  "hoodie.simple.index.parallelism": "1920",
  "hoodie.write.lock.dynamodb.partition_key": "table_all",
  "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
  "hoodie.write.concurrency.early.conflict.detection.enable": "true",
  "hoodie.compact.inline": "true",
  "hoodie.datasource.write.reconcile.schema": "true",
  "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
  "hoodie.cleaner.policy.failed.writes": "LAZY",
  "hoodie.keep.max.commits": "110",
  "hoodie.upsert.shuffle.parallelism": "1920",
  "hoodie.meta.sync.client.tool.class": "org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool",
  "hoodie.cleaner.commits.retained": "90",
  "hoodie.write.lock.dynamodb.table": "hudi-lock-provider",
  "hoodie.write.lock.provider": "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
  "hoodie.keep.min.commits": "100",
  "hoodie.datasource.write.partitionpath.field": "datasource,year,month",
  "hoodie.write.concurrency.mode": "OPTIMISTIC_CONCURRENCY_CONTROL",
  "hoodie.write.lock.dynamodb.region": "us-east-1"
}
```

Clustering properties:

```properties
hoodie.clustering.async.enabled=true
hoodie.clustering.async.max.commits=1
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.plan.strategy.target.file.max.bytes=524288000
hoodie.clustering.plan.strategy.small.file.limit=10485760
hoodie.clustering.preserve.commit.metadata=true
```

Clustering job:

```
spark-submit \
  --class org.apache.hudi.utilities.HoodieClusteringJob \
  /usr/lib/hudi/hudi-utilities-bundle.jar \
  --props s3://bucket/properties/nt.clustering.properties \
  --mode scheduleAndExecute \
  --base-path s3://bucket/table.all_hudi/ \
  --table-name table_all \
  --spark-memory 90g \
  --parallelism 1000
```

From what I could gather, it appears that applying soft deletes moves records to `__HIVE_DEFAULT_PARTITION__`, and when using a global index the old version of those records can still be visible in a snapshot query until compaction is run. I observed this in Hudi 0.12.1 (AWS EMR 6.9.0). I don't currently have the bandwidth to experiment with this in our latest stable Hudi 0.13.1 (AWS EMR 6.12.0) job. Thanks again for all your help.
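For anyone hitting the same symptom, a minimal sketch of how one might confirm where the soft-deleted rows landed, assuming the same table path and record key as in the configs above (the bucket path is a placeholder, not the real one):

```scala
import org.apache.spark.sql.functions.col

// Snapshot read of the MoR table (placeholder path from the configs above).
val snap = spark.read.format("hudi").load("s3://bucket/table.all_hudi/")

// Count records whose partition path resolved to the Hive default partition.
// With a global index, soft deletes that null out a partition column can land
// here, and the stale copy in the original partition may remain visible until
// compaction reconciles the log files.
snap.filter(col("_hoodie_partition_path").contains("__HIVE_DEFAULT_PARTITION__"))
  .count()
```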
Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]
nsivabalan commented on issue #10172: URL: https://github.com/apache/hudi/issues/10172#issuecomment-2044025853

Hey @noahtaite, any follow-ups on this?
Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]
noahtaite commented on issue #10172: URL: https://github.com/apache/hudi/issues/10172#issuecomment-1875739559

Hi @ad1happy2go, apologies for the delayed response due to holidays. I will update this post with complete configs shortly.
Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]
ad1happy2go commented on issue #10172: URL: https://github.com/apache/hudi/issues/10172#issuecomment-1845789929

@noahtaite Can you also post the complete table and writer configs?
Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]
ad1happy2go commented on issue #10172: URL: https://github.com/apache/hudi/issues/10172#issuecomment-1845645037

@noahtaite Sorry for the delay here. We will look into it soon. Did you try except() after dropping the Hudi meta columns?
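For reference, a sketch of that comparison, assuming a pre-clustering copy of the table was retained at a hypothetical second path (`table.all_hudi_control` is a placeholder); the column list is the standard set of meta columns Hudi adds to every record:

```scala
// Hudi meta columns differ across commits and file groups, so drop them
// before diffing; otherwise except() reports false positives.
val metaCols = Seq(
  "_hoodie_commit_time", "_hoodie_commit_seqno",
  "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name"
)

val ctrl      = spark.read.format("hudi").load("s3://bucket/table.all_hudi_control/") // hypothetical pre-clustering copy
val clustered = spark.read.format("hudi").load("s3://bucket/table.all_hudi/")

// Rows present in the clustered snapshot but not in the control snapshot.
val diff = clustered.drop(metaCols: _*).except(ctrl.drop(metaCols: _*))
diff.count()
```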
Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]
noahtaite commented on issue #10172: URL: https://github.com/apache/hudi/issues/10172#issuecomment-1839189354

Bump... I think data inconsistency after clustering should be treated as a critical-priority investigation.
Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]
noahtaite closed issue #10172: [SUPPORT] Additional records in dataset after clustering URL: https://github.com/apache/hudi/issues/10172
Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]
noahtaite commented on issue #10172: URL: https://github.com/apache/hudi/issues/10172#issuecomment-1828531674

I have verified that the "additional" records are those that moved to `__HIVE_DEFAULT_PARTITION__` after we applied a soft delete from our incoming DMS records. After clustering, the snapshot shows these records as 'live', and I can see duplicate records in the output of my snapshot query:

```scala
val ctrl = spark.read.format("hudi").load("s3://bucket/path/table.all_hudi/")

ctrl.filter(col("datasource").equalTo("datasource1"))
  .filter(col("uuid").equalTo(ID_VALUE))
  .select("_hoodie_commit_time", "uuid", "Op", "CaptureDate", "totalvalue")
  .show(10, false)
```
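A sketch of how the duplicates could be surfaced across the whole table rather than for a single ID, assuming `uuid` is the record key and using the same placeholder path as the query above:

```scala
import org.apache.spark.sql.functions.{col, count, countDistinct}

val snap = spark.read.format("hudi").load("s3://bucket/path/table.all_hudi/")

// Record keys that appear more than once in the snapshot. countDistinct on
// _hoodie_partition_path shows whether the copies live in different partitions
// (e.g. the real partition plus __HIVE_DEFAULT_PARTITION__).
snap.groupBy(col("uuid"))
  .agg(
    count("*").as("copies"),
    countDistinct(col("_hoodie_partition_path")).as("partitions")
  )
  .filter(col("copies") > 1)
  .show(10, false)
```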
Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]
noahtaite commented on issue #10172: URL: https://github.com/apache/hudi/issues/10172#issuecomment-1828369177

I ran compaction on this table, followed by clustering. The counts changed as follows:

- Pre-compaction, pre-clustering: 177,822,668
- Post-compaction, pre-clustering: 177,822,668
- Post-compaction, post-clustering: **177,822,812 (144 more records)**
[I] [SUPPORT] Additional records in dataset after clustering [hudi]
noahtaite opened a new issue, #10172: URL: https://github.com/apache/hudi/issues/10172

**Describe the problem you faced**

We generated a medium-sized MoR table using bulk_insert with the following dimensions:

- 3000 partitions
- xx TB data
- xx S3 objects

Since bulk_insert does not automatically handle file sizing, we have many small files and need to run clustering on the table to improve downstream read performance. After running clustering and counting the data, the count has grown from 177,822,668 to 177,828,417 (a difference of ~6k records). When I run an **except()** between the clustered and control datasets, it outputs 3,127,201 records. I am trying to understand why the count differs after clustering, and why there are 3M supposedly different records, even though I have not changed the following default configuration:

```
hoodie.clustering.preserve.commit.metadata
When rewriting data, preserves existing hoodie_commit_time
Default Value: true (Optional)
Config Param: PRESERVE_COMMIT_METADATA
Since Version: 0.9.0
```

**To Reproduce**

Steps to reproduce the behavior:

1. Generate a MoR table and update it many times without running compaction.
2. Run a snapshot query against the table to get the count.
3. Run clustering on the table.
4. Run another snapshot query.
5. Observe that the count has changed.

**Expected behavior**

I expected the count to remain the same. I did see new files created by the clustering process.

**Environment Description**

* Hudi version : 0.12.1-amzn-0
* Spark version : 3.3.0
* Hive version : 3.1.3
* Hadoop version : 3.3.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

We noticed today that compaction has never been run on this table. Wondering if that has any impact on clustering?

**Stacktrace**

N/A
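Since compaction had never run on this MoR table, one way to see whether uncompacted log files account for the discrepancy in steps 2 and 4 is to compare snapshot and read-optimized counts at each stage. A minimal sketch, assuming the table path from the thread (a placeholder) and the standard `hoodie.datasource.query.type` read option:

```scala
val basePath = "s3://bucket/table.all_hudi/"

// Snapshot query: merges base files with pending log files (updates/deletes).
val snapshotCount = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "snapshot")
  .load(basePath)
  .count()

// Read-optimized query: base files only, ignoring uncompacted log files.
val roCount = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load(basePath)
  .count()

// A large gap here suggests many pending deletes/updates sitting in log files,
// which clustering and compaction may surface differently.
println(s"snapshot=$snapshotCount read_optimized=$roCount diff=${snapshotCount - roCount}")
```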