Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]

2024-05-16 Thread via GitHub


noahtaite commented on issue #10172:
URL: https://github.com/apache/hudi/issues/10172#issuecomment-2115213114

   Apologies again for the delay; we shelved clustering after this experiment and re-generated our lake with proper file sizing.
   
   Since this issue could affect others, I'll share my configs that got us 
there:
   Upsert config:
   ```json
   {
     "hoodie.datasource.hive_sync.database": "db",
     "hoodie.global.simple.index.parallelism": "1920",
     "hoodie.datasource.hive_sync.mode": "hms",
     "hoodie.datasource.hive_sync.support_timestamp": "true",
     "hoodie.schema.on.read.enable": "false",
     "path": "s3://bucket/table.all_hudi",
     "hoodie.datasource.write.precombine.field": "CaptureDate",
     "hoodie.datasource.hive_sync.partition_fields": "datasource,year,month",
     "hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload",
     "hoodie.datasource.hive_sync.use_jdbc": "false",
     "hoodie.meta.sync.metadata_file_listing": "true",
     "hoodie.cleaner.parallelism": "1920",
     "hoodie.datasource.meta.sync.enable": "true",
     "hoodie.datasource.hive_sync.skip_ro_suffix": "true",
     "hoodie.metadata.enable": "true",
     "hoodie.datasource.hive_sync.table": "table_all",
     "hoodie.datasource.meta_sync.condition.sync": "true",
     "hoodie.index.type": "GLOBAL_BLOOM",
     "hoodie.clean.automatic": "true",
     "hoodie.datasource.write.operation": "upsert",
     "hoodie.datasource.hive_sync.enable": "true",
     "hoodie.datasource.write.recordkey.field": "uuid",
     "hoodie.table.name": "table_all",
     "hoodie.write.lock.dynamodb.billing_mode": "PAY_PER_REQUEST",
     "hoodie.datasource.write.table.type": "MERGE_ON_READ",
     "hoodie.datasource.write.hive_style_partitioning": "true",
     "hoodie.write.lock.dynamodb.endpoint_url": "*(redacted)",
     "hoodie.simple.index.parallelism": "1920",
     "hoodie.write.lock.dynamodb.partition_key": "table_all",
     "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
     "hoodie.write.concurrency.early.conflict.detection.enable": "true",
     "hoodie.compact.inline": "true",
     "hoodie.datasource.write.reconcile.schema": "true",
     "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
     "hoodie.cleaner.policy.failed.writes": "LAZY",
     "hoodie.keep.max.commits": "110",
     "hoodie.upsert.shuffle.parallelism": "1920",
     "hoodie.meta.sync.client.tool.class": "org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool",
     "hoodie.cleaner.commits.retained": "90",
     "hoodie.write.lock.dynamodb.table": "hudi-lock-provider",
     "hoodie.write.lock.provider": "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
     "hoodie.keep.min.commits": "100",
     "hoodie.datasource.write.partitionpath.field": "datasource,year,month",
     "hoodie.write.concurrency.mode": "OPTIMISTIC_CONCURRENCY_CONTROL",
     "hoodie.write.lock.dynamodb.region": "us-east-1"
   }
   ```
   
   Clustering properties:
   ```properties
   hoodie.clustering.async.enabled=true
   hoodie.clustering.async.max.commits=1
   hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
   hoodie.clustering.plan.strategy.target.file.max.bytes=524288000
   hoodie.clustering.plan.strategy.small.file.limit=10485760
   hoodie.clustering.preserve.commit.metadata=true
   ```
   
   Clustering job:
   ```
   spark-submit --class org.apache.hudi.utilities.HoodieClusteringJob \
     /usr/lib/hudi/hudi-utilities-bundle.jar \
     --props s3://bucket/properties/nt.clustering.properties \
     --mode scheduleAndExecute \
     --base-path s3://bucket/table.all_hudi/ \
     --table-name table_all \
     --spark-memory 90g \
     --parallelism 1000
   ```
   
   
   From what I could gather, it appears that applying soft deletes moves records to `__HIVE_DEFAULT_PARTITION__`, and when using a global index the old version of those records can still be visible in a snapshot query until compaction is run. I observed this in Hudi 0.12.1 (AWS EMR 6.9.0). I don't currently have the bandwidth to experiment with this in our latest stable Hudi 0.13.1 (AWS EMR 6.12.0) job.
   
   Thanks again for all your help.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #10172:
URL: https://github.com/apache/hudi/issues/10172#issuecomment-2044025853

   Hey @noahtaite: any follow-ups on this?
   





Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]

2024-01-03 Thread via GitHub


noahtaite commented on issue #10172:
URL: https://github.com/apache/hudi/issues/10172#issuecomment-1875739559

   Hi @ad1happy2go, apologies for the delayed response due to the holidays. I will update this post with complete configs shortly.





Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]

2023-12-07 Thread via GitHub


ad1happy2go commented on issue #10172:
URL: https://github.com/apache/hudi/issues/10172#issuecomment-1845789929

   @noahtaite Can you also post the complete table and writer configs?





Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]

2023-12-07 Thread via GitHub


ad1happy2go commented on issue #10172:
URL: https://github.com/apache/hudi/issues/10172#issuecomment-1845645037

   @noahtaite Sorry for the delay here. We will look into it soon. Did you try `except()` after dropping the Hoodie meta columns?
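   For anyone comparing snapshots the same way, here is a minimal illustration of the idea in plain Python (standing in for Spark's `Dataset.except()`; the row shapes and column names are hypothetical). Meta columns such as `_hoodie_file_name` change when file groups are rewritten, so the same logical record can show up as a spurious difference unless those columns are dropped first:

   ```python
   def drop_meta(rows):
       """Remove Hudi meta columns (prefixed with _hoodie_) from each row."""
       return [{k: v for k, v in r.items() if not k.startswith("_hoodie_")}
               for r in rows]

   def except_rows(left, right):
       """Rows present in left but not in right, like Dataset.except()."""
       right_set = {tuple(sorted(r.items())) for r in right}
       return [r for r in left if tuple(sorted(r.items())) not in right_set]

   # Same logical record before and after a rewrite; only the file name differs.
   control   = [{"_hoodie_file_name": "f1.parquet", "uuid": "a", "totalvalue": 10}]
   clustered = [{"_hoodie_file_name": "f2.parquet", "uuid": "a", "totalvalue": 10}]

   # With meta columns included, the record looks different...
   assert len(except_rows(control, clustered)) == 1
   # ...but after dropping them, the datasets agree.
   assert len(except_rows(drop_meta(control), drop_meta(clustered))) == 0
   ```

   In Spark the equivalent would be dropping the `_hoodie_*` columns from both DataFrames before calling `except()`.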





Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]

2023-12-04 Thread via GitHub


noahtaite commented on issue #10172:
URL: https://github.com/apache/hudi/issues/10172#issuecomment-1839189354

   Bump... I think data inconsistency after clustering should be treated as a critical-priority investigation.





Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]

2023-11-27 Thread via GitHub


noahtaite closed issue #10172: [SUPPORT] Additional records in dataset after 
clustering
URL: https://github.com/apache/hudi/issues/10172





Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]

2023-11-27 Thread via GitHub


noahtaite commented on issue #10172:
URL: https://github.com/apache/hudi/issues/10172#issuecomment-1828531674

   I have verified that the "additional" records are those that moved to `__HIVE_DEFAULT_PARTITION__` after we applied a soft delete from our incoming DMS records.
   
   After clustering, the snapshot shows these records as 'live', and I can see duplicate records in the output of my snapshot query:
   
   ```
   val ctrl = spark.read.format("hudi").load("s3://bucket/path/table.all_hudi/")

   ctrl.filter(col("datasource").equalTo("datasource1"))
     .filter(col("uuid").equalTo(ID_VALUE))
     .select("_hoodie_commit_time", "uuid", "Op", "CaptureDate", "totalvalue")
     .show(10, false)
   ```





Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]

2023-11-27 Thread via GitHub


noahtaite commented on issue #10172:
URL: https://github.com/apache/hudi/issues/10172#issuecomment-1828369177

   I ran compaction on this table, followed by clustering. The results are as follows:
   - Pre-compaction, pre-clustering: 177,822,668
   - Post-compaction, pre-clustering: 177,822,668
   - Post-compaction, post-clustering: **177,822,812 (144 more records)**





[I] [SUPPORT] Additional records in dataset after clustering [hudi]

2023-11-24 Thread via GitHub


noahtaite opened a new issue, #10172:
URL: https://github.com/apache/hudi/issues/10172

   **Describe the problem you faced**
   
   We generated a medium-sized MoR table using bulk_insert with the following 
dimensions:
   - 3000 partitions 
   - xx TB data
   - xx S3 objects
   
   Since we have many small files due to bulk_insert not automatically handling 
file sizing, we need to run clustering on the table to improve downstream read 
performance. 
   
   After running clustering and counting the data, my count grew from 177,822,668 to 177,828,417 (a difference of ~6k records). When I run an **except()** between the clustered and control datasets, it outputs 3,127,201 records.
   
   I am trying to understand why there is a difference in count after running 
clustering and why there are 3M supposedly different records even though I have 
not changed the following default configuration:
   ```
   hoodie.clustering.preserve.commit.metadata
   When rewriting data, preserves existing hoodie_commit_time
   Default Value: true (Optional)
   Config Param: PRESERVE_COMMIT_METADATA
   Since Version: 0.9.0
   ```
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Generate MoR table and update many times without running compaction.
   2. Run a snapshot query against the table to get the count.
   3. Run clustering on the table.
   4. Run another snapshot query.
   5. Observe that the count has changed.
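   Since clustering only rewrites existing file groups, the check in steps 2-5 reduces to a simple invariant: the snapshot counts before and after should be equal. A sketch of the check in plain Python, using the counts reported in this issue:

   ```python
   # Snapshot counts from steps 2 and 4 of the reproduction above.
   pre_clustering = 177_822_668
   post_clustering = 177_828_417

   # Clustering should not add or remove records, so drift should be 0.
   drift = post_clustering - pre_clustering
   assert drift == 5_749  # the ~6k-record anomaly observed here
   ```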
   
   **Expected behavior**
   
   I expected the count to remain the same. I did see new files created from 
the clustering process.
   
   **Environment Description**
   
   * Hudi version : 0.12.1-amzn-0
   
   * Spark version : 3.3.0
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.3.3
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   - We noticed today that compaction has never been run on this table. 
Wondering if that has any impact on clustering?
   
   
   **Stacktrace**
   
   N/A
   

