parisni opened a new issue, #9026:
URL: https://github.com/apache/hudi/issues/9026

   Hudi >= 0.11 (including 0.13.1)
   
   I noticed we have duplicates in our metadata tables:
   ```
   >>> spark.read.format("hudi").load("/tmp/metadata").filter("key='version=2/event_date=2009-12-03/event_hour=08'").select("key", "filesystemMetadata").show(10, False, True)
   -RECORD 0--------------------------------------------------------------------------------------------------------------
    key                | version=2/event_date=2009-12-03/event_hour=08
    filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
   -RECORD 1--------------------------------------------------------------------------------------------------------------
    key                | version=2/event_date=2009-12-03/event_hour=08
    filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
   -RECORD 2--------------------------------------------------------------------------------------------------------------
    key                | version=2/event_date=2009-12-03/event_hour=08
    filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
   -RECORD 3--------------------------------------------------------------------------------------------------------------
    key                | version=2/event_date=2009-12-03/event_hour=08
    filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
   -RECORD 4--------------------------------------------------------------------------------------------------------------
    key                | version=2/event_date=2009-12-03/event_hour=08
    filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
   -RECORD 5--------------------------------------------------------------------------------------------------------------
    key                | version=2/event_date=2009-12-03/event_hour=08
    filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
   -RECORD 6--------------------------------------------------------------------------------------------------------------
    key                | version=2/event_date=2009-12-03/event_hour=08
    filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
   -RECORD 7--------------------------------------------------------------------------------------------------------------
    key                | version=2/event_date=2009-12-03/event_hour=08
    filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
   ```
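   For reference, here is the kind of check we used to quantify the duplication per key. This is a minimal sketch, assuming `spark` is an active SparkSession and that `/tmp/metadata` is the metadata table path used above:
   
   ```python
   # Sketch: count how many copies of each metadata key exist.
   # `/tmp/metadata` is the metadata table location from the example above.
   mdt = spark.read.format("hudi").load("/tmp/metadata")
   (mdt.groupBy("key")
       .count()
       .filter("count > 1")
       .orderBy("count", ascending=False)
       .show(20, False))
   ```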
   
   Interestingly, I also have 1 hfile and 7 log files. On our other tables, the number of duplicated partitions is equal to the number of log files.
   ```
   ls /tmp/metadata/files/
   .hoodie_partition_metadata
   files-0000_0-16-519_20230620071307473001.hfile
   .files-0000_20230620071307473001.log.1_0-23-725
   .files-0000_20230620071307473001.log.2_0-16-719
   .files-0000_20230620071307473001.log.3_0-16-721
   .files-0000_20230620071307473001.log.4_0-16-723
   .files-0000_20230620071307473001.log.5_0-16-724
   .files-0000_20230620071307473001.log.6_0-16-727
   .files-0000_20230620071307473001.log.7_0-16-729
   ```
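   A rough sanity check for that correlation (a sketch under the same assumptions as above; the metadata table lives at `/tmp/metadata`, with its `files` partition underneath it):
   
   ```python
   # Sketch: compare the number of metadata log files with the number of
   # duplicated rows in the metadata table. Paths follow the layout above;
   # adjust for your table.
   import os
   
   files_dir = "/tmp/metadata/files"
   num_logs = len([f for f in os.listdir(files_dir) if ".log." in f])
   
   mdt = spark.read.format("hudi").load("/tmp/metadata")
   num_dups = mdt.count() - mdt.select("key").dropDuplicates().count()
   
   print("log files:", num_logs, "duplicate rows:", num_dups)
   ```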
   
   Here is a reproducible script. Once the number of commits reaches the metadata compaction threshold, the MDT suddenly returns duplicates when read:
   
   ```python
   tableName = "test_corrupted_mdt"
   basePath = "/tmp/{tableName}".format(tableName=tableName)
   
   hudi_options = {
       "hoodie.table.name": tableName,
       "hoodie.datasource.write.recordkey.field": "event_id",
       "hoodie.datasource.write.partitionpath.field": "version,event_date",
       "hoodie.datasource.write.table.name": tableName,
       "hoodie.datasource.write.operation": "upsert",
       "hoodie.datasource.write.precombine.field": "ts",
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
       "hoodie.datasource.write.hive_style_partitioning": "true",
       "hoodie.datasource.hive_sync.enable": "false",
       "hoodie.metadata.enable": "true",
   }
   
   # First write creates the table; the following nine writes append,
   # each landing in a new partition (version=i).
   mode = "overwrite"
   for i in range(1, 11):
       df = spark.sql("select '1' as event_id, '2' as ts, '{}' as version, 'foo' as event_date".format(i))
       df.write.format("hudi").options(**hudi_options).mode(mode).save(basePath)
       mode = "append"
   
   >>> spark.read.format("hudi").load(basePath + "/.hoodie/metadata").count()
   21
   >>> spark.read.format("hudi").load(basePath + "/.hoodie/metadata").select("key").dropDuplicates().count()
   11
   ```
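   As a stop-gap we considered forcing a rebuild of the metadata table. This is only a sketch, not a confirmed fix: it assumes that a write performed with `hoodie.metadata.enable=false` deletes the existing MDT, so that the next write with it enabled bootstraps a fresh one.
   
   ```python
   # Sketch of a workaround: one write with metadata disabled (which, as we
   # understand it, deletes the MDT), then re-enable it so the next write
   # rebuilds the metadata table from the data files.
   rebuild_options = dict(hudi_options)
   rebuild_options["hoodie.metadata.enable"] = "false"
   
   df = spark.sql("select '1' as event_id, '3' as ts, '1' as version, 'foo' as event_date")
   df.write.format("hudi").options(**rebuild_options).mode("append").save(basePath)
   
   # Next write with metadata enabled should bootstrap a fresh MDT.
   df.write.format("hudi").options(**hudi_options).mode("append").save(basePath)
   ```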

