parisni opened a new issue, #9026: URL: https://github.com/apache/hudi/issues/9026
hudi >= 0.11 (including 0.13.1)

I noticed we have duplicates in our metadata tables:

```
>>> spark.read.format("hudi").load("/tmp/metadata").filter("key='version=2/event_date=2009-12-03/event_hour=08'").select("key", "filesystemMetadata").show(10, False, True)
-RECORD 0-------------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
-RECORD 1-------------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
-RECORD 2-------------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
-RECORD 3-------------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
-RECORD 4-------------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
-RECORD 5-------------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
-RECORD 6-------------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
-RECORD 7-------------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
```

Interestingly, I also have 1 hfile and 7 log files. On our other tables, the number of duplicated partitions is equal to the number of log files.

```
ls /tmp/metadata/files/
.hoodie_partition_metadata
files-0000_0-16-519_20230620071307473001.hfile
.files-0000_20230620071307473001.log.1_0-23-725
.files-0000_20230620071307473001.log.2_0-16-719
.files-0000_20230620071307473001.log.3_0-16-721
.files-0000_20230620071307473001.log.4_0-16-723
.files-0000_20230620071307473001.log.5_0-16-724
.files-0000_20230620071307473001.log.6_0-16-727
.files-0000_20230620071307473001.log.7_0-16-729
```

Here is a reproducible script.
After reaching the configured number of deltacommits before compaction, the metadata table (MDT) suddenly returns duplicates when read:

```python
tableName = 'test_corrupted_mdt'
basePath = "/tmp/{tableName}".format(tableName=tableName)

hudi_options = {
    "hoodie.table.name": tableName,
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.partitionpath.field": "version,event_date",
    "hoodie.datasource.write.table.name": tableName,
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.hive_sync.enable": "false",
    "hoodie.metadata.enable": "true",
}

mode = "overwrite"
for i in range(1, 11):
    df = spark.sql("select '1' as event_id, '2' as ts, '" + str(i) + "' as version, 'foo' as event_date")
    df.write.format("hudi").options(**hudi_options).mode(mode).save(basePath)
    mode = "append"

>>> spark.read.format("hudi").load(basePath + "/.hoodie/metadata").count()
21
>>> spark.read.format("hudi").load(basePath + "/.hoodie/metadata").select("key").dropDuplicates().count()
11
```
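The "1 hfile + 7 logs" observation above matches a pattern where records are not merged by key across the base file and log files. The following is a minimal, hypothetical sketch in plain Python (not Hudi internals; the file and key names are placeholders) illustrating how scanning a base file plus N log files without merging by key would yield N + 1 copies of a key, while a merge-by-key read yields one:

```python
# Placeholder key and filesystemMetadata record (assumed shapes, for illustration only)
key = "version=2/event_date=2009-12-03/event_hour=08"
record = {"some-file.parquet": (445028, False)}

base_file = {key: record}                      # stands in for the compacted .hfile
log_files = [{key: record} for _ in range(7)]  # 7 deltas each re-emitting the same key

# Unmerged scan: concatenate everything, as the duplicated output above resembles
unmerged = [(k, v) for src in [base_file, *log_files] for k, v in src.items()]
print(len(unmerged))  # 8 records for a single key (1 base + 7 logs)

# Merged scan: collapse by key, later entries overwriting earlier ones
merged = {}
for src in [base_file, *log_files]:
    merged.update(src)
print(len(merged))  # 1 record
```

Under this reading, the reproducer's 21 total rows vs. 11 distinct keys would be the same symptom at a smaller scale.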