parisni commented on issue #9026: URL: https://github.com/apache/hudi/issues/9026#issuecomment-1627681838
@yihua

> If the metadata table is queried through Spark datasource directly after MDT compaction (i.e., no additional log file in the latest file slice), there is no duplicate.

Did you add a new partition during that step? It turns out the duplication occurs when new partitions are added after compaction. See below: when no new partitions are added, there is no duplication; when new partitions are added, the metadata table returns tons of duplicates.

```python
# Assumes a pyspark shell where `spark` and `sc` already exist.
sc.setLogLevel("ERROR")
tableName = 'test_corrupted_mdt'
basePath = "/tmp/{tableName}".format(tableName=tableName)

hudi_options = {
    "hoodie.table.name": tableName,
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.partitionpath.field": "part",
    "hoodie.datasource.write.table.name": tableName,
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.hive_sync.enable": "false",
    "hoodie.metadata.enable": "true",
}

mode = "overwrite"  # first write creates the table, later writes append
for i in range(1, 22):
    df = spark.sql("select '1' as event_id, '2' as ts, '" + str(i) + "' as part")  # <-- W/ adding new partitions
    # df = spark.sql("select '1' as event_id, '2' as ts, '2' as part")  # <-- W/O adding new partitions
    (df.write.format("hudi").options(**hudi_options).mode(mode).save(basePath))
    mode = "append"
    # Count the records returned when reading the metadata table directly
    ct = spark.read.format("hudi").load(basePath + "/.hoodie/metadata").count()
    print("NB:" + str(ct) + " for iteration:" + str(i))

# Output W/ adding new partitions:
# NB:2 for iteration:1
# NB:3 for iteration:2
# NB:4 for iteration:3
# NB:5 for iteration:4
# NB:6 for iteration:5
# NB:7 for iteration:6
# NB:8 for iteration:7
# NB:9 for iteration:8
# NB:10 for iteration:9
# NB:21 for iteration:10 <--- MDT COMPACTION
# NB:32 for iteration:11
# NB:43 for iteration:12
# NB:54 for iteration:13
# NB:65 for iteration:14
# NB:76 for iteration:15
# NB:87 for iteration:16
# NB:98 for iteration:17
# NB:109 for iteration:18
# NB:120 for iteration:19
# NB:41 for iteration:20 <--- MDT COMPACTION
# NB:62 for iteration:21

# Output W/O adding new partitions:
# NB:2 for iteration:1
# NB:2 for iteration:2
# NB:2 for iteration:3
# NB:2 for iteration:4
# NB:2 for iteration:5
# NB:2 for iteration:6
# NB:2 for iteration:7
# NB:2 for iteration:8
# NB:2 for iteration:9
# NB:2 for iteration:10 <--- MDT COMPACTION
# NB:2 for iteration:11
# NB:2 for iteration:12
# NB:2 for iteration:13
# NB:2 for iteration:14
# NB:2 for iteration:15
# NB:2 for iteration:16
# NB:2 for iteration:17
# NB:2 for iteration:18
# NB:2 for iteration:19
# NB:2 for iteration:20 <--- MDT COMPACTION
# NB:2 for iteration:21
```
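To make the size of the duplication easier to see, here is a small sketch that compares the counts reported above (the run that adds a partition on each iteration) with the count one would expect. It rests on an assumption that is not confirmed by Hudi itself, only suggested by the output: after iteration `i` a healthy metadata table returns `i + 1` records (one per partition plus one extra), which matches iterations 1 to 9 and the constant 2 in the run without new partitions. The helper name `surplus_per_iteration` is hypothetical.

```python
# Counts copied from the run above that adds a new partition on each iteration.
observed = [2, 3, 4, 5, 6, 7, 8, 9, 10, 21, 32, 43, 54, 65,
            76, 87, 98, 109, 120, 41, 62]


def surplus_per_iteration(counts):
    """Return {iteration: observed - expected} for iterations that deviate.

    Assumption (not verified against Hudi): after iteration i the table has
    i partitions and a healthy metadata table returns i + 1 records.
    """
    result = {}
    for i, count in enumerate(counts, start=1):
        expected = i + 1
        if count != expected:
            result[i] = count - expected
    return result


surplus = surplus_per_iteration(observed)
for iteration, extra in surplus.items():
    print(f"iteration {iteration}: {extra} more records than expected")
```

Under that assumption, the first mismatch appears at iteration 10 (the first MDT compaction), the surplus grows by 10 on every iteration up to 100 at iteration 19, and it drops to 20 at iteration 20 (the second MDT compaction).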