[ https://issues.apache.org/jira/browse/HUDI-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yue Zhang updated HUDI-2892: ---------------------------- Description: Step 1 Do a normal hudi insert drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 11:39 .aux/ drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 .temp/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:39 20211130113918979.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.inflight drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 archived/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 11:39 hoodie.properties Step 2 Build a clustering plan but don't execute this plan 20211130114103632.replacecommit.requested will cluster data files from 20211130113918979.commit drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 11:39 .aux/ drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 .temp/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:39 20211130113918979.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 11:41 20211130114103632.replacecommit.requested drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 archived/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 11:39 hoodie.properties Step 3 Do a few times hudi insert and trigger several archivals drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 11:39 .aux/ drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:44 .temp/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:39 20211130113918979.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 11:41 20211130114103632.replacecommit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:41 20211130114122881.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:41 20211130114122881.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:41 20211130114122881.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:42 20211130114207164.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:42 20211130114207164.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:42 20211130114207164.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:44 20211130114351703.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:43 20211130114351703.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:43 20211130114351703.inflight drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 archived/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 11:39 hoodie.properties drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 13:17 .aux/ drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 13:23 .temp/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 13:17 20211130114103632.replacecommit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:18 20211130131825336.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18 20211130131825336.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18 20211130131825336.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23 20211130132256488.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22 20211130132256488.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22 20211130132256488.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23 20211130132327154.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23 20211130132327154.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23 20211130132327154.inflight drwxr-xr-x 6 yuezhang FREEWHEELMEDIA\Domain Users 192 11 30 13:23 archived/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 13:17 hoodie.properties 20211130114122881.commit 20211130114207164.commit and 20211130114351703.commit were archived. Step 4 Do query to check record numbers and based hudi data files. val frame = spark.sql("select count(*) from hudi_test").show(10000, false) => +--------+ |count(1)| +--------+ |4217794 | +--------+ val frame = spark.sql("select distinct(_hoodie_file_name) from hudi_test").show(10000, false) => +----------------------------------------------------------------------+ |_hoodie_file_name | +----------------------------------------------------------------------+ |caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet| |12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet| |ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet| |73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet| |a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet| |7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet| |823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet| |00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet| |a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet| |eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet| |b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet| |d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet| |9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet| |4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet| +----------------------------------------------------------------------+ Step 5 Stop insert and trigger that pending clustering replace request drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 13:17 .aux/ drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 13:27 .temp/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 4736 11 30 13:27 20211130114103632.replacecommit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:27 20211130114103632.replacecommit.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 13:17 20211130114103632.replacecommit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:18 20211130131825336.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18 20211130131825336.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18 20211130131825336.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23 20211130132256488.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22 20211130132256488.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22 20211130132256488.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23 20211130132327154.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23 20211130132327154.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23 20211130132327154.inflight drwxr-xr-x 6 yuezhang FREEWHEELMEDIA\Domain Users 192 11 30 13:23 archived/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 13:17 hoodie.properties Step 6 Do the same queries to check record numbers and based hudi data files. val frame = spark.sql("select count(*) from hudi_test").show(10000, false) => +--------+ |count(1)| +--------+ |2410168 | +--------+ val frame = spark.sql("select distinct(_hoodie_file_name) from hudi_test").show(10000, false) => +----------------------------------------------------------------------+ |_hoodie_file_name | +----------------------------------------------------------------------+ |caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet| |ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet| |73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet| |a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet| |7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet| |00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet| |a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet| |d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet| +----------------------------------------------------------------------+ As we can see, we get different query result compared with before-clustering and after-clustering. Also query result from Step 6 is missing records from these base file mentioned below. |12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet| |9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet| |823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet| |eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet| |b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet| |4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet| The root cause of this incomplete query results is that late finished clustering instant stain this activeTimeLine hoodie get wrong latest base file according to https://github.com/apache/hudi/blob/55ecbc662e30068ce0ed49166d254202bd598a8c/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileGroup.java#L120 To fix this bug, we need let pending clustering instant to block archive action like pending compaction did. P.S. Each ingestion will insert 602,542 records. 20211130114103632.replacecommit { "partitionToWriteStats" : { "20210623" : [ { "fileId" : "9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0", "path" : "20210623/9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0_0-1-2_20211130114103632.parquet", "prevCommit" : "null", "numWrites" : 602542, "numDeletes" : 0, "numUpdateWrites" : 0, "numInserts" : 602542, "totalWriteBytes" : 17645296, "totalWriteErrors" : 0, "tempPath" : null, "partitionPath" : "20210623", "totalLogRecords" : 0, "totalLogFilesCompacted" : 0, "totalLogSizeCompacted" : 0, "totalUpdatedRecordsCompacted" : 0, "totalLogBlocks" : 0, "totalCorruptLogBlock" : 0, "totalRollbackBlocks" : 0, "fileSizeInBytes" : 17645296, "minEventTime" : null, "maxEventTime" : null } ] }, "compacted" : false, "extraMetadata" : { "schema" : "xxxxx" }, "operationType" : "CLUSTER", "partitionToReplaceFileIds" : { "20210623" : [ "ac474457-c656-4fff-ac07-7ddd1746f4cf-0", "a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0" ] }, "fileIdAndRelativePaths" : { "9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0" : "20210623/9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0_0-1-2_20211130114103632.parquet" }, "totalRecordsDeleted" : 0, "totalLogRecordsCompacted" : 0, "totalLogFilesCompacted" : 0, "totalCompactedRecordsUpdated" : 0, "totalLogFilesSize" : 0, "totalScanTime" : 0, "totalCreateTime" : 11053, "totalUpsertTime" : 0, "minAndMaxEventTime" : { "Optional.empty" : { "val" : null, "present" : false } }, "writePartitionPaths" : [ "20210623" ] } was: **Describe the problem you faced** If there's a pending clustering instant still existed in active timeline after several archival actions. Next time we finish this pending clustering instant, this clustering instant may stain the ActiveTimeLine and lead to incomplete query results **To Reproduce** **Step 1** Do a normal hudi insert ``` drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 11:39 .aux/ drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 .temp/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:39 20211130113918979.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.inflight drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 archived/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 11:39 hoodie.properties ``` **Step 2** Build a clustering plan but don't execute this plan 20211130114103632.replacecommit.requested will cluster data files from 20211130113918979.commit ``` drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 11:39 .aux/ drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 .temp/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:39 20211130113918979.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 11:41 20211130114103632.replacecommit.requested drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 archived/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 11:39 hoodie.properties ``` **Step 3** Do a few times hudi insert and trigger several archivals ``` drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 11:39 .aux/ drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:44 .temp/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:39 20211130113918979.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 11:41 20211130114103632.replacecommit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:41 20211130114122881.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:41 20211130114122881.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:41 20211130114122881.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:42 20211130114207164.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:42 20211130114207164.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:42 20211130114207164.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:44 20211130114351703.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:43 20211130114351703.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:43 20211130114351703.inflight drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 archived/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 11:39 hoodie.properties ``` ``` drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 13:17 .aux/ drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 13:23 .temp/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 13:17 20211130114103632.replacecommit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:18 20211130131825336.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18 20211130131825336.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18 20211130131825336.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23 20211130132256488.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22 20211130132256488.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22 20211130132256488.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23 20211130132327154.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23 20211130132327154.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23 20211130132327154.inflight drwxr-xr-x 6 yuezhang FREEWHEELMEDIA\Domain Users 192 11 30 13:23 archived/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 13:17 hoodie.properties ``` 20211130114122881.commit 20211130114207164.commit and 20211130114351703.commit were archived. **Step 4** Do query to check record numbers and based hudi data files. ``` val frame = spark.sql("select count(*) from hudi_test").show(10000, false) => +--------+ |count(1)| +--------+ |4217794 | +--------+ val frame = spark.sql("select distinct(_hoodie_file_name) from hudi_test").show(10000, false) => +----------------------------------------------------------------------+ |_hoodie_file_name | +----------------------------------------------------------------------+ |caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet| |12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet| |ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet| |73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet| |a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet| |7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet| |823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet| |00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet| |a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet| |eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet| |b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet| |d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet| |9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet| |4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet| +----------------------------------------------------------------------+ ``` **Step 5** Stop insert and trigger that pending clustering replace request ``` drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 13:17 .aux/ drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 13:27 .temp/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 4736 11 30 13:27 20211130114103632.replacecommit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:27 20211130114103632.replacecommit.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 13:17 20211130114103632.replacecommit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:18 20211130131825336.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18 20211130131825336.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18 20211130131825336.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23 20211130132256488.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22 20211130132256488.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22 20211130132256488.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23 20211130132327154.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23 20211130132327154.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23 20211130132327154.inflight drwxr-xr-x 6 yuezhang FREEWHEELMEDIA\Domain Users 192 11 30 13:23 archived/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 13:17 hoodie.properties ``` **Step 6** Do the same queries to check record numbers and based hudi data files. ``` val frame = spark.sql("select count(*) from hudi_test").show(10000, false) => +--------+ |count(1)| +--------+ |2410168 | +--------+ val frame = spark.sql("select distinct(_hoodie_file_name) from hudi_test").show(10000, false) => +----------------------------------------------------------------------+ |_hoodie_file_name | +----------------------------------------------------------------------+ |caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet| |ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet| |73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet| |a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet| |7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet| |00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet| |a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet| |d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet| +----------------------------------------------------------------------+ ``` As we can see, we get different query result compared with before-clustering and after-clustering. Also query result from Step 6 is missing records from these base file mentioned below. ``` |12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet| |9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet| |823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet| |eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet| |b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet| |4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet| ``` The root cause of this incomplete query results is that late finished clustering instant stain this activeTimeLine hoodie get wrong latest base file according to https://github.com/apache/hudi/blob/55ecbc662e30068ce0ed49166d254202bd598a8c/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileGroup.java#L120 To fix this bug, we need let pending clustering instant to block archive action like pending compaction did. P.S. Each ingestion will insert 602,542 records. ``` 20211130114103632.replacecommit { "partitionToWriteStats" : { "20210623" : [ { "fileId" : "9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0", "path" : "20210623/9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0_0-1-2_20211130114103632.parquet", "prevCommit" : "null", "numWrites" : 602542, "numDeletes" : 0, "numUpdateWrites" : 0, "numInserts" : 602542, "totalWriteBytes" : 17645296, "totalWriteErrors" : 0, "tempPath" : null, "partitionPath" : "20210623", "totalLogRecords" : 0, "totalLogFilesCompacted" : 0, "totalLogSizeCompacted" : 0, "totalUpdatedRecordsCompacted" : 0, "totalLogBlocks" : 0, "totalCorruptLogBlock" : 0, "totalRollbackBlocks" : 0, "fileSizeInBytes" : 17645296, "minEventTime" : null, "maxEventTime" : null } ] }, "compacted" : false, "extraMetadata" : { "schema" : "xxxxx" }, "operationType" : "CLUSTER", "partitionToReplaceFileIds" : { "20210623" : [ "ac474457-c656-4fff-ac07-7ddd1746f4cf-0", "a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0" ] }, "fileIdAndRelativePaths" : { "9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0" : "20210623/9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0_0-1-2_20211130114103632.parquet" }, "totalRecordsDeleted" : 0, "totalLogRecordsCompacted" : 0, "totalLogFilesCompacted" : 0, "totalCompactedRecordsUpdated" : 0, "totalLogFilesSize" : 0, "totalScanTime" : 0, "totalCreateTime" : 11053, "totalUpsertTime" : 0, "minAndMaxEventTime" : { "Optional.empty" : { "val" : null, "present" : false } }, "writePartitionPaths" : [ "20210623" ] } ``` **Expected behavior** A clear and concise description of what you expected to happen. **Environment Description** * Hudi version : master * Spark version : 2.4.4 * Hive version : * Hadoop version : * Storage (HDFS/S3/GCS..) : * Running on Docker? (yes/no) : **Additional context** Add any other context about the problem here. **Stacktrace** ```Add the stacktrace of the error.``` > Pending Clustering may stain the ActiveTimeLine and lead to incomplete query > results > ------------------------------------------------------------------------------------ > > Key: HUDI-2892 > URL: https://issues.apache.org/jira/browse/HUDI-2892 > Project: Apache Hudi > Issue Type: Bug > Reporter: Yue Zhang > Priority: Major > > > Step 1 > Do a normal hudi insert > drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 11:39 .aux/ > drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 .temp/ > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:39 > 20211130113918979.commit > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 > 20211130113918979.commit.requested > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 > 20211130113918979.inflight > drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 > archived/ > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 11:39 > hoodie.properties > Step 2 > Build a clustering plan but don't execute this plan > 20211130114103632.replacecommit.requested will cluster data files from > 20211130113918979.commit > drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 11:39 .aux/ > drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 .temp/ > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:39 > 20211130113918979.commit > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 > 20211130113918979.commit.requested > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 > 20211130113918979.inflight > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 11:41 > 20211130114103632.replacecommit.requested > drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 > archived/ > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 11:39 > hoodie.properties > Step 3 > Do a few times hudi insert and trigger several archivals > drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 11:39 .aux/ > drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:44 .temp/ > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:39 > 20211130113918979.commit > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 > 20211130113918979.commit.requested > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 > 20211130113918979.inflight > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 11:41 > 20211130114103632.replacecommit.requested > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:41 > 20211130114122881.commit > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:41 > 20211130114122881.commit.requested > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:41 > 20211130114122881.inflight > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:42 > 20211130114207164.commit > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:42 > 20211130114207164.commit.requested > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:42 > 20211130114207164.inflight > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:44 > 20211130114351703.commit > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:43 > 20211130114351703.commit.requested > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:43 > 20211130114351703.inflight > drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 > archived/ > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 11:39 > hoodie.properties > drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 13:17 .aux/ > drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 13:23 .temp/ > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 13:17 > 20211130114103632.replacecommit.requested > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:18 > 20211130131825336.commit > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18 > 20211130131825336.commit.requested > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18 > 20211130131825336.inflight > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23 > 20211130132256488.commit > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22 > 20211130132256488.commit.requested > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22 > 20211130132256488.inflight > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23 > 20211130132327154.commit > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23 > 20211130132327154.commit.requested > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23 > 20211130132327154.inflight > drwxr-xr-x 6 yuezhang FREEWHEELMEDIA\Domain Users 192 11 30 13:23 > archived/ > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 13:17 > hoodie.properties > 20211130114122881.commit 20211130114207164.commit and > 20211130114351703.commit were archived. > Step 4 > Do query to check record numbers and based hudi data files. > val frame = spark.sql("select count(*) from hudi_test").show(10000, false) > => > +--------+ > |count(1)| > +--------+ > |4217794 | > +--------+ > val frame = spark.sql("select distinct(_hoodie_file_name) from > hudi_test").show(10000, false) > => > +----------------------------------------------------------------------+ > |_hoodie_file_name | > +----------------------------------------------------------------------+ > |caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet| > |12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet| > |ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet| > |73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet| > |a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet| > |7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet| > |823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet| > |00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet| > |a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet| > |eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet| > |b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet| > |d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet| > |9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet| > |4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet| > +----------------------------------------------------------------------+ > Step 5 > Stop insert and trigger that pending clustering replace request > drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 13:17 .aux/ > drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 13:27 .temp/ > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 4736 11 30 13:27 > 20211130114103632.replacecommit > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:27 > 20211130114103632.replacecommit.inflight > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 13:17 > 20211130114103632.replacecommit.requested > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:18 > 20211130131825336.commit > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18 > 20211130131825336.commit.requested > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18 > 20211130131825336.inflight > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23 > 20211130132256488.commit > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22 > 20211130132256488.commit.requested > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22 > 20211130132256488.inflight > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23 > 20211130132327154.commit > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23 > 20211130132327154.commit.requested > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23 > 20211130132327154.inflight > drwxr-xr-x 6 yuezhang FREEWHEELMEDIA\Domain Users 192 11 30 13:23 > archived/ > -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 13:17 > hoodie.properties > Step 6 > Do the same queries to check record numbers and based hudi data files. > val frame = spark.sql("select count(*) from hudi_test").show(10000, false) > => > +--------+ > |count(1)| > +--------+ > |2410168 | > +--------+ > val frame = spark.sql("select distinct(_hoodie_file_name) from > hudi_test").show(10000, false) > => > +----------------------------------------------------------------------+ > |_hoodie_file_name | > +----------------------------------------------------------------------+ > |caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet| > |ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet| > |73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet| > |a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet| > |7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet| > |00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet| > |a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet| > |d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet| > +----------------------------------------------------------------------+ > As we can see, we get different query result compared with before-clustering > and after-clustering. > Also query result from Step 6 is missing records from these base file > mentioned below. > |12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet| > |9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet| > |823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet| > |eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet| > |b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet| > |4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet| > > The root cause of this incomplete query results is that late finished > clustering instant stain this activeTimeLine hoodie get wrong latest base > file according to > https://github.com/apache/hudi/blob/55ecbc662e30068ce0ed49166d254202bd598a8c/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileGroup.java#L120 > To fix this bug, we need let pending clustering instant to block archive > action like pending compaction did. > P.S. > Each ingestion will insert 602,542 records. > 20211130114103632.replacecommit > { > "partitionToWriteStats" : { > "20210623" : [ { > "fileId" : "9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0", > "path" : > "20210623/9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0_0-1-2_20211130114103632.parquet", > "prevCommit" : "null", > "numWrites" : 602542, > "numDeletes" : 0, > "numUpdateWrites" : 0, > "numInserts" : 602542, > "totalWriteBytes" : 17645296, > "totalWriteErrors" : 0, > "tempPath" : null, > "partitionPath" : "20210623", > "totalLogRecords" : 0, > "totalLogFilesCompacted" : 0, > "totalLogSizeCompacted" : 0, > "totalUpdatedRecordsCompacted" : 0, > "totalLogBlocks" : 0, > "totalCorruptLogBlock" : 0, > "totalRollbackBlocks" : 0, > "fileSizeInBytes" : 17645296, > "minEventTime" : null, > "maxEventTime" : null > } ] > }, > "compacted" : false, > "extraMetadata" : { > "schema" : "xxxxx" > }, > "operationType" : "CLUSTER", > "partitionToReplaceFileIds" : { > "20210623" : [ "ac474457-c656-4fff-ac07-7ddd1746f4cf-0", > "a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0" ] > }, > "fileIdAndRelativePaths" : { > "9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0" : > "20210623/9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0_0-1-2_20211130114103632.parquet" > }, > "totalRecordsDeleted" : 0, > "totalLogRecordsCompacted" : 0, > "totalLogFilesCompacted" : 0, > "totalCompactedRecordsUpdated" : 0, > "totalLogFilesSize" : 0, > "totalScanTime" : 0, > "totalCreateTime" : 11053, > "totalUpsertTime" : 0, > "minAndMaxEventTime" : { > "Optional.empty" : { > "val" : null, > "present" : false > } > }, > "writePartitionPaths" : [ "20210623" ] > } > > > > -- This message was sent by Atlassian Jira (v8.20.1#820001)