Yue Zhang created HUDI-2892: ------------------------------- Summary: Pending Clustering may stain the ActiveTimeLine and lead to incomplete query results Key: HUDI-2892 URL: https://issues.apache.org/jira/browse/HUDI-2892 Project: Apache Hudi Issue Type: Bug Reporter: Yue Zhang
Step 1 Do a normal hudi insert drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 11:39 .aux/ drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 .temp/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:39 20211130113918979.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.inflight drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 archived/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 11:39 hoodie.properties Step 2 Build a clustering plan but don't execute this plan 20211130114103632.replacecommit.requested will cluster data files from 20211130113918979.commit drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 11:39 .aux/ drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 .temp/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:39 20211130113918979.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 11:41 20211130114103632.replacecommit.requested drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 archived/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 11:39 hoodie.properties Step 3 Do a few times hudi insert and trigger several archivals drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 11:39 .aux/ drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:44 .temp/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:39 20211130113918979.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 11:41 20211130114103632.replacecommit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:41 20211130114122881.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:41 20211130114122881.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:41 20211130114122881.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:42 20211130114207164.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:42 20211130114207164.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:42 20211130114207164.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:44 20211130114351703.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:43 20211130114351703.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:43 20211130114351703.inflight drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 archived/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 11:39 hoodie.properties drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 13:17 .aux/ drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 13:23 .temp/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 13:17 20211130114103632.replacecommit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:18 20211130131825336.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18 20211130131825336.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18 20211130131825336.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23 20211130132256488.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22 20211130132256488.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22 20211130132256488.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23 20211130132327154.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23 20211130132327154.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23 20211130132327154.inflight drwxr-xr-x 6 yuezhang FREEWHEELMEDIA\Domain Users 192 11 30 13:23 archived/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 13:17 hoodie.properties 20211130114122881.commit 20211130114207164.commit and 20211130114351703.commit were archived. Step 4 Do query to check record numbers and based hudi data files. val frame = spark.sql("select count(*) from hudi_test").show(10000, false) => +--------+ |count(1)| +--------+ |4217794 | +--------+ val frame = spark.sql("select distinct(_hoodie_file_name) from hudi_test").show(10000, false) => +----------------------------------------------------------------------+ |_hoodie_file_name | +----------------------------------------------------------------------+ |caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet| |12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet| |ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet| |73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet| |a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet| |7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet| |823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet| |00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet| |a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet| |eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet| |b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet| |d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet| |9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet| |4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet| +----------------------------------------------------------------------+ Step 5 Stop insert and trigger that pending clustering replace request drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 13:17 .aux/ drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 13:27 .temp/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 4736 11 30 13:27 20211130114103632.replacecommit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:27 20211130114103632.replacecommit.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 13:17 20211130114103632.replacecommit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:18 20211130131825336.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18 20211130131825336.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18 20211130131825336.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23 20211130132256488.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22 20211130132256488.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22 20211130132256488.inflight -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23 20211130132327154.commit -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23 20211130132327154.commit.requested -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23 20211130132327154.inflight drwxr-xr-x 6 yuezhang FREEWHEELMEDIA\Domain Users 192 11 30 13:23 archived/ -rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 13:17 hoodie.properties Step 6 Do the same queries to check record numbers and based hudi data files. val frame = spark.sql("select count(*) from hudi_test").show(10000, false) => +--------+ |count(1)| +--------+ |2410168 | +--------+ val frame = spark.sql("select distinct(_hoodie_file_name) from hudi_test").show(10000, false) => +----------------------------------------------------------------------+ |_hoodie_file_name | +----------------------------------------------------------------------+ |caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet| |ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet| |73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet| |a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet| |7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet| |00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet| |a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet| |d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet| +----------------------------------------------------------------------+ As we can see, we get different query result compared with before-clustering and after-clustering. Also query result from Step 6 is missing records from these base file mentioned below. |12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet| |9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet| |823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet| |eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet| |b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet| |4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet| The root cause of this incomplete query results is that late finished clustering instant stain this activeTimeLine hoodie get wrong latest base file according to https://github.com/apache/hudi/blob/55ecbc662e30068ce0ed49166d254202bd598a8c/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileGroup.java#L120 To fix this bug, we need let pending clustering instant to block archive action like pending compaction did. P.S. Each ingestion will insert 602,542 records. 20211130114103632.replacecommit { "partitionToWriteStats" : { "20210623" : [ { "fileId" : "9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0", "path" : "20210623/9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0_0-1-2_20211130114103632.parquet", "prevCommit" : "null", "numWrites" : 602542, "numDeletes" : 0, "numUpdateWrites" : 0, "numInserts" : 602542, "totalWriteBytes" : 17645296, "totalWriteErrors" : 0, "tempPath" : null, "partitionPath" : "20210623", "totalLogRecords" : 0, "totalLogFilesCompacted" : 0, "totalLogSizeCompacted" : 0, "totalUpdatedRecordsCompacted" : 0, "totalLogBlocks" : 0, "totalCorruptLogBlock" : 0, "totalRollbackBlocks" : 0, "fileSizeInBytes" : 17645296, "minEventTime" : null, "maxEventTime" : null } ] }, "compacted" : false, "extraMetadata" : { "schema" : "xxxxx" }, "operationType" : "CLUSTER", "partitionToReplaceFileIds" : { "20210623" : [ "ac474457-c656-4fff-ac07-7ddd1746f4cf-0", "a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0" ] }, "fileIdAndRelativePaths" : { "9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0" : "20210623/9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0_0-1-2_20211130114103632.parquet" }, "totalRecordsDeleted" : 0, "totalLogRecordsCompacted" : 0, "totalLogFilesCompacted" : 0, "totalCompactedRecordsUpdated" : 0, "totalLogFilesSize" : 0, "totalScanTime" : 0, "totalCreateTime" : 11053, "totalUpsertTime" : 0, "minAndMaxEventTime" : { "Optional.empty" : { "val" : null, "present" : false } }, "writePartitionPaths" : [ "20210623" ] } -- This message was sent by Atlassian Jira (v8.20.1#820001)