
Yue Zhang updated HUDI-2892:

Step 1
Do a normal Hudi insert.

drwxr-xr-x   3 yuezhang  FREEWHEELMEDIA\Domain Users    96 11 30 11:39 .aux/
drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:39 .temp/
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:39 20211130113918979.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 20211130113918979.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 20211130113918979.inflight
drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:39 archived/
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users   553 11 30 11:39 hoodie.properties

Step 2
Build a clustering plan, but do not execute it.
20211130114103632.replacecommit.requested will cluster the data files from 20211130113918979.commit.

drwxr-xr-x   3 yuezhang  FREEWHEELMEDIA\Domain Users    96 11 30 11:39 .aux/
drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:39 .temp/
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:39 20211130113918979.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 20211130113918979.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 20211130113918979.inflight
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  2976 11 30 11:41 20211130114103632.replacecommit.requested
drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:39 archived/
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users   553 11 30 11:39 hoodie.properties
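
The timeline file names in these listings encode an instant time, an action, and a state. The following is a minimal Scala sketch of that naming scheme, inferred only from the file names quoted in this report. The `Instant` case class and the `parseInstant` helper are hypothetical names, not Hudi APIs. It shows how to tell a pending instant from a completed one.

```scala
// Hypothetical model of one timeline file; this is not a Hudi class.
case class Instant(time: String, action: String, state: String)

// Parse names such as "20211130114103632.replacecommit.requested".
// Assumptions taken from the listings in this report only:
//   "<time>.<action>"            -> completed
//   "<time>.<action>.requested"  -> requested
//   "<time>.<action>.inflight"   -> inflight
//   "<time>.inflight"            -> inflight commit (no action in the name)
def parseInstant(fileName: String): Option[Instant] =
  fileName.split('.').toList match {
    case time :: rest if time.nonEmpty && time.forall(_.isDigit) =>
      rest match {
        case "inflight" :: Nil            => Some(Instant(time, "commit", "INFLIGHT"))
        case action :: Nil                => Some(Instant(time, action, "COMPLETED"))
        case action :: "requested" :: Nil => Some(Instant(time, action, "REQUESTED"))
        case action :: "inflight" :: Nil  => Some(Instant(time, action, "INFLIGHT"))
        case _                            => None
      }
    case _ => None // e.g. "hoodie.properties", ".aux/", "archived/"
  }

// File names as they appear after Step 2.
val step2Files = List(
  "20211130113918979.commit",
  "20211130113918979.commit.requested",
  "20211130113918979.inflight",
  "20211130114103632.replacecommit.requested",
  "hoodie.properties"
)

val instants  = step2Files.flatMap(parseInstant)
val completed = instants.filter(_.state == "COMPLETED").map(_.time).toSet
// An instant time with no completed file is still pending.
val pending   = instants.map(_.time).toSet -- completed

println(pending) // Set(20211130114103632)
```

After Step 2 the only pending instant is the clustering plan, which is the state the remaining steps depend on.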

Step 3
Run several more Hudi inserts so that archival is triggered several times.

drwxr-xr-x   3 yuezhang  FREEWHEELMEDIA\Domain Users    96 11 30 11:39 .aux/
drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:44 .temp/
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:39 20211130113918979.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 20211130113918979.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 20211130113918979.inflight
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  2976 11 30 11:41 20211130114103632.replacecommit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:41 20211130114122881.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:41 20211130114122881.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:41 20211130114122881.inflight
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:42 20211130114207164.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:42 20211130114207164.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:42 20211130114207164.inflight
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:44 20211130114351703.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:43 20211130114351703.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:43 20211130114351703.inflight
drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:39 archived/
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users   553 11 30 11:39 hoodie.properties

drwxr-xr-x   3 yuezhang  FREEWHEELMEDIA\Domain Users    96 11 30 13:17 .aux/
drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 13:23 .temp/
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  2976 11 30 13:17 20211130114103632.replacecommit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:18 20211130131825336.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:18 20211130131825336.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:18 20211130131825336.inflight
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:23 20211130132256488.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:22 20211130132256488.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:22 20211130132256488.inflight
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:23 20211130132327154.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:23 20211130132327154.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:23 20211130132327154.inflight
drwxr-xr-x   6 yuezhang  FREEWHEELMEDIA\Domain Users   192 11 30 13:23 archived/
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users   553 11 30 13:17 hoodie.properties

20211130114122881.commit, 20211130114207164.commit, and 20211130114351703.commit
were archived.

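Step 4
Run queries to check the record count and the Hudi base files that back the table.
val frame = spark.sql("select count(*) from hudi_test").show(10000, false)
=>
+--------+
|count(1)|
+--------+
|4217794 |
+--------+

val frame = spark.sql("select distinct(_hoodie_file_name) from hudi_test").show(10000, false)
=>
+----------------------------------------------------------------------+
|_hoodie_file_name                                                     |
+----------------------------------------------------------------------+
|caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet|
|12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet|
|ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet|
|73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet|
|a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet|
|7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet|
|823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet|
|00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet|
|a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet|
|eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet|
|b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet|
|d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet|
|9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet|
|4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet|
+----------------------------------------------------------------------+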

Step 5
Stop the inserts and execute the pending clustering plan (the replacecommit requested in Step 2).

drwxr-xr-x   3 yuezhang  FREEWHEELMEDIA\Domain Users    96 11 30 13:17 .aux/
drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 13:27 .temp/
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  4736 11 30 13:27 20211130114103632.replacecommit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:27 20211130114103632.replacecommit.inflight
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  2976 11 30 13:17 20211130114103632.replacecommit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:18 20211130131825336.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:18 20211130131825336.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:18 20211130131825336.inflight
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:23 20211130132256488.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:22 20211130132256488.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:22 20211130132256488.inflight
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:23 20211130132327154.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:23 20211130132327154.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:23 20211130132327154.inflight
drwxr-xr-x   6 yuezhang  FREEWHEELMEDIA\Domain Users   192 11 30 13:23 archived/
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users   553 11 30 13:17 hoodie.properties

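Step 6
Run the same queries again to check the record count and the Hudi base files that back the table.

val frame = spark.sql("select count(*) from hudi_test").show(10000, false)
=>
+--------+
|count(1)|
+--------+
|2410168 |
+--------+

val frame = spark.sql("select distinct(_hoodie_file_name) from hudi_test").show(10000, false)
=>
+----------------------------------------------------------------------+
|_hoodie_file_name                                                     |
+----------------------------------------------------------------------+
|caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet|
|ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet|
|73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet|
|a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet|
|7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet|
|00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet|
|a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet|
|d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet|
+----------------------------------------------------------------------+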
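
The two counts can be cross-checked against the 602,542 records that, according to the P.S. of this report, each ingestion inserts. The Step 4 file list covers seven distinct commit times and the Step 6 list covers four. This is a small Scala check that uses only numbers from this report; the variable names are illustrative.

```scala
// Records written by each ingestion, as stated in the report.
val recordsPerIngestion = 602542L

// Distinct commit times among the base files returned by each query.
val commitsVisibleStep4 = 7
val commitsVisibleStep6 = 4

val expectedStep4 = recordsPerIngestion * commitsVisibleStep4
val expectedStep6 = recordsPerIngestion * commitsVisibleStep6
val lostRecords   = expectedStep4 - expectedStep6

println(expectedStep4) // 4217794, the Step 4 count
println(expectedStep6) // 2410168, the Step 6 count
println(lostRecords)   // 1807626, i.e. three ingestions' worth of records
```

The difference is exactly three ingestions, which matches the three commits reported as archived in Step 3.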

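As we can see, the query results before and after clustering differ.
The Step 6 result is also missing the records from the base files listed below.
|12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet|
|9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet|
|823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet|
|eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet|
|b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet|
|4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet|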
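
The missing base files can be derived mechanically by diffing the two `_hoodie_file_name` result sets. The sketch below is plain Scala, with the file names copied from the Step 4 and Step 6 outputs quoted in this report. The `commitTime` helper is a hypothetical name, and it assumes that the commit time is the last underscore-separated token of the file name.

```scala
// Base files returned by the Step 4 query (before the clustering completed).
val beforeClustering = Set(
  "caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet",
  "12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet",
  "ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet",
  "73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet",
  "a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet",
  "7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet",
  "823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet",
  "00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet",
  "a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet",
  "eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet",
  "b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet",
  "d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet",
  "9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet",
  "4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet"
)

// Base files returned by the Step 6 query (after the clustering completed).
val afterClustering = Set(
  "caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet",
  "ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet",
  "73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet",
  "a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet",
  "7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet",
  "00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet",
  "a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet",
  "d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet"
)

// Assumption: "<fileId>_<writeToken>_<commitTime>.parquet".
def commitTime(fileName: String): String =
  fileName.stripSuffix(".parquet").split('_').last

val missingFiles   = beforeClustering -- afterClustering
val missingCommits = missingFiles.map(commitTime).toList.sorted

println(missingFiles.size) // 6
println(missingCommits)    // List(20211130114122881, 20211130114207164, 20211130114351703)
```

Every missing file was written by one of the three commits that Step 3 reports as archived.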





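The root cause of the incomplete query results is that the late-finished
clustering instant stains the active timeline, so Hudi picks the wrong latest
base files according to
https://github.com/apache/hudi/blob/55ecbc662e30068ce0ed49166d254202bd598a8c/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileGroup.java#L120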
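
One way to read the check in HoodieFileGroup that this report links to is sketched below. It is a simplification and is not taken verbatim from Hudi's code. It assumes that a base file counts as committed when its instant time is either present in the completed active timeline or earlier than the first instant of that timeline. `isVisible` is a hypothetical helper, and the instant times come from the listings in this report.

```scala
// Assumption: instant times are fixed-width digit strings, so comparing them
// as strings is the same as comparing them as timestamps.
def isVisible(baseInstant: String, completedTimeline: List[String]): Boolean =
  completedTimeline.contains(baseInstant) ||
    completedTimeline.sorted.headOption.exists(first => baseInstant < first)

// Commit times of all base files written by the seven ingestions.
val baseInstants = List(
  "20211130113918979", "20211130114122881", "20211130114207164", "20211130114351703",
  "20211130131825336", "20211130132256488", "20211130132327154"
)

// Step 4: the clustering instant is still pending, so the completed timeline
// starts at 20211130131825336 and every archived commit lies before its start.
val timelineStep4 = List("20211130131825336", "20211130132256488", "20211130132327154")

// Step 6: the clustering instant 20211130114103632 has completed and is now
// the earliest instant in the active timeline.
val timelineStep6 = "20211130114103632" :: timelineStep4

val visibleStep4 = baseInstants.filter(isVisible(_, timelineStep4))
val hiddenStep6  = baseInstants.filterNot(isVisible(_, timelineStep6))

println(visibleStep4.size) // 7
println(hiddenStep6)       // List(20211130114122881, 20211130114207164, 20211130114351703)
```

Under this assumption the three archived commits are neither in the timeline nor before its start once the old clustering instant completes, which is consistent with the files missing in Step 6.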

To fix this bug, a pending clustering instant needs to block archival,
as a pending compaction already does.
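
The proposed guard can be illustrated as follows: archival only considers completed instants that are older than the earliest pending instant. This is a sketch of the idea in plain Scala, not Hudi's archival code. `archivable`, `keepLatest`, and the other names are hypothetical, and the retention rule (keep the newest N completed instants) is a simplifying assumption.

```scala
// Return the completed instants that may be archived.
// completed:  completed instant times in the active timeline
// pending:    pending compaction or clustering instant times
// keepLatest: number of newest completed instants that always stay active
def archivable(completed: List[String], pending: List[String], keepLatest: Int): List[String] = {
  val candidates = completed.sorted.dropRight(keepLatest)
  pending.sorted.headOption match {
    // Never archive anything at or after the earliest pending instant.
    case Some(earliestPending) => candidates.filter(_ < earliestPending)
    case None                  => candidates
  }
}

// Completed commits from this report, with the clustering plan still pending.
val completedCommits = List(
  "20211130113918979", "20211130114122881", "20211130114207164", "20211130114351703",
  "20211130131825336", "20211130132256488", "20211130132327154"
)
val pendingClustering = List("20211130114103632")

val withoutGuard = archivable(completedCommits, Nil, keepLatest = 3)
val withGuard    = archivable(completedCommits, pendingClustering, keepLatest = 3)

println(withoutGuard.size) // 4
println(withGuard)         // List(20211130113918979)
```

With the guard, the three commits written after the clustering plan was scheduled stay in the active timeline until the plan completes.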

P.S.
Each ingestion inserts 602,542 records.

20211130114103632.replacecommit
{
  "partitionToWriteStats" : {
    "20210623" : [ {
      "fileId" : "9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0",
      "path" : "20210623/9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0_0-1-2_20211130114103632.parquet",
      "prevCommit" : "null",
      "numWrites" : 602542,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 602542,
      "totalWriteBytes" : 17645296,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "20210623",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 17645296,
      "minEventTime" : null,
      "maxEventTime" : null
    } ]
  },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : "xxxxx"
  },
  "operationType" : "CLUSTER",
  "partitionToReplaceFileIds" : {
    "20210623" : [ "ac474457-c656-4fff-ac07-7ddd1746f4cf-0", "a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0" ]
  },
  "fileIdAndRelativePaths" : {
    "9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0" : "20210623/9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0_0-1-2_20211130114103632.parquet"
  },
  "totalRecordsDeleted" : 0,
  "totalLogRecordsCompacted" : 0,
  "totalLogFilesCompacted" : 0,
  "totalCompactedRecordsUpdated" : 0,
  "totalLogFilesSize" : 0,
  "totalScanTime" : 0,
  "totalCreateTime" : 11053,
  "totalUpsertTime" : 0,
  "minAndMaxEventTime" : {
    "Optional.empty" : {
      "val" : null,
      "present" : false
    }
  },
  "writePartitionPaths" : [ "20210623" ]
}





**Describe the problem you faced**
If a pending clustering instant still exists in the active timeline after
several archival runs, then completing that instant later may stain the
active timeline and lead to incomplete query results.

**Environment Description**

* Hudi version : master

* Spark version : 2.4.4



> Pending Clustering may stain the ActiveTimeLine and lead to incomplete query 
> results
> ------------------------------------------------------------------------------------
>                 Key: HUDI-2892
>                 URL: https://issues.apache.org/jira/browse/HUDI-2892
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Yue Zhang
>            Priority: Major
> Step 1 
> Do a normal hudi insert 
> drwxr-xr-x   3 yuezhang  FREEWHEELMEDIA\Domain Users    96 11 30 11:39 .aux/
> drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:39 .temp/
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:39 
> 20211130113918979.commit
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 
> 20211130113918979.commit.requested
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 
> 20211130113918979.inflight
> drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:39 
> archived/
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users   553 11 30 11:39 
> hoodie.properties
> Step 2 
> Build a clustering plan but don't execute this plan
> 20211130114103632.replacecommit.requested will cluster data files from 
> 20211130113918979.commit
> drwxr-xr-x   3 yuezhang  FREEWHEELMEDIA\Domain Users    96 11 30 11:39 .aux/
> drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:39 .temp/
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:39 
> 20211130113918979.commit
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 
> 20211130113918979.commit.requested
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 
> 20211130113918979.inflight
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  2976 11 30 11:41 
> 20211130114103632.replacecommit.requested
> drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:39 
> archived/
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users   553 11 30 11:39 
> hoodie.properties
> Step 3 
> Do a few times hudi insert and trigger several archivals
> drwxr-xr-x   3 yuezhang  FREEWHEELMEDIA\Domain Users    96 11 30 11:39 .aux/
> drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:44 .temp/
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:39 
> 20211130113918979.commit
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 
> 20211130113918979.commit.requested
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 
> 20211130113918979.inflight
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  2976 11 30 11:41 
> 20211130114103632.replacecommit.requested
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:41 
> 20211130114122881.commit
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:41 
> 20211130114122881.commit.requested
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:41 
> 20211130114122881.inflight
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:42 
> 20211130114207164.commit
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:42 
> 20211130114207164.commit.requested
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:42 
> 20211130114207164.inflight
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:44 
> 20211130114351703.commit
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:43 
> 20211130114351703.commit.requested
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:43 
> 20211130114351703.inflight
> drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:39 
> archived/
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users   553 11 30 11:39 
> hoodie.properties
> drwxr-xr-x   3 yuezhang  FREEWHEELMEDIA\Domain Users    96 11 30 13:17 .aux/
> drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 13:23 .temp/
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  2976 11 30 13:17 
> 20211130114103632.replacecommit.requested
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:18 
> 20211130131825336.commit
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:18 
> 20211130131825336.commit.requested
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:18 
> 20211130131825336.inflight
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:23 
> 20211130132256488.commit
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:22 
> 20211130132256488.commit.requested
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:22 
> 20211130132256488.inflight
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:23 
> 20211130132327154.commit
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:23 
> 20211130132327154.commit.requested
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:23 
> 20211130132327154.inflight
> drwxr-xr-x   6 yuezhang  FREEWHEELMEDIA\Domain Users   192 11 30 13:23 
> archived/
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users   553 11 30 13:17 
> hoodie.properties
> 20211130114122881.commit 20211130114207164.commit and 
> 20211130114351703.commit were archived.
> Step 4 
> Do query to check record numbers and based hudi data files.
> val frame = spark.sql("select count(*) from hudi_test").show(10000, false)
> => 
> +--------+
> |count(1)|
> +--------+
> |4217794 |
> +--------+
> val frame = spark.sql("select distinct(_hoodie_file_name) from 
> hudi_test").show(10000, false)
> =>
> +----------------------------------------------------------------------+
> |_hoodie_file_name                                                     |
> +----------------------------------------------------------------------+
> |caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet|
> |12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet|
> |ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet|
> |73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet|
> |a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet|
> |7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet|
> |823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet|
> |00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet|
> |a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet|
> |eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet|
> |b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet|
> |d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet|
> |9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet|
> |4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet|
> +----------------------------------------------------------------------+
> Step 5 
> Stop insert and trigger that pending clustering replace request
> drwxr-xr-x   3 yuezhang  FREEWHEELMEDIA\Domain Users    96 11 30 13:17 .aux/
> drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 13:27 .temp/
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  4736 11 30 13:27 
> 20211130114103632.replacecommit
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:27 
> 20211130114103632.replacecommit.inflight
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  2976 11 30 13:17 
> 20211130114103632.replacecommit.requested
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:18 
> 20211130131825336.commit
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:18 
> 20211130131825336.commit.requested
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:18 
> 20211130131825336.inflight
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:23 
> 20211130132256488.commit
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:22 
> 20211130132256488.commit.requested
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:22 
> 20211130132256488.inflight
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:23 
> 20211130132327154.commit
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:23 
> 20211130132327154.commit.requested
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:23 
> 20211130132327154.inflight
> drwxr-xr-x   6 yuezhang  FREEWHEELMEDIA\Domain Users   192 11 30 13:23 
> archived/
> -rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users   553 11 30 13:17 
> hoodie.properties
> Step 6 
> Do the same queries to check record numbers and based hudi data files.
> val frame = spark.sql("select count(*) from hudi_test").show(10000, false)
> => 
> +--------+
> |count(1)|
> +--------+
> |2410168 |
> +--------+
> val frame = spark.sql("select distinct(_hoodie_file_name) from 
> hudi_test").show(10000, false)
> =>
> +----------------------------------------------------------------------+
> |_hoodie_file_name                                                     |
> +----------------------------------------------------------------------+
> |caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet|
> |ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet|
> |73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet|
> |a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet|
> |7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet|
> |00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet|
> |a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet|
> |d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet|
> +----------------------------------------------------------------------+
> As we can see, we get different query result compared with before-clustering 
> and after-clustering.
> Also query result from Step 6 is missing records from these base file 
> mentioned below.
> |12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet|
> |9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet|
> |823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet|
> |eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet|
> |b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet|
> |4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet|
> The root cause of this incomplete query results is that late finished 
> clustering instant stain this activeTimeLine hoodie get wrong latest base 
> file according to 
> https://github.com/apache/hudi/blob/55ecbc662e30068ce0ed49166d254202bd598a8c/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileGroup.java#L120
> To fix this bug, we need let pending clustering instant to block archive 
> action like pending compaction did.
> P.S.
> Each ingestion will insert 602,542 records.
> 20211130114103632.replacecommit
> {
>   "partitionToWriteStats" : {
>     "20210623" : [ {
>       "fileId" : "9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0",
>       "path" : 
> "20210623/9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0_0-1-2_20211130114103632.parquet",
>       "prevCommit" : "null",
>       "numWrites" : 602542,
>       "numDeletes" : 0,
>       "numUpdateWrites" : 0,
>       "numInserts" : 602542,
>       "totalWriteBytes" : 17645296,
>       "totalWriteErrors" : 0,
>       "tempPath" : null,
>       "partitionPath" : "20210623",
>       "totalLogRecords" : 0,
>       "totalLogFilesCompacted" : 0,
>       "totalLogSizeCompacted" : 0,
>       "totalUpdatedRecordsCompacted" : 0,
>       "totalLogBlocks" : 0,
>       "totalCorruptLogBlock" : 0,
>       "totalRollbackBlocks" : 0,
>       "fileSizeInBytes" : 17645296,
>       "minEventTime" : null,
>       "maxEventTime" : null
>     } ]
>   },
>   "compacted" : false,
>   "extraMetadata" : {
>     "schema" : "xxxxx"
>   },
>   "operationType" : "CLUSTER",
>   "partitionToReplaceFileIds" : {
>     "20210623" : [ "ac474457-c656-4fff-ac07-7ddd1746f4cf-0", 
> "a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0" ]
>   },
>   "fileIdAndRelativePaths" : {
>     "9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0" : 
> "20210623/9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0_0-1-2_20211130114103632.parquet"
>   },
>   "totalRecordsDeleted" : 0,
>   "totalLogRecordsCompacted" : 0,
>   "totalLogFilesCompacted" : 0,
>   "totalCompactedRecordsUpdated" : 0,
>   "totalLogFilesSize" : 0,
>   "totalScanTime" : 0,
>   "totalCreateTime" : 11053,
>   "totalUpsertTime" : 0,
>   "minAndMaxEventTime" : {
>     "Optional.empty" : {
>       "val" : null,
>       "present" : false
>     }
>   },
>   "writePartitionPaths" : [ "20210623" ]
> }

This message was sent by Atlassian Jira
