Yue Zhang created HUDI-2144:
-------------------------------

             Summary: Offline clustering(independent sparkJob) will cause 
insert action losing data
                 Key: HUDI-2144
                 URL: https://issues.apache.org/jira/browse/HUDI-2144
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Yue Zhang
         Attachments: image-2021-07-08-13-52-00-089.png

We currently run two kinds of Spark pipelines against Hudi:
 # A streaming job that inserts data into specific partitions.
 # An offline clustering Spark job 
(`org.apache.hudi.utilities.HoodieClusteringJob`) that optimizes the size of 
the files created by pipeline 1.

However, we hit a bug in this setup that causes data loss.

The following steps reproduce the problem reliably:
 # Submit a Spark job to ingest data1 in insert mode.
 # Schedule a clustering plan using 
`org.apache.hudi.utilities.HoodieClusteringJob`.
 # Submit another Spark job to ingest data2 in insert mode. Make sure a new 
file slice is created in an existing file group, which means small-file tuning 
for inserts is in effect. Call this file group A and the new file slice a.
 # Execute the clustering plan scheduled in step 2.
 # Query data1+data2. The new data in a is missing compared with the same 
ingestion without clustering.
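
The sequence above can be illustrated with a toy model. This is only a sketch of the behavior described in this report, not Hudi code: `FileGroup`, `schedule_clustering`, `insert_into_small_file` and `execute_clustering` are hypothetical names, and the model assumes that the clustering plan pins the file slice that was latest at scheduling time and that the clustered output replaces the whole file group.

```python
from dataclasses import dataclass, field


@dataclass
class FileGroup:
    """Toy stand-in for a file group: an ordered list of file slices."""
    group_id: str
    slices: list = field(default_factory=list)  # each slice is a list of records

    def latest_records(self):
        # Readers see only the latest file slice of a file group.
        return list(self.slices[-1]) if self.slices else []


def schedule_clustering(group):
    # Assumption: the plan pins the slice that is latest at scheduling time.
    return {"group_id": group.group_id, "slice_index": len(group.slices) - 1}


def insert_into_small_file(group, new_records):
    # Small-file tuning: the new records land in a new slice of the same group.
    group.slices.append(group.latest_records() + list(new_records))


def execute_clustering(group, plan):
    # Assumption: only the planned slice is rewritten, and the result
    # replaces the old file group entirely.
    planned = group.slices[plan["slice_index"]]
    return FileGroup(group_id=group.group_id + "-clustered", slices=[list(planned)])


def reproduce():
    group_a = FileGroup("A")
    group_a.slices.append(["d1-r1", "d1-r2"])      # step 1: ingest data1
    plan = schedule_clustering(group_a)            # step 2: schedule clustering
    insert_into_small_file(group_a, ["d2-r1"])     # step 3: ingest data2 into slice a
    clustered = execute_clustering(group_a, plan)  # step 4: execute the plan
    return clustered.latest_records()              # step 5: query


print(reproduce())  # the data2 record written in step 3 is absent
```

In this model the record written in step 3 exists only in slice a, which the plan from step 2 never saw, so it disappears once the clustered output replaces file group A.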

 

  !image-2021-07-08-13-52-00-089.png!

Here is the root cause:

When ingesting data in insert mode, Hudi looks for small files and tries to 
append the new data to them in order to tune data file sizes.

[https://github.com/apache/hudi/blob/650c4455c600b0346fed8b5b6aa4cc0bf3452e8c/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L149]

The code at this link tries to filter out small files that are in clustering, 
but the filter only takes effect when `hoodie.clustering.inline` is set to 
true, so it does not protect users who run offline clustering.
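
A minimal sketch of the difference, under stated assumptions: the function names and arguments below are hypothetical and do not mirror the real `UpsertPartitioner` code. The second function shows one possible direction for a fix, which may differ from what the PR actually does.

```python
def small_file_candidates(small_file_ids, clustering_file_ids, inline_clustering_enabled):
    """Behavior described in this report: filter only when inline clustering is on."""
    if inline_clustering_enabled:
        in_clustering = set(clustering_file_ids)
        return [f for f in small_file_ids if f not in in_clustering]
    # With offline clustering the flag is false, so nothing is filtered out.
    return list(small_file_ids)


def small_file_candidates_unconditional(small_file_ids, clustering_file_ids):
    """Hypothetical alternative: always skip file groups that are in clustering."""
    in_clustering = set(clustering_file_ids)
    return [f for f in small_file_ids if f not in in_clustering]


# File group A is part of a clustering plan scheduled by the offline job.
print(small_file_candidates(["A", "B"], ["A"], inline_clustering_enabled=False))
print(small_file_candidates_unconditional(["A", "B"], ["A"]))
```

With the flag set to false, file group A stays eligible for new inserts even though a clustering plan already covers it, which is the condition that leads to the lost data.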

I have just raised a PR that attempts to fix this, and I have tested it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
