[ https://issues.apache.org/jira/browse/HUDI-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Udit Mehrotra updated HUDI-2144:
--------------------------------
    Status: In Progress  (was: Open)

> Offline clustering(independent sparkJob) will cause insert action losing data
> -----------------------------------------------------------------------------
>
>                 Key: HUDI-2144
>                 URL: https://issues.apache.org/jira/browse/HUDI-2144
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Yue Zhang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2021-07-08-13-52-00-089.png
>
>
> We currently run two kinds of Spark pipelines on Hudi:
> # Streaming inserts of data into specific partitions
> # An offline clustering Spark job
> (`org.apache.hudi.utilities.HoodieClusteringJob`) that optimizes the size of
> the files created by pipeline 1
>
> We hit a bug in this setup that loses data.
> The following steps reproduce the problem reliably:
> # Submit a Spark job to ingest data1 using insert mode.
> # Schedule a clustering plan using
> `org.apache.hudi.utilities.HoodieClusteringJob`.
> # Submit a Spark job again to ingest data2 using insert mode. (Ensure that a
> new file slice is created in the same file group, which means small-file
> tuning for insert is working.) Call this file group "file group 1" and the
> new file slice "file slice 2".
> # Execute the clustering plan that was scheduled in step 2.
> # Query data1+data2. You will find that new data is lost compared with
> normal ingestion without clustering.
>
> !image-2021-07-08-13-52-00-089.png|width=922,height=728!
> Here is the root cause:
> When ingesting data in insert mode, Hudi finds small files and tries to
> append new data to them in order to tune data file sizes.
> [https://github.com/apache/hudi/blob/650c4455c600b0346fed8b5b6aa4cc0bf3452e8c/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L149]
> tries to filter out small files that are in clustering, but it only takes
> effect when the user sets `hoodie.clustering.inline` to true, which is not
> enough for users running offline clustering.
> I have just raised a PR that tries to fix this and tested it.
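
The root cause described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not Hudi's actual code: `SmallFile`, `small_file_candidates`, and the `always_filter` flag are hypothetical names introduced here for illustration, and the "fixed" branch shows one plausible direction (always excluding file groups that are in a pending clustering plan) rather than what the PR necessarily does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmallFile:
    # Hypothetical stand-in for a small-file candidate; not a Hudi class.
    file_group_id: str
    size_bytes: int


def small_file_candidates(small_files, pending_clustering_ids,
                          inline_clustering, always_filter):
    """Return the small files an insert is allowed to append to.

    always_filter=False models the behaviour described in the report:
    file groups in a pending clustering plan are excluded only when
    inline clustering is enabled. always_filter=True models an assumed
    fix: exclude them no matter how clustering is executed.
    """
    if inline_clustering or always_filter:
        return [f for f in small_files
                if f.file_group_id not in pending_clustering_ids]
    return list(small_files)


small_files = [SmallFile("file-group-1", 10_000),
               SmallFile("file-group-2", 20_000)]
# File group 1 is part of the plan scheduled by the offline job (step 2).
pending = {"file-group-1"}

# Offline clustering means hoodie.clustering.inline is false.
reported = small_file_candidates(small_files, pending,
                                 inline_clustering=False, always_filter=False)
fixed = small_file_candidates(small_files, pending,
                              inline_clustering=False, always_filter=True)

print([f.file_group_id for f in reported])  # ['file-group-1', 'file-group-2']
print([f.file_group_id for f in fixed])     # ['file-group-2']
```

In the reported behaviour, file group 1 is still offered to the insert in step 3 even though a clustering plan already covers it, which matches the reproduction steps: data2 lands in file slice 2 of a file group that the pending plan is about to rewrite.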