[jira] [Updated] (HUDI-2144) Offline clustering(independent sparkJob) will cause insert action losing data

Udit Mehrotra (Jira) Fri, 30 Jul 2021 18:19:07 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Udit Mehrotra updated HUDI-2144:
--------------------------------
    Status: In Progress  (was: Open)

> Offline clustering(independent sparkJob) will cause insert action losing data
> -----------------------------------------------------------------------------
>
>                 Key: HUDI-2144
>                 URL: https://issues.apache.org/jira/browse/HUDI-2144
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Yue Zhang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2021-07-08-13-52-00-089.png
>
>
> We currently run two kinds of Spark pipelines on Hudi:
>  # A streaming job that inserts data into specific partitions.
>  # An offline clustering Spark job 
> (`org.apache.hudi.utilities.HoodieClusteringJob`) that optimizes the sizes of 
> the files pipeline 1 created.
> We hit a bug in this setup that loses data.
> The following steps reproduce the problem reliably:
>  # Submit a Spark job to ingest data1 in insert mode.
>  # Schedule a clustering plan using 
> `org.apache.hudi.utilities.HoodieClusteringJob`.
>  # Submit another Spark job to ingest data2 in insert mode. Make sure a new 
> file slice is created in the same file group, which means small-file tuning 
> for insert is working. Call this file group "file group 1" and the new file 
> slice "file slice 2".
>  # Execute the clustering plan scheduled in step 2.
>  # Query data1+data2. The new data is lost compared with normal ingestion 
> without clustering.
>  
>   !image-2021-07-08-13-52-00-089.png|width=922,height=728!
> Here is the root cause:
> When ingesting data in insert mode, Hudi looks for small files and tries to 
> append new data to them, aiming to tune data file sizes.
> The code at 
> [https://github.com/apache/hudi/blob/650c4455c600b0346fed8b5b6aa4cc0bf3452e8c/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L149]
> tries to filter out small files that are under clustering, but it only takes 
> effect when the user sets `hoodie.clustering.inline` to true, which is not 
> enough for users running offline clustering.
> I have raised a PR that attempts to fix this, and I have tested it.
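
The reproduction steps and the root cause can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not Hudi code: `ToyTable`, `FileGroup`, `reproduce` and the `always_filter` flag are hypothetical names invented for illustration, every file is treated as "small", and the model assumes that a clustering plan records the file slices present when it is scheduled and that executing it replaces the whole file group. The `always_filter` flag stands for one plausible fix (excluding file groups with a pending clustering plan regardless of `hoodie.clustering.inline`); it is not necessarily what the PR does.

```python
from dataclasses import dataclass, field


@dataclass
class FileGroup:
    """Toy stand-in for a file group: an ordered list of file slices."""
    group_id: str
    slices: list = field(default_factory=list)  # each slice is a list of records

    def latest(self):
        return self.slices[-1] if self.slices else []


class ToyTable:
    """Hypothetical in-memory model; none of these names are Hudi APIs."""

    def __init__(self, inline_clustering=False, always_filter=False):
        self.inline_clustering = inline_clustering  # models hoodie.clustering.inline
        self.always_filter = always_filter          # models one plausible fix
        self.groups = {}
        self.pending_plan = None  # {group_id: slice index seen at schedule time}
        self.next_id = 1

    def _new_group(self, records):
        gid = f"fg{self.next_id}"
        self.next_id += 1
        self.groups[gid] = FileGroup(gid, [list(records)])

    def _small_file_candidates(self):
        # Reported behaviour: groups under pending clustering are excluded
        # only when inline clustering is enabled.
        exclude = set()
        if self.pending_plan and (self.inline_clustering or self.always_filter):
            exclude = set(self.pending_plan)
        return [g for gid, g in self.groups.items() if gid not in exclude]

    def insert(self, records):
        candidates = self._small_file_candidates()
        if candidates:
            group = candidates[0]
            # New slice in the same group holds the old and the new records.
            group.slices.append(group.latest() + list(records))
        else:
            self._new_group(records)

    def schedule_clustering(self):
        self.pending_plan = {gid: len(g.slices) - 1 for gid, g in self.groups.items()}

    def execute_clustering(self):
        rewritten = []
        for gid, idx in self.pending_plan.items():
            rewritten += self.groups[gid].slices[idx]  # only the planned slice
            del self.groups[gid]                       # whole group is replaced
        self._new_group(rewritten)
        self.pending_plan = None

    def read(self):
        return sorted(r for g in self.groups.values() for r in g.latest())


def reproduce(**kwargs):
    table = ToyTable(**kwargs)
    table.insert(["a1", "a2"])    # step 1: ingest data1
    table.schedule_clustering()   # step 2: schedule the plan
    table.insert(["b1", "b2"])    # step 3: ingest data2
    table.execute_clustering()    # step 4: run the scheduled plan
    return table.read()           # step 5: query
```

In this model, `reproduce()` returns `['a1', 'a2']`, so data2 is gone, while `reproduce(inline_clustering=True)` and `reproduce(always_filter=True)` both return `['a1', 'a2', 'b1', 'b2']`, because the second insert is routed to a fresh file group instead of the one waiting to be clustered.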



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
