[ https://issues.apache.org/jira/browse/HUDI-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540154#comment-17540154 ]
Yue Zhang edited comment on HUDI-1575 at 5/20/22 2:58 PM:
----------------------------------------------------------

Eager conflict detection based on markers

For now we have three base HoodieWriteHandle implementations: HoodieCreateHandle, HoodieMergeHandle and HoodieAppendHandle. Each of them creates a new marker file during initialization, before actually writing any data. We can do eager conflict detection before creating the marker file. Details are as follows:

First we need to create a TaskTransactionManager at the task level, where each task holds its own taskLocker. Let us take a ZK lock as an example.

Then we try to lock partitionPath + "/" + fileId on ZK before creating the marker file.

After that we need to do conflict detection:
1. List the `.temp` directory and try to find all marker files whose path has the prefix `partitionPath + "/" + fileId`. (We can optimize the listing here so that we don't need to list the whole directory.)
2. If the list result is not empty, there is a conflict caused by another inflight ingestion job, and we need to fail the current ingestion. Otherwise there is no inflight conflict.
3. We also need to make sure there is no committed conflict, that is, a write that finished before conflict detection (so its marker files are already deleted). We need to reload the active timeline, call getLatestFileSlice/getLatestBaseFile, and compare the result with the original one. If they are not equal, we also fail the current ingestion.

Then create the marker file.

Finally release this file-group-level lock.


was (Author: zhangyue19921010):
Eager conflict detection based on marker file

For now we have three base HoodieWriteHandle implementations: HoodieCreateHandle, HoodieMergeHandle and HoodieAppendHandle. Each of them creates a new marker file during initialization, before actually writing any data. We can do conflict detection during marker file creation. Details are as follows:

First we need to create a new TaskTransactionManager at the task level, where each task holds its own taskLocker. Let us take a ZK lock as an example.
Then we try to lock partitionPath + "/" + fileId on ZK before creating the marker file.

After that we need to do conflict detection:
1. List the .temp directory and try to find all marker files whose path has the prefix partitionPath + "/" + fileId. (We can optimize the listing here so that we don't need to list the whole directory.)
2. If the list result is not empty, there is a conflict caused by another inflight ingestion, and we need to fail the current ingestion job. Otherwise there is no inflight conflict.
3. We also need to make sure there is no committed conflict, that is, a write that finished before conflict detection (so its marker files are deleted). We need to reload the active timeline, call getLatestFileSlice/getLatestBaseFile, and compare the result with the original one. If they are not equal, we also fail the current ingestion.
4. Then create the marker file.
5. Finally release this file-group-level lock.


> Early detection by periodically checking last written commit & active markers
> -----------------------------------------------------------------------------
>
>                 Key: HUDI-1575
>                 URL: https://issues.apache.org/jira/browse/HUDI-1575
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: writer-core
>            Reporter: Nishith Agarwal
>            Assignee: Yue Zhang
>            Priority: Blocker
>             Fix For: 0.12.0
>
>
> Check whether there are newer commits, try to do resolution based on the
> current markers, and abort the currently running job if we already found a
> commit that happened in the meantime, so that it does not keep using up
> resources as a concurrent job. This can give back much of the cluster early
> and dramatically lower costs in the cloud.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
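
The eager conflict detection steps from the edited comment can be sketched as follows. This is a minimal in-memory illustration, not Hudi code: the class `EagerConflictDetectionSketch`, its methods, and the marker naming scheme (`partitionPath/fileId_instant.marker`) are all hypothetical. A `ReentrantLock` per file group stands in for the ZK lock, a set of strings stands in for the `.temp` marker directory, and a map stands in for what reloading the active timeline and calling getLatestFileSlice/getLatestBaseFile would return.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** In-memory sketch of marker-based eager conflict detection (hypothetical, not Hudi code). */
class EagerConflictDetectionSketch {
    // Stand-in for the ZK lock service: one lock per partitionPath + "/" + fileId.
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();
    // Stand-in for the .temp directory: names of markers created by in-flight writers.
    private final Set<String> markers = ConcurrentHashMap.newKeySet();
    // Stand-in for the timeline: file group key -> latest committed instant.
    private final Map<String, String> latestCommitted = new ConcurrentHashMap<>();

    /** Snapshot a writer takes when it starts; "" means nothing committed yet. */
    String latestInstant(String partitionPath, String fileId) {
        return latestCommitted.getOrDefault(partitionPath + "/" + fileId, "");
    }

    /** Returns true if the marker was created, false if a conflict was detected. */
    boolean tryCreateMarker(String partitionPath, String fileId,
                            String writerInstant, String instantSeenAtStart) {
        String key = partitionPath + "/" + fileId;
        ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock(); // lock the file group before touching markers
        try {
            // Steps 1 and 2: an existing marker with this prefix is an in-flight conflict.
            // The "_" separator (an assumption) keeps fileId "f1" from matching "f10".
            boolean inflightConflict =
                markers.stream().anyMatch(m -> m.startsWith(key + "_"));
            if (inflightConflict) {
                return false;
            }
            // Step 3: a commit that landed since this writer started is a committed conflict.
            String current = latestCommitted.getOrDefault(key, "");
            if (!current.equals(instantSeenAtStart)) {
                return false;
            }
            // No conflict found: create the marker while still holding the lock.
            markers.add(key + "_" + writerInstant + ".marker");
            return true;
        } finally {
            lock.unlock(); // always release the file-group-level lock
        }
    }

    /** Simulates a successful commit: record the instant and delete the writer's marker. */
    void commit(String partitionPath, String fileId, String writerInstant) {
        String key = partitionPath + "/" + fileId;
        latestCommitted.put(key, writerInstant);
        markers.remove(key + "_" + writerInstant + ".marker");
    }
}
```

In this sketch a second writer targeting the same file group is rejected while the first writer's marker exists, and it is still rejected after the first writer commits, because the latest committed instant no longer matches what the second writer saw when it started. Holding the lock across both checks and the marker creation is what prevents two writers from passing the checks at the same time.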