voonhous commented on PR #10325: URL: https://github.com/apache/hudi/pull/10325#issuecomment-2084492493
We encountered a similar issue internally. I think a visualization will help users better understand the issue here:

![hus-469](https://github.com/apache/hudi/assets/6312314/dd04b173-500d-4ff0-ae4f-f47911524c09)

# Explanation

Two jobs were effectively running concurrently, one writing to BR and the other writing to CL. BR started and ended right before CL started:

- BR start: 2024-04-16 12:22:42.348
- BR end: 2024-04-16 12:26:13.996
- CL start: 2024-04-16 12:26:20.836
- CL end: 2024-04-16 12:28:02.426

The HoodieSparkCOW table was initialised at around `2024-04-16 12:26:27`, right when BR was performing an archival. The archival was not performed in order, causing holes (non-contiguity) in the timeline while `org.apache.hudi.client.SparkRDDWriteClient#doInitTable` was being performed at `2024-04-16 12:26:27`. The smallest instant `20240414122036275` was only deleted at `2024-04-16 12:26:28.378`. As a result, when the table is SerDe'd to the executors, they are initialised with a metaclient whose timeline contains the same holes/gaps.

When running `#getLatestBaseFilesBeforeOrOn` during tagging and when fetching the smallest base files, the call chain

`org.apache.hudi.common.model.HoodieFileGroup#getAllBaseFiles` -> `org.apache.hudi.common.model.HoodieFileGroup#getAllFileSlices` -> `org.apache.hudi.common.model.HoodieFileGroup#isFileSliceCommitted` -> `org.apache.hudi.common.table.timeline.HoodieDefaultTimeline#containsOrBeforeTimelineStarts`

requires a contiguous timeline to determine whether a file slice is valid:

```java
public boolean containsOrBeforeTimelineStarts(String instant) {
  return instants.stream().anyMatch(s -> s.getTimestamp().equals(instant)) || isBeforeTimelineStarts(instant);
}
```

Because the timeline is non-contiguous, with `20240414122629902` missing, `bb1ffcb0-46f8-4a91-aeb3-f1ff6987b4c2-0_15-20-4932_20240414122629902.parquet` is no longer considered a valid file slice. It was therefore not listed as a candidate file during tagging, so a record that should have been updated was instead inserted again into another parquet file, causing the duplicate key issue.
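To make the failure mode concrete, here is a minimal standalone sketch (not the actual Hudi classes) of the contiguity assumption in `containsOrBeforeTimelineStarts`. The instant timestamps `20240414122036275` and `20240414122629902` are taken from the comment above; the later instant and the simplified `isBeforeTimelineStarts` logic are assumptions for illustration only.

```java
import java.util.List;

// Simplified simulation of a non-contiguous timeline: 20240414122629902 was
// archived out of order, while the smaller instant 20240414122036275 is still
// present, leaving a hole in the middle of the timeline.
public class TimelineHoleSketch {

  // Instants visible on the (non-contiguous) timeline snapshot held by the
  // executors. The last instant is a hypothetical later commit.
  static final List<String> INSTANTS =
      List.of("20240414122036275", "20240416122242348");

  // Stand-in for isBeforeTimelineStarts: true only if the instant sorts
  // before the first instant on the timeline.
  static boolean isBeforeTimelineStarts(String instant) {
    return instant.compareTo(INSTANTS.get(0)) < 0;
  }

  // Mirrors the containsOrBeforeTimelineStarts snippet quoted above.
  static boolean containsOrBeforeTimelineStarts(String instant) {
    return INSTANTS.stream().anyMatch(s -> s.equals(instant))
        || isBeforeTimelineStarts(instant);
  }

  public static void main(String[] args) {
    // The base file ..._20240414122629902.parquet was committed by an instant
    // that now sits in the hole: it is neither on the timeline nor before its
    // start, so the file slice is treated as uncommitted and skipped by tagging.
    System.out.println(containsOrBeforeTimelineStarts("20240414122629902")); // false

    // An instant archived in order (older than the timeline start) would still
    // be considered committed.
    System.out.println(containsOrBeforeTimelineStarts("20240414121500000")); // true
  }
}
```

Under these assumptions, the instant that falls inside the hole fails both checks, which is exactly why the corresponding file slice drops out of the candidate set during tagging.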