voonhous commented on PR #10325:
URL: https://github.com/apache/hudi/pull/10325#issuecomment-2084492493

   We encountered a similar issue internally. A visualization should help users better understand the issue:
   
   
![hus-469](https://github.com/apache/hudi/assets/6312314/dd04b173-500d-4ff0-ae4f-f47911524c09)
   
   # Explanation
   
   There are two jobs that were, in effect, running concurrently: one writing to BR, the other writing to CL.
   
   BR started earlier and ended right before CL started.
   
   BR start: 2024-04-16 12:22:42.348
   BR end: 2024-04-16 12:26:13.996
   
   CL start: 2024-04-16 12:26:20.836
   CL end: 2024-04-16 12:28:02.426
   
   The HoodieSparkCOW table was initialised at around `2024-04-16 12:26:27`, right when BR was performing an archival.
   
   The archival was not performed in order, causing holes (non-contiguity) in the timeline while `org.apache.hudi.client.SparkRDDWriteClient#doInitTable` was being performed at `2024-04-16 12:26:27`.
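   A minimal, self-contained sketch of how this can happen (a simplified model that treats the timeline as a sorted set of instant timestamps; not the actual Hudi archiver):
   
   ```java
   import java.util.List;
   import java.util.TreeSet;
   
   public class OutOfOrderArchivalSketch {
     public static void main(String[] args) {
       // Active timeline modeled as a sorted set of instant timestamps.
       TreeSet<String> activeTimeline = new TreeSet<>(List.of(
           "20240414122036275",   // smallest instant, deleted last
           "20240414122629902",   // deleted first, leaving a hole
           "20240416122242348"));
   
       // The archiver deletes a middle instant before the smallest one.
       activeTimeline.remove("20240414122629902");
   
       // A concurrent doInitTable snapshotting the timeline at this point
       // still sees the old timeline start (20240414122036275), but with a
       // hole where 20240414122629902 used to be.
       System.out.println("Snapshot with hole: " + activeTimeline);
       // => [20240414122036275, 20240416122242348]
     }
   }
   ```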
   
   The smallest instant `20240414122036275` was only deleted at `2024-04-16 
12:26:28.378`.
   
   This causes the executors to be initialised with a metaclient that has the same holes/gaps in the timeline, since the metaclient is SerDe'd from the driver to the executors as-is.
   
   When running `#getLatestBaseFilesBeforeOrOn` during tagging and when fetching the smallest base files, the call chain `org.apache.hudi.common.model.HoodieFileGroup#getAllBaseFiles` -> `org.apache.hudi.common.model.HoodieFileGroup#getAllFileSlices` -> `org.apache.hudi.common.model.HoodieFileGroup#isFileSliceCommitted` -> `org.apache.hudi.common.table.timeline.HoodieDefaultTimeline#containsOrBeforeTimelineStarts` requires a contiguous timeline to determine whether a file slice is valid.
   
   ```java
   // An instant is treated as committed only if it is explicitly present in
   // the loaded timeline, or if isBeforeTimelineStarts judges it to be older
   // than the start of the loaded timeline. An instant that falls inside a
   // hole satisfies neither condition.
   public boolean containsOrBeforeTimelineStarts(String instant) {
     return instants.stream().anyMatch(s -> s.getTimestamp().equals(instant))
         || isBeforeTimelineStarts(instant);
   }
   ```
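   To make the failure concrete, here is a minimal, self-contained sketch (simplified stand-ins for the Hudi classes; modeling `isBeforeTimelineStarts` as a plain string comparison against the first loaded instant is an assumption about its effective behavior here):
   
   ```java
   import java.util.List;
   
   public class ContainsOrBeforeSketch {
     // Timeline as seen by the executors: 20240414122629902 was archived out
     // of order and is missing, but 20240414122036275 was not yet deleted.
     static final List<String> instants = List.of(
         "20240414122036275",
         "20240416122242348");
   
     // Simplified model: an instant is "before the timeline" only if it
     // sorts strictly before the first loaded instant.
     static boolean isBeforeTimelineStarts(String instant) {
       return instant.compareTo(instants.get(0)) < 0;
     }
   
     static boolean containsOrBeforeTimelineStarts(String instant) {
       return instants.stream().anyMatch(s -> s.equals(instant))
           || isBeforeTimelineStarts(instant);
     }
   
     public static void main(String[] args) {
       // The file slice's commit time falls inside the hole: it is neither
       // present nor before the timeline start, so it is deemed uncommitted.
       System.out.println(containsOrBeforeTimelineStarts("20240414122629902")); // false
     }
   }
   ```
   
   With a contiguous timeline, the instant would either still be present or would sort before an already-advanced timeline start, so one of the two branches of the check would cover it.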
   
   Due to the non-contiguity of the timeline, where `20240414122629902` is missing, `bb1ffcb0-46f8-4a91-aeb3-f1ff6987b4c2-0_15-20-4932_20240414122629902.parquet` is no longer a valid file slice. It was therefore not listed as a candidate file during tagging, so a record that should have been updated was inserted again into another parquet file, causing the duplicate-key issue.
   

