Sagar Sumit created HUDI-6888:
---------------------------------
Summary: Optimize scanInternalv2 to single pass
Key: HUDI-6888
URL: https://issues.apache.org/jira/browse/HUDI-6888
Project: Apache Hudi
Issue Type: Task
Reporter: Sagar Sumit
Fix For: 1.0.0
The current algorithm takes two passes over the log blocks:
# First pass, to collect all the valid blocks along with their block instant
times, including each rollback block's target instant time.
# Second pass, in reverse order of block instant time, to track the final
compacted instant times for each block.
Now that we have removed appending to the same log file for multiple
deltacommits, we can probably scan in a single pass by keeping an active list
or hash map from block instant times to their corresponding blocks, updating
it as we go. This should be tested for:
# Out-of-order merged blocks: log compaction is scheduled and, by the time it
appends its block, another writer has already added a block.
# Log compaction fails, so a rollback is issued for its block. The rollback
can be the next block or can come at a later point in time.
# Log compaction is executing and, before it commits, compaction starts
running on the same file group.
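
To make the single-pass idea concrete, here is a rough sketch, written in Python for brevity rather than Hudi's Java. It is not the existing scanInternalv2 logic. Block, Entry, and scan_single_pass are hypothetical names, and the block model (a "compacted" block listing the instants it merged, a "rollback" block listing the instants it targets) is a simplifying assumption. It keeps a hash map from instant time to the active blocks, lets a compacted block displace the instants it merged, and restores them if that compacted block is later rolled back. It does not handle a rollback that targets an instant already displaced by a compacted block.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical, simplified stand-in for a log block."""
    kind: str                       # "data", "rollback", or "compacted"
    instant: str                    # instant time of this block (fixed-width string)
    targets: Tuple[str, ...] = ()   # rollback: instants rolled back; compacted: instants merged


@dataclass
class Entry:
    order_key: str                  # position of this entry in the merged timeline
    blocks: List[Block]
    shadowed: Dict[str, "Entry"]    # entries displaced by a compacted block


def scan_single_pass(blocks: Iterable[Block]) -> List[Block]:
    """Return the blocks that remain valid after one pass in file order."""
    active: Dict[str, Entry] = {}
    for block in blocks:
        if block.kind == "rollback":
            for target in block.targets:
                entry = active.pop(target, None)
                if entry is not None:
                    # A rolled-back compacted block gives back what it displaced.
                    active.update(entry.shadowed)
        elif block.kind == "compacted":
            shadowed = {t: active.pop(t) for t in block.targets if t in active}
            # The compacted block takes the position of the oldest entry it replaces,
            # so a block appended by another writer in the meantime stays after it.
            key = min((e.order_key for e in shadowed.values()), default=block.instant)
            active[block.instant] = Entry(key, [block], shadowed)
        else:
            entry = active.setdefault(block.instant, Entry(block.instant, [], {}))
            entry.blocks.append(block)
    ordered = sorted(active.values(), key=lambda e: e.order_key)
    return [b for e in ordered for b in e.blocks]


# Out-of-order case: compaction of 001 and 002 (instant 003) lands after 004.
log = [
    Block("data", "001"),
    Block("data", "002"),
    Block("data", "004"),
    Block("compacted", "003", ("001", "002")),
]
print([b.instant for b in scan_single_pass(log)])  # ['003', '004']
```

Appending Block("rollback", "005", ("003",)) to the same list covers the late-rollback case: the compacted block is dropped and 001 and 002 become active again. The third case, compaction starting while log compaction is still in flight, is not modelled here and would need its own test.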