shuwenwei opened a new pull request, #15341: URL: https://github.com/apache/iotdb/pull/15341
## Description

Currently, during the settle phase of file selection, all files in a partition are traversed and categorized by the amount of data remaining in them. Each file is classified as either fully dirty or partial dirty. A fully dirty file is one from which all data can be deleted, while a partial dirty file contains only some deletable data. In the final compaction tasks, fully dirty files are expected to be deleted first, and the partial dirty files are then cleaned up by inner-space compaction tasks.

A large number of partial dirty files may be selected within a single partition. These files are not all submitted in one compaction task. Instead, they are split into multiple tasks based on their size and count. The current splitting strategy submits all fully dirty files together with the first group of partial dirty files as one task, and submits each subsequent group of partial dirty files as a separate task.

This leads to a problem, as shown in the diagram: the second task contains File 5 and File 7, with the fully dirty File 6 in between. If File 6 has not yet been deleted by another task when Task 2 is executed, File 6 may overlap with the target files produced by the compaction, resulting in an error.

<img width="1127" alt="Screenshot 2025-04-14 18 27 50" src="https://github.com/user-attachments/assets/c8fb4f9d-eae9-4949-9bcc-c539407010f8" />

This PR changes how fully dirty files are submitted.

<img width="1101" alt="Screenshot 2025-04-14 18 32 14" src="https://github.com/user-attachments/assets/0b9619a1-2a7e-45b1-b439-67a375ab1a71" />
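
The problem described above can be reproduced with a small model. The following is only a sketch, not IoTDB code: `FileInfo`, `Task`, `splitCurrentStrategy`, and `hasStrandedFullyDirtyFile` are hypothetical names, groups are split by file count only (the real selector also considers size), and the seven-file layout is an assumed one chosen to match the "File 5, File 6, File 7" case in the description.

```java
import java.util.ArrayList;
import java.util.List;

public class SettleSplitSketch {
    /** Hypothetical stand-in for a file: its position in the partition and its dirty state. */
    record FileInfo(int index, boolean fullyDirty) {}

    /** Hypothetical stand-in for one settle compaction task. */
    record Task(List<FileInfo> fullyDirty, List<FileInfo> partialDirty) {}

    /**
     * Models the splitting described above: every fully dirty file is attached
     * to the first group of partial dirty files; later groups carry none.
     * Assumption: groups are cut by count only.
     */
    static List<Task> splitCurrentStrategy(List<FileInfo> files, int maxPartialPerTask) {
        List<FileInfo> fully = new ArrayList<>();
        List<FileInfo> partial = new ArrayList<>();
        for (FileInfo f : files) {
            (f.fullyDirty() ? fully : partial).add(f);
        }
        List<Task> tasks = new ArrayList<>();
        for (int i = 0; i < partial.size(); i += maxPartialPerTask) {
            int end = Math.min(i + maxPartialPerTask, partial.size());
            List<FileInfo> group = new ArrayList<>(partial.subList(i, end));
            tasks.add(new Task(i == 0 ? fully : List.of(), group));
        }
        if (partial.isEmpty() && !fully.isEmpty()) {
            tasks.add(new Task(fully, List.of()));
        }
        return tasks;
    }

    /**
     * Returns true if a fully dirty file sits between the task's first and last
     * partial dirty file but is not part of the task, i.e. the situation in
     * which the compaction target may overlap a not-yet-deleted file.
     */
    static boolean hasStrandedFullyDirtyFile(Task task, List<FileInfo> files) {
        List<FileInfo> partial = task.partialDirty();
        if (partial.isEmpty()) {
            return false;
        }
        int first = partial.get(0).index();
        int last = partial.get(partial.size() - 1).index();
        for (FileInfo f : files) {
            boolean between = f.index() > first && f.index() < last;
            if (f.fullyDirty() && between && !task.fullyDirty().contains(f)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed layout: files 1..7, even-numbered files are fully dirty.
        List<FileInfo> files = new ArrayList<>();
        for (int i = 1; i <= 7; i++) {
            files.add(new FileInfo(i, i % 2 == 0));
        }
        List<Task> tasks = splitCurrentStrategy(files, 2);
        for (int i = 0; i < tasks.size(); i++) {
            System.out.println("task " + (i + 1) + " stranded fully dirty file: "
                + hasStrandedFullyDirtyFile(tasks.get(i), files));
        }
    }
}
```

In this model the second task holds files 5 and 7 while file 6 belongs only to the first task, so whether the second task is safe depends on the first task having already deleted file 6.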
