nsivabalan commented on issue #3739: URL: https://github.com/apache/hudi/issues/3739#issuecomment-931525996
Thanks. Here is what is possibly happening. If you can trigger more updates, eventually you will see cleaning kick in. In short, this has to do with the table being MOR: the cleaner needs to see N commits, not delta commits, before it can clean things up. I will try to explain what that means, but it's going to be lengthy.

Let me first explain data files and delta log files. In Hudi, base (data) files are in Parquet format and delta log files are Avro with a .log extension. Base files are created with commits and log files are created with delta commits. Each base file can have 0 or more log files, which represent updates to the data in that base file.

Here is a simple example:

base_file_1_c1, log_file_1_c1_v1, log_file_1_c1_v2
base_file_2_c2

In the above example, three commits were made: C1, C2 and C3.

C1: created base_file_1_c1.
C2: created base_file_2_c2 and added some updates to base_file_1, so log_file_1_c1_v1 got created.
C3: added some updates to base_file_1, so log_file_1_c1_v2 got created.

So, if we keep making more commits similar to C3, only new log files will be added. These are not considered commits from a cleaning standpoint.

Hudi has something called compaction, which compacts a base file and its corresponding log files into a new version of the base file. Let's say compaction kicks in with commit time C4:

base_file_1_c1, log_file_1_c1_v1, log_file_1_c1_v2
base_file_1_c4
base_file_2_c2

base_file_1_c4 is nothing but (base_file_1_c1 + log_file_1_c1_v1 + log_file_1_c1_v2).

Now, let's say you have configured cleaner commits retained as 1. Then (base_file_1_c1 + log_file_1_c1_v1 + log_file_1_c1_v2) would be cleaned up, because compaction created a newer version of this base file and hence the older version is eligible to be cleaned up. That is not the case for base_file_2_c2, because there is only one version of it.

So, in your case, only when 4 or 5 compactions have happened could you possibly see the cleaner kicking in.
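To make the example above concrete, here is a rough toy model in Python. It is not Hudi's API: FileSlice, compact and cleanable are hypothetical names made up for this sketch, and "keep the newest N versions per file group" is a simplified stand-in for the real cleaner policy. It only shows why nothing is eligible for cleaning until compaction creates a second version of a base file.

```python
from dataclasses import dataclass, field


@dataclass
class FileSlice:
    """One version of a file group: a base file plus its delta log files."""
    base_file: str
    log_files: list = field(default_factory=list)


def compact(slices, file_id, commit_time):
    """Toy compaction: start a new slice whose base file stands for the
    merge of the latest base file and its log files (hypothetical helper)."""
    slices.append(FileSlice(f"base_file_{file_id}_{commit_time}"))


def cleanable(slices, versions_retained):
    """Return the files in every slice except the newest `versions_retained`.
    Assumption: a simplified policy, not Hudi's actual cleaner logic."""
    old = slices[:-versions_retained] if versions_retained > 0 else slices
    return [f for s in old for f in [s.base_file, *s.log_files]]


# State after C1..C3: delta commits only added log files to file group 1.
file_groups = {
    1: [FileSlice("base_file_1_c1", ["log_file_1_c1_v1", "log_file_1_c1_v2"])],
    2: [FileSlice("base_file_2_c2")],
}

# Before compaction every file group has a single version: nothing to clean.
before = {fid: cleanable(s, 1) for fid, s in file_groups.items()}

# Compaction at C4 creates a second version of file group 1 only.
compact(file_groups[1], 1, "c4")
after = {fid: cleanable(s, 1) for fid, s in file_groups.items()}

print(before)
print(after)
```

With one version retained, only the old slice of file group 1 becomes eligible after C4; file group 2 still has a single version, so it is left alone.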