lintingbin commented on code in PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#discussion_r2772092363


##########
api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java:
##########
@@ -147,6 +147,32 @@ public interface RewriteDataFiles
    */
   String OUTPUT_SPEC_ID = "output-spec-id";
 
+  /**
+   * Use Parquet row-group level merging during rewrite operations when applicable.
+   *
+   * <p>When enabled, Parquet files will be merged at the row-group level by directly copying row

Review Comment:
   The current scenario involves Flink writing to a large, wide Iceberg table, with a checkpoint committed every minute.
   1. This produces one small file per minute. These small files can be merged quickly, and merging them with the traditional approach poses no problem.
   2. Suppose 300 small files are merged into a single 150 MB file using the traditional approach.
   3. However, new small files keep arriving, so the next merge would have to combine the 150 MB file with 80 new small files.
   4. Deserializing and re-serializing the 150 MB file during that merge would be very slow.
   5. If the 80 small files could first be merged into one larger file using the traditional approach, and that file were then combined with the 150 MB file via row-group merging, the process would be much faster while also avoiding undersized row groups (see the sketch below).
   
   This is the problem we aim to solve. Ideally, each row of data would be compressed only once, but that is often unachievable in high-frequency write scenarios. Still, if data that has already been compressed once by the traditional approach can later be merged at the row-group level without being re-compressed, that remains highly desirable.
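
   To make the mechanism concrete, here is a minimal sketch of row-group level merging using parquet-mr's `ParquetFileWriter.appendFile`, which splices each input file's row groups into the output as raw bytes, with no page decompression or re-encoding. This is only an illustration of the technique, not the implementation in this PR, and it assumes all inputs share an identical schema:

   ```java
   import java.io.IOException;
   import java.util.Collections;
   import java.util.List;

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.parquet.hadoop.ParquetFileReader;
   import org.apache.parquet.hadoop.ParquetFileWriter;
   import org.apache.parquet.hadoop.ParquetWriter;
   import org.apache.parquet.hadoop.util.HadoopInputFile;
   import org.apache.parquet.hadoop.util.HadoopOutputFile;
   import org.apache.parquet.schema.MessageType;

   public class RowGroupConcat {

     /**
      * Concatenates Parquet files by copying their row groups byte-for-byte.
      * All inputs must share the same schema; row groups are never decoded.
      */
     public static void concat(List<Path> inputs, Path output, Configuration conf)
         throws IOException {
       // Take the schema from the first input; appendFile requires every
       // input to match the writer's schema exactly.
       MessageType schema;
       try (ParquetFileReader reader =
           ParquetFileReader.open(HadoopInputFile.fromPath(inputs.get(0), conf))) {
         schema = reader.getFileMetaData().getSchema();
       }

       ParquetFileWriter writer =
           new ParquetFileWriter(
               HadoopOutputFile.fromPath(output, conf),
               schema,
               ParquetFileWriter.Mode.CREATE,
               ParquetWriter.DEFAULT_BLOCK_SIZE,
               ParquetWriter.MAX_PADDING_SIZE_DEFAULT);
       writer.start();
       for (Path input : inputs) {
         // appendFile copies each row group's compressed pages directly into
         // the output, so the large file is never deserialized.
         writer.appendFile(HadoopInputFile.fromPath(input, conf));
       }
       writer.end(Collections.emptyMap());
     }
   }
   ```

   Because the pages are copied as-is, the second merge in step 5 costs roughly one sequential I/O pass instead of a full decode/re-encode, and the row groups produced by the first (traditional) merge keep their full size.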




