lintingbin commented on code in PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#discussion_r2772092363
##########
api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java:
##########
@@ -147,6 +147,32 @@ public interface RewriteDataFiles
*/
String OUTPUT_SPEC_ID = "output-spec-id";
+ /**
+ * Use Parquet row-group level merging during rewrite operations when applicable.
+ *
+ * <p>When enabled, Parquet files will be merged at the row-group level by directly copying row
Review Comment:
The current scenario involves Flink writing to a large, wide table in
Iceberg, with a checkpoint committed every minute.
1. This produces one small file per minute. These small files can be merged
quickly, and merging them with the traditional approach is fine.
2. Suppose 300 small files are merged into a single 150MB file using the
traditional approach.
3. However, new small files keep arriving, so the next merge has to combine
the 150MB file with 80 new small files.
4. Decoding and re-encoding the 150MB file on every such pass is very slow.
5. If the 80 small files could first be merged into a larger file and then
combined with the 150MB file via row-group merging, the process would be much
faster while also avoiding undersized row groups (a sketch of this kind of
row-group copy follows this list).
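
For illustration, here is a minimal sketch of row-group level copying using the existing parquet-mr `ParquetFileWriter.appendFile` API, which appends a file's row groups as raw bytes without decompressing or re-encoding pages. This only shows the mechanism the comment relies on; it is not the implementation proposed in this PR, and it assumes all inputs share one schema:

```java
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.hadoop.util.HadoopOutputFile;
import org.apache.parquet.schema.MessageType;

// Merge Parquet files by copying row groups as raw bytes: pages are never
// decompressed or re-encoded, so no per-row serialization cost is paid.
public class RowGroupCopyMerge {
  public static void merge(Configuration conf, Path output, Path... inputs) throws Exception {
    // Take the schema from the first input; all inputs must match it.
    MessageType schema;
    try (ParquetFileReader first =
        ParquetFileReader.open(HadoopInputFile.fromPath(inputs[0], conf))) {
      schema = first.getFooter().getFileMetaData().getSchema();
    }

    ParquetFileWriter writer =
        new ParquetFileWriter(
            HadoopOutputFile.fromPath(output, conf),
            schema,
            ParquetFileWriter.Mode.CREATE,
            128L * 1024 * 1024, // row group size hint
            8 * 1024 * 1024); // max padding
    writer.start();
    for (Path input : inputs) {
      // appendFile copies every row group from the input byte-for-byte.
      writer.appendFile(HadoopInputFile.fromPath(input, conf));
    }
    writer.end(Collections.emptyMap());
  }
}
```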
This is the problem we aim to solve. Ideally, each row of data would be
compressed only once, but that is often unachievable in high-frequency write
scenarios. Still, if data that has already been compressed once by the
traditional path can later be merged via row-group copying, that is highly
desirable.
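
For completeness, a hedged sketch of how this behavior might be enabled through the existing actions API. The option key `"parquet-row-group-merge-enabled"` is a placeholder, since the constant's name is not visible in this hunk; `SparkActions`, `binPack()`, and `TARGET_FILE_SIZE_BYTES` are the existing Iceberg API:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class RowGroupMergeCompaction {
  static void compact(SparkSession spark, Table table) {
    RewriteDataFiles.Result result =
        SparkActions.get(spark)
            .rewriteDataFiles(table)
            .binPack()
            // Hypothetical key standing in for the constant this PR introduces.
            .option("parquet-row-group-merge-enabled", "true")
            // Existing option: aim for ~512MB output files.
            .option(
                RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
                String.valueOf(512L * 1024 * 1024))
            .execute();

    System.out.printf(
        "Rewrote %d files into %d files%n",
        result.rewrittenDataFilesCount(), result.addedDataFilesCount());
  }
}
```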