shangxinli commented on code in PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#discussion_r2777674119
##########
api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java:
##########
@@ -147,6 +147,32 @@ public interface RewriteDataFiles
*/
String OUTPUT_SPEC_ID = "output-spec-id";
+  /**
+   * Use Parquet row-group level merging during rewrite operations when applicable.
+   *
+   * <p>When enabled, Parquet files will be merged at the row-group level by directly copying row
Review Comment:
@lintingbin the compression is at the page level. If your streaming checkpoint
interval can produce the typical page size (default 1 MB), we can consider a
later PR to merge at the page level. In
[Parquet](https://github.com/apache/parquet-java), we have made changes to
rewrite Parquet files without a decompress-then-recompress round trip. For
example, we do encryption rewrites (also at the page level) that way: we walk
through each page without decoding or decompressing it, immediately encrypt
that page, and write it to disk. That is several times faster than a
record-by-record rewrite. But it is a more complex change; we can do it later.
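The key idea above — merging already-compressed units by copying their bytes, with no decompress-recompress round trip — can be illustrated with a self-contained toy example. This sketch uses gzip from the JDK standard library as a stand-in for Parquet pages/row groups, since concatenated gzip members form one valid stream; the class and method names are hypothetical and are not part of the Iceberg or Parquet APIs:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Toy analogy for byte-copy merging of compressed blocks: two gzip members
// are "merged" by concatenating their raw bytes, never decompressing either.
public class CompressedMergeDemo {

  static byte[] compress(byte[] data) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(data);
    }
    return out.toByteArray();
  }

  // "Merge" two compressed blocks by byte copy only -- analogous to copying
  // Parquet row groups (or pages) verbatim into the target file.
  static byte[] mergeCompressed(byte[] a, byte[] b) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(a);
    out.write(b);
    return out.toByteArray();
  }

  static String decompressAll(byte[] merged) throws IOException {
    // GZIPInputStream transparently reads concatenated gzip members,
    // so the byte-copied merge decodes as one logical stream.
    try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(merged))) {
      return new String(gz.readAllBytes());
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] merged =
        mergeCompressed(compress("hello".getBytes()), compress("world".getBytes()));
    System.out.println(decompressAll(merged)); // prints "helloworld"
  }
}
```

The real row-group and page copies in parquet-java additionally rewrite footer metadata and offsets, but the cost model is the same: a byte copy of the compressed unit instead of decode, decompress, recompress, re-encode per record.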
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]