rdblue commented on code in PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#discussion_r2913230037


##########
api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java:
##########
@@ -147,6 +147,32 @@ public interface RewriteDataFiles
    */
   String OUTPUT_SPEC_ID = "output-spec-id";
 
+  /**
+   * Use Parquet row-group level merging during rewrite operations when 
applicable.
+   *
+   * <p>When enabled, Parquet files will be merged at the row-group level by 
directly copying row

Review Comment:
   If I understand correctly from the comment @lintingbin wrote, this is an 
attempt to decrease the cost of compaction when compaction is unstable -- 
that is, when files that have already been compacted (the 150 MB file) are 
compacted a second time. It's a little unclear, but I think the assertion in 
the last item (5) is that this is useful if you first rewrite small files into 
larger files and then compact those larger files without rewriting their row 
groups. That would mean a 2-pass approach: first rewrite the content into 
medium-sized files (written as whole row groups), then concatenate those into 
large files with multiple row groups.
   
   I don't understand the value of that approach. Once you've solved the small 
files problem (a ~100x reduction in file count) by rewriting into larger row 
groups, the additional benefit of a second compaction is very low (only a ~2x 
further reduction). I don't see why you would perform the second compaction at 
all if it only concatenates the row groups from other files. If you're 
rewriting the data a second time anyway, it makes much more sense to prepare 
the data for long-term storage and querying by clustering and ordering the 
rows. That would significantly decrease overall size and speed up queries at 
the same time, which is worth the cost of the rewrite.
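   The ~100x vs. ~2x argument can be made concrete with some hypothetical 
numbers (the file counts and sizes below are assumptions for illustration, not 
figures from this PR):

```java
// Illustrative cost/benefit of a second compaction pass. Assume 10,000 tiny
// ~1.5 MB files, a ~150 MB target for pass one, and a ~300 MB target for
// pass two. All numbers are hypothetical.
public class CompactionBenefit {
  public static void main(String[] args) {
    long smallFiles = 10_000;      // initial tiny files
    long afterFirstPass = 100;     // ~150 MB files after pass one
    long afterSecondPass = 50;     // ~300 MB files after pass two

    // Pass one removes ~99% of the files; pass two only halves what remains.
    System.out.printf("first pass:  %dx fewer files%n",
        smallFiles / afterFirstPass);      // 100x
    System.out.printf("second pass: %dx fewer files%n",
        afterFirstPass / afterSecondPass); // 2x
  }
}
```

   So the second pass pays a full rewrite cost for a marginal reduction in 
file count, unless it also reorganizes the rows.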
   
   And if you're going to cluster and sort the data anyway, I doubt it makes 
sense to use row-group copying for the initial rewrite either. Why incur the 
cost of a rewrite and not reorganize the data in that first pass, when you're 
already rewriting to avoid tiny row groups?
   
   I don't see much value in exposing this -- is it really something that is 
worth supporting when it is extremely limited and has a very narrow use case 
(if any)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

