Re: [PR] feat: Add table.maintenance.compact() for full-table data file compaction [iceberg-python]

via GitHub Thu, 19 Mar 2026 10:24:59 -0700


qzyu999 commented on PR #3124:
URL: https://github.com/apache/iceberg-python/pull/3124#issuecomment-4091948016

> I have been working on similar functionality for a while as part of my
> upsert optimization efforts:
> https://github.com/EnyMan/iceberg-python/blob/rewrite-data-files/pyiceberg/table/maintenance.py#L47.
> We have used it extensively in our production environment (10K+ rewrites). It
> should be basically a clone of the Java version. I was planning on creating
> a PR but never got to it until now, and I see there is already some work
> being done on it. However, I use `Operation.OVERWRITE` instead of replace.

Hi @EnyMan, I took a look at your code. IIUC, it adds the new files and
deletes the old files in a single `Operation.OVERWRITE` commit, as you
mentioned. I did something similar in the beginning, but I now believe that
approach is flawed from the Java perspective:
- `OVERWRITE` means new data is added to overwrite existing data
- `REPLACE` means files are moved and replaced without changing the data in
the table

This has implications for time travel and conflict resolution:
- If a snapshot is marked as `REPLACE`, the reader knows that the underlying
files were strictly restructured (e.g., compacted from 10 small files to 1
large file) but no new logical records were inserted, updated, or deleted. The
reader can safely ignore this snapshot.
- If you use `OVERWRITE` for a compaction job, downstream processes may
incorrectly perceive the compacted files as new data, potentially leading to
duplicate processing.
- During optimistic concurrency control, Iceberg uses the operation type to
determine if two concurrent commits conflict. Because `REPLACE` strictly
promises no logical changes, Iceberg's commit protocol can often safely
re-apply a REPLACE operation alongside other concurrent data modifications
(provided the specific files being replaced haven't been deleted).
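
To make the points above concrete, here is a minimal sketch in plain Python. It does not use PyIceberg's API: `Snapshot`, `files_to_process`, and `replace_can_retry` are hypothetical names, the `Operation` enum is a local stand-in, and the consumer is deliberately naive. Real engines apply their own, more detailed rules.

```python
from dataclasses import dataclass, field
from enum import Enum


class Operation(Enum):
    # Local stand-in for the snapshot operation names, not the PyIceberg enum.
    APPEND = "append"
    REPLACE = "replace"
    OVERWRITE = "overwrite"
    DELETE = "delete"


@dataclass
class Snapshot:
    # Hypothetical, simplified model of a snapshot: only what the sketch needs.
    snapshot_id: int
    operation: Operation
    added_files: list[str] = field(default_factory=list)
    removed_files: list[str] = field(default_factory=list)


def files_to_process(snapshots):
    """Files a naive incremental consumer would read from a snapshot history."""
    to_read = []
    for snap in snapshots:
        if snap.operation is Operation.REPLACE:
            # REPLACE promises no logical change, so there is nothing new to read.
            continue
        to_read.extend(snap.added_files)
    return to_read


def replace_can_retry(replace, concurrent):
    """True if no concurrent commit removed a file the REPLACE wants to rewrite."""
    inputs = set(replace.removed_files)
    return all(inputs.isdisjoint(c.removed_files) for c in concurrent)


ingest = Snapshot(1, Operation.APPEND, added_files=["a.parquet", "b.parquet"])
compaction = Snapshot(
    2,
    Operation.REPLACE,
    added_files=["ab.parquet"],
    removed_files=["a.parquet", "b.parquet"],
)
# The same compaction, but committed as OVERWRITE instead of REPLACE.
mislabeled = Snapshot(
    2,
    Operation.OVERWRITE,
    added_files=["ab.parquet"],
    removed_files=["a.parquet", "b.parquet"],
)

as_replace = files_to_process([ingest, compaction])
as_overwrite = files_to_process([ingest, mislabeled])
print(as_replace)    # ['a.parquet', 'b.parquet']
print(as_overwrite)  # ['a.parquet', 'b.parquet', 'ab.parquet']
```

With the `OVERWRITE` label, the naive consumer reads `ab.parquet` even though it holds the same rows as `a.parquet` and `b.parquet`, which is the duplicate-processing risk described above. `replace_can_retry` models only the proviso from the last bullet: a concurrent append leaves the compaction's inputs alone, while a concurrent delete of `a.parquet` does not.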

For reasons that I believe are related to the above examples, @kevinjqliu
requested that we first implement the `Operation.REPLACE` functionality (#3130,
#3131), and then come back to this issue/PR and complete the redesign. I saw
that your code has many of the additional features that exist in Java's
compaction function. As mentioned in #1092, the initial version of PyIceberg's
compaction can start with the basic harness and iterate, in future issues/PRs,
towards the level of completion that your implementation has. Following this
logic, I believe that once #3130 and #1092 are completed, your code would be
quite valuable for quickly implementing compaction and adding those additional
features to PyIceberg.

* Insights were assisted with AI

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
