xianyouQ opened a new issue, #10824:
URL: https://github.com/apache/iceberg/issues/10824
### Feature Request / Improvement
I propose adding a “from-snapshot” parameter to RewriteDataFiles. It can be
set to the ID of a snapshot that has not been rewritten recently (within the
last 20 minutes, for example), so that RewriteDataFiles rewrites only the
files from “from-snapshot” up to the latest snapshot.
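
To make the intended selection concrete, here is a minimal sketch in plain Python. It does not use the Iceberg API: the `Snapshot` class and `files_since` helper are hypothetical, the history is assumed to be linear and ordered oldest to newest, and the range is assumed to include “from-snapshot” itself. Files deleted by later snapshots are ignored for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a table snapshot (not an Iceberg class)."""
    snapshot_id: int
    timestamp_ms: int
    added_files: tuple  # paths of the data files this snapshot added


def files_since(snapshots, from_snapshot_id):
    """Return the files added by from_snapshot_id and by every later snapshot.

    `snapshots` must be ordered oldest to newest (assumed linear history).
    """
    ids = [s.snapshot_id for s in snapshots]
    if from_snapshot_id not in ids:
        raise ValueError(f"unknown snapshot id: {from_snapshot_id}")
    start = ids.index(from_snapshot_id)
    # Only these files would be candidates for the proposed minor rewrite.
    return [path for s in snapshots[start:] for path in s.added_files]


history = [
    Snapshot(1, 0, ("a.parquet", "b.parquet")),
    Snapshot(2, 600_000, ("c.parquet",)),
    Snapshot(3, 1_200_000, ("d.parquet", "e.parquet")),
]

candidates = files_since(history, from_snapshot_id=2)
print(candidates)
```

With this history, a rewrite starting at snapshot 2 would consider only the three files added by snapshots 2 and 3 and would leave the older files untouched.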
The most recent snapshots of an Iceberg table usually contain many small
files. If compaction can process just these recent snapshots quickly, it can
significantly reduce read amplification for merge-on-read (MOR) reads.
Because each run is fast, this kind of compaction can also be executed
frequently.
We can also run a regular (full) RewriteDataFiles operation after several
minor rewrites to ensure that the remaining small files, and the read
amplification they cause, are completely removed.
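
One way to picture this combination is the small scheduling sketch below. It is only an illustration: `plan_rewrites` and the `full_every` interval are hypothetical names, and a real job could just as well trigger the full rewrite based on file counts or elapsed time.

```python
def plan_rewrites(num_runs, full_every):
    """Label each scheduled run as 'minor' or 'full'.

    Every `full_every`-th run is a regular (full) rewrite; all other runs
    are minor rewrites limited to recent snapshots.
    """
    if full_every < 1:
        raise ValueError("full_every must be at least 1")
    return [
        "full" if run % full_every == 0 else "minor"
        for run in range(1, num_runs + 1)
    ]


plan = plan_rewrites(num_runs=6, full_every=3)
print(plan)
```

Here two fast minor rewrites are followed by one full rewrite, and the pattern then repeats.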
### Query engine
Spark