flyrain opened a new pull request, #4539: URL: https://github.com/apache/iceberg/pull/4539
The draft PR for change data capture (CDC). It largely aligns with the MVP we discussed in the [design doc](https://docs.google.com/document/d/1bN6rdLNcYOHnT3xVBfB33BoiPO06aKBo56SZmuU9pnY/edit?usp=sharing) and on the mailing list:

1. Emit only delete and insert CDC records.
2. Create a Spark action for CDC generation.
3. Leverage the `_deleted` metadata column for both position deletes and equality deletes, so rows deleted either way are emitted in the same format.

This is still a draft PR, so there are limitations:

1. For row-level deletes, it only supports the Parquet vectorized read at the moment.
2. Multiple optimizations can be done, for example, pushdown of the `_deleted` metadata column.
3. The interface needs to be expanded to support querying by timestamp instead of snapshot IDs.
4. Test cases need to be added.

Happy to take feedback and file the formal PRs.

cc @aokolnychyi @RussellSpitzer @szehon-ho @jackye1995 @kbendick @karuppayya @chenjunjiedada
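To make the record semantics above concrete, here is a minimal plain-Java sketch. It is not code from this PR and it does not use the Iceberg or Spark APIs. The names `CdcSketch`, `ScannedRow`, `ChangeRecord`, and `changes` are all hypothetical. It also assumes that inserts come from data files added in a snapshot and that deletes are the rows flagged by `_deleted` when re-reading data files touched by that snapshot's delete files.

```java
import java.util.ArrayList;
import java.util.List;

public class CdcSketch {
    // Only two change types, matching the "delete and insert only" scope.
    enum ChangeType { INSERT, DELETE }

    // Hypothetical stand-in for a row read with the `_deleted` metadata column projected.
    static final class ScannedRow {
        final long id;
        final String data;
        final boolean deleted; // value of `_deleted`; set the same way for pos and eq deletes

        ScannedRow(long id, String data, boolean deleted) {
            this.id = id;
            this.data = data;
            this.deleted = deleted;
        }
    }

    // Hypothetical CDC record: the row payload plus change type and snapshot ID.
    static final class ChangeRecord {
        final ChangeType type;
        final long id;
        final String data;
        final long snapshotId;

        ChangeRecord(ChangeType type, long id, String data, long snapshotId) {
            this.type = type;
            this.id = id;
            this.data = data;
            this.snapshotId = snapshotId;
        }

        @Override
        public String toString() {
            return type + "(" + id + ", " + data + ") @ snapshot " + snapshotId;
        }
    }

    // Assumption: addedRows come from data files added in the snapshot;
    // rescannedRows come from existing data files affected by new delete files.
    static List<ChangeRecord> changes(
            List<ScannedRow> addedRows, List<ScannedRow> rescannedRows, long snapshotId) {
        List<ChangeRecord> out = new ArrayList<>();
        for (ScannedRow row : rescannedRows) {
            if (row.deleted) { // rows that are still live are unchanged and not emitted
                out.add(new ChangeRecord(ChangeType.DELETE, row.id, row.data, snapshotId));
            }
        }
        for (ScannedRow row : addedRows) {
            if (!row.deleted) { // skip rows added and deleted within the same snapshot
                out.add(new ChangeRecord(ChangeType.INSERT, row.id, row.data, snapshotId));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<ScannedRow> added = List.of(new ScannedRow(3, "c", false));
        List<ScannedRow> rescanned =
                List.of(new ScannedRow(1, "a", false), new ScannedRow(2, "b", true));
        for (ChangeRecord record : changes(added, rescanned, 100L)) {
            System.out.println(record);
        }
    }
}
```

Because the delete branch looks only at the `_deleted` flag, it never needs to know whether a position delete or an equality delete removed the row, which is the point of item 3 in the first list.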
