szehon-ho opened a new pull request, #4812: URL: https://github.com/apache/iceberg/pull/4812
This exposes position deletes as a metadata table "position_deletes", with the schema: file, pos, row, partition.

This will be useful for implementing "RewritePositionDeleteFiles", where we read position deletes from Spark and then write them back. It will also be useful for implementing "RemoveDanglingDeleteFiles", i.e. removing delete files that no longer reference live data files. More generally, it lets users get insight into position delete files as a table via SQL.

Notes:
1. Design choice: why a metadata table? Initially I tried to implement this as a Spark read config, but SparkCatalog.loadTable did not support read configuration options to load a SparkTable with an alternate schema, so I chose the metadata table path instead.
2. Implementation: most of the changes here add the concept of DeleteFileScanTask. Today, FileScanTask is bound to data files (FileScanTask.file() returns DataFile), and we can't really change that because it is used in hundreds of places. So this adds a contentFile() method that returns the DataFile for DataFileScanTask, or the DeleteFile for DeleteFileScanTask. To support scanning position delete files, code paths that need to scan delete files are changed from FileScanTask.file() to FileScanTask.contentFile(), since all the logic should work equally well with either DeleteFile or DataFile.
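The DeleteFileScanTask design in note 2 can be sketched roughly as follows. This is a simplified, hypothetical model for illustration only, not the actual Iceberg API: the real FileScanTask, ContentFile, DataFile, and DeleteFile types in org.apache.iceberg carry far more state (metrics, partition data, file format, etc.).

```java
// Hypothetical, simplified types -- NOT the real org.apache.iceberg interfaces.
interface ContentFile {
    String path();
}

// Records standing in for Iceberg's DataFile and DeleteFile.
record DataFile(String path) implements ContentFile {}
record DeleteFile(String path) implements ContentFile {}

interface FileScanTask {
    // New method from the PR: returns whichever content file the task scans.
    ContentFile contentFile();

    // Legacy method, bound to DataFile and used in hundreds of call sites.
    // On a delete-file task this cast fails, which is why delete-scanning
    // code paths migrate from file() to contentFile().
    default DataFile file() {
        return (DataFile) contentFile();
    }
}

class DataFileScanTask implements FileScanTask {
    private final DataFile file;

    DataFileScanTask(DataFile file) {
        this.file = file;
    }

    @Override
    public ContentFile contentFile() {
        return file;
    }
}

class DeleteFileScanTask implements FileScanTask {
    private final DeleteFile file;

    DeleteFileScanTask(DeleteFile file) {
        this.file = file;
    }

    @Override
    public ContentFile contentFile() {
        return file;
    }
}
```

With this shape, generic scan code can call contentFile() and work identically for data and delete files, while existing callers of file() keep compiling unchanged. Once the metadata table exists, the delete rows would then be queryable with something like `SELECT * FROM db.tbl.position_deletes` in Spark SQL.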
