szehon-ho opened a new pull request, #4812: URL: https://github.com/apache/iceberg/pull/4812
This exposes position deletes as a metadata table "position_deletes", with the schema: file, pos, row, partition.

This will be useful for implementing "RewritePositionDeleteFiles", where we read position deletes from Spark and then write them back. It will also be useful for implementing "RemoveDanglingDeleteFiles", i.e. removing delete files that no longer reference live data files. More generally, it lets users get insight into position delete files as a table via SQL.

Notes:
1. Design choice: why a metadata table? Initially I tried to implement this as a Spark read config, but SparkCatalog.loadTable did not support read configuration options to load a SparkTable with an alternate schema, so I chose the metadata table path instead.
2. Implementation: most of the changes here add the concept of DeleteFileScanTask. Today, FileScanTask is bound to data files (FileScanTask.file() returns DataFile), and we can't really change that because it is used in hundreds of places. So this adds a contentFile() method that returns the DataFile for DataFileScanTask, or the DeleteFile for DeleteFileScanTask. To support scanning position delete files, code paths that need to scan delete files are changed from FileScanTask.file() to FileScanTask.contentFile(), since all the logic should work equally well with either DeleteFile or DataFile.
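The DeleteFileScanTask design in note 2 can be sketched roughly as follows. This is a simplified, hypothetical model for illustration only, not the actual Iceberg API: the real FileScanTask, ContentFile, DataFile, and DeleteFile types in org.apache.iceberg carry far more state (metrics, partition data, file format, etc.).

```java
// Hypothetical, simplified types -- NOT the real org.apache.iceberg interfaces.
interface ContentFile {
    String path();
}

// Records standing in for Iceberg's DataFile and DeleteFile.
record DataFile(String path) implements ContentFile {}
record DeleteFile(String path) implements ContentFile {}

interface FileScanTask {
    // New method from the PR: returns whichever content file the task scans.
    ContentFile contentFile();

    // Legacy method, bound to DataFile and used in hundreds of call sites.
    // On a delete-file task this cast fails, which is why delete-scanning
    // code paths migrate from file() to contentFile().
    default DataFile file() {
        return (DataFile) contentFile();
    }
}

class DataFileScanTask implements FileScanTask {
    private final DataFile file;

    DataFileScanTask(DataFile file) {
        this.file = file;
    }

    @Override
    public ContentFile contentFile() {
        return file;
    }
}

class DeleteFileScanTask implements FileScanTask {
    private final DeleteFile file;

    DeleteFileScanTask(DeleteFile file) {
        this.file = file;
    }

    @Override
    public ContentFile contentFile() {
        return file;
    }
}
```

With this shape, generic scan code can call contentFile() and work identically for data and delete files, while existing callers of file() keep compiling unchanged. Once the metadata table exists, the delete rows would then be queryable with something like `SELECT * FROM db.tbl.position_deletes` in Spark SQL.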
