[GitHub] [iceberg] jackye1995 commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

GitBox Thu, 23 Sep 2021 23:14:06 -0700


jackye1995 commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r715340822




##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -94,9 +102,12 @@ protected void validate(TableMetadata base) {
         validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, 
!validateDeletes);
       }
 
-      // TODO: does this need to check new delete files?
-      if (conflictDetectionFilter != null) {
-        validateAddedDataFiles(base, startingSnapshotId, 
conflictDetectionFilter, caseSensitive);
+      if (appendConflictDetectionFilter != null) {
+        validateAddedDataFiles(base, startingSnapshotId, 
appendConflictDetectionFilter, caseSensitive);
+      }
+
+      if (deleteConflictDetectionFilter != null) {
+        validateNoNewDeletes(base, startingSnapshotId, 
deleteConflictDetectionFilter, caseSensitive);

Review comment:
       A bit late to the whole discussion. Regarding the check, I read the 
outlined way to optimize it, just want to share some thoughts based on what I 
am doing for position deletes of my internal distribution as of today. 
   
   In my system, each position delete file written contains exactly 1 
`file_path` value, which avoids the requirement from the spec to sort by file 
path and also greatly simplifies the validation during concurrent commits, 
because each check can easily find all position deletes of each data file and 
check against just the position min max to see if there is any potential 
overlapping of the position range. Of course this cannot be applied to a 
general use case, it was implemented just to see what can be achieved with a 
closed system where all delete writers only write that specific type of 
position delete file.
   
   When I started to compact position delete files to contain multiple 
`file_path` values, it becomes very easy to have false-positives, especially in 
the object storage mode where the `file_path` min and max does not really mean 
anything anymore. So at least from the object storage use case, secondary index 
with much better file skipping ability is a must have to make the strategy 
described truly work efficiently.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] jackye1995 commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Reply via email to