kinolaev commented on code in PR #15727:
URL: https://github.com/apache/iceberg/pull/15727#discussion_r2978224515


##########
core/src/main/java/org/apache/iceberg/DeleteFileIndex.java:
##########
@@ -205,12 +205,8 @@ private DeleteFile findDV(long seq, DataFile dataFile) {
     }
 
     DeleteFile dv = dvByPath.get(dataFile.location());
-    if (dv != null) {
-      ValidationException.check(
-          dv.dataSequenceNumber() >= seq,
-          "DV data sequence number (%s) must be greater than or equal to data file sequence number (%s)",
-          dv.dataSequenceNumber(),
-          seq);
+    if (dv != null && dv.dataSequenceNumber() < seq) {

Review Comment:
   We can treat a DV file that references a data file with a greater 
sequence number either 1) as a spec violation or 2) as a dangling delete file 
for a previously deleted data file with the same name.
   In the first case, given that different engines can work with the same 
table, I would prefer that Spark ignore it instead of failing, especially 
because there is no way in Spark to fix the violation using only SQL, without 
Java: the remove dangling deletes action can only be called as part of the 
rewrite_data_files procedure, which will fail on this check during the scan 
before the action is called. And if you accept this PR, there will be no way to 
fix the violation with Spark even using Java.
   The second case, I agree, is very unlikely but still possible, and scans 
should ignore dangling delete files. I'm sorry if I've missed the part of the 
spec that makes it a spec violation.
   That is why I've proposed relaxing the check. An alternative would be 
to find dangling equality deletes by copying the 
DeleteFileIndex.canContainEqDeletesForFile method to the action (or making the 
class and the method public). I can do that if it's a more appropriate solution.
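   To make the relaxed behavior concrete, here is a minimal, self-contained sketch of what I mean. It is not the Iceberg code: `RelaxedDvLookup` and `Dv` are hypothetical stand-ins for `DeleteFileIndex` and `DeleteFile`, and returning `null` for the stale DV is my assumption about what "ignore" means here, since the body of the new `if` is not shown in the hunk above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for DeleteFileIndex; not the Iceberg class.
final class RelaxedDvLookup {

  // Hypothetical stand-in for a DV DeleteFile: the data file it references
  // and the DV's own data sequence number.
  static final class Dv {
    final String referencedDataFile;
    final long dataSequenceNumber;

    Dv(String referencedDataFile, long dataSequenceNumber) {
      this.referencedDataFile = referencedDataFile;
      this.dataSequenceNumber = dataSequenceNumber;
    }
  }

  private final Map<String, Dv> dvByPath = new HashMap<>();

  void add(Dv dv) {
    dvByPath.put(dv.referencedDataFile, dv);
  }

  // Relaxed lookup: a DV whose sequence number is lower than the data file's
  // is treated as dangling and skipped instead of failing the scan.
  Dv findDV(long seq, String dataFileLocation) {
    Dv dv = dvByPath.get(dataFileLocation);
    if (dv != null && dv.dataSequenceNumber < seq) {
      // Assumption: "ignore" means the DV is not applied to this data file.
      return null;
    }
    return dv;
  }

  public static void main(String[] args) {
    RelaxedDvLookup index = new RelaxedDvLookup();
    index.add(new Dv("data/a.parquet", 5L));

    // Data file older than the DV: the DV applies.
    System.out.println(index.findDV(3L, "data/a.parquet") != null);
    // Data file newer than the DV: the DV is treated as dangling.
    System.out.println(index.findDV(7L, "data/a.parquet") != null);
  }
}
```

   With the strict check, the second lookup would throw a ValidationException during scan planning; with the relaxed one, the scan proceeds and the dangling DV is left for a maintenance action to clean up.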


