rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r705787905
##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -81,23 +82,32 @@ public RowDelta validateDataFilesExist(Iterable<? extends CharSequence> referenc
}
@Override
- public RowDelta validateNoConflictingAppends(Expression newConflictDetectionFilter) {
+ public RowDelta validateNoConflictingOperations(Expression newConflictDetectionFilter) {
Preconditions.checkArgument(newConflictDetectionFilter != null, "Conflict detection filter cannot be null");
this.conflictDetectionFilter = newConflictDetectionFilter;
return this;
}
+ @Override
+ public RowDelta validateNoConflictingDeleteFiles() {
+ this.validateNoConflictingDeleteFiles = true;
+ return this;
+ }
+
@Override
protected void validate(TableMetadata base) {
if (base.currentSnapshot() != null) {
if (!referencedDataFiles.isEmpty()) {
validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
}
- // TODO: does this need to check new delete files?
if (conflictDetectionFilter != null) {
validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
}
+
+ if (conflictDetectionFilter != null && validateNoConflictingDeleteFiles) {
+ validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
Review comment:
If I understand correctly, the motivation for updating `RowDelta` is the
case where two delta commits run concurrently? So an UPDATE and a MERGE running
at the same time might both rewrite the same row, which could cause a duplicate:
```sql
INSERT INTO t VALUES (1, 'a'), (2, 'b'), (3, 'c');
-- running these concurrently causes a problem
UPDATE t SET data = 'x' WHERE id = 1;
UPDATE t SET data = 'y' WHERE id = 1;
```
If I ran the updates concurrently, both would delete id=1 and each would add
a new file, one containing `(1, 'x')` and the other `(1, 'y')`, right?
The validation we need here is that the file created by the initial insert
doesn't have any new delete files written against it since the starting
snapshot. It seems like we want to just call `validateNoNewDeletesForDataFiles`
and pass `referencedDataFiles` in, right? Maybe I'm missing something?
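To make the scenario concrete, here is a minimal, self-contained sketch of the check described above. It is a toy model and not Iceberg code: `RowDeltaConflictSketch`, `Table`, `Snapshot`, `commitRowDelta`, and the file names are all hypothetical, and snapshots are identified by their position in a list rather than by real snapshot IDs. Each row delta adds one data file and writes deletes against the data files it references; the optional validation rejects the commit if any snapshot committed after the starting point already wrote deletes against one of those referenced files.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model of the conflict; none of these types are Iceberg classes.
class RowDeltaConflictSketch {

  // A committed snapshot: one added data file, plus the data files its deletes target.
  static final class Snapshot {
    final String addedDataFile;
    final Set<String> deleteTargets;

    Snapshot(String addedDataFile, Set<String> deleteTargets) {
      this.addedDataFile = addedDataFile;
      this.deleteTargets = deleteTargets;
    }
  }

  static final class Table {
    private final List<Snapshot> snapshots = new ArrayList<>();

    // Assumption: the "starting snapshot" is modeled as the number of commits seen so far.
    int version() {
      return snapshots.size();
    }

    void commitRowDelta(int startingVersion, String newDataFile,
                        Set<String> referencedDataFiles, boolean validate) {
      if (validate) {
        // Look only at snapshots committed after the writer's starting point.
        for (Snapshot s : snapshots.subList(startingVersion, snapshots.size())) {
          for (String target : s.deleteTargets) {
            if (referencedDataFiles.contains(target)) {
              throw new IllegalStateException(
                  "New delete file found for referenced data file: " + target);
            }
          }
        }
      }
      snapshots.add(new Snapshot(newDataFile, new HashSet<>(referencedDataFiles)));
    }

    // Each data file in this toy holds one copy of id=1; count the files with no deletes.
    int liveCopiesOfRowOne() {
      Set<String> live = new HashSet<>();
      Set<String> deleted = new HashSet<>();
      for (Snapshot s : snapshots) {
        live.add(s.addedDataFile);
        deleted.addAll(s.deleteTargets);
      }
      live.removeAll(deleted);
      return live.size();
    }
  }

  // Replays the INSERT plus the two concurrent UPDATEs from the SQL example.
  static int runConcurrentUpdates(boolean validate) {
    Table table = new Table();
    table.commitRowDelta(0, "insert.parquet", Collections.<String>emptySet(), validate);

    int start = table.version(); // both updates plan against the same snapshot
    Set<String> referenced = Collections.singleton("insert.parquet");

    table.commitRowDelta(start, "update-x.parquet", referenced, validate);
    try {
      table.commitRowDelta(start, "update-y.parquet", referenced, validate);
    } catch (IllegalStateException e) {
      // The second writer is rejected and would have to retry from the new snapshot.
    }
    return table.liveCopiesOfRowOne();
  }

  public static void main(String[] args) {
    System.out.println("without validation: " + runConcurrentUpdates(false));
    System.out.println("with validation:    " + runConcurrentUpdates(true));
  }
}
```

In this model, skipping the validation leaves two live copies of id=1, while checking only the referenced data files is enough to reject the second commit. Whether that narrower check covers every case the filter-based `validateAddedDeleteFiles` is meant to catch is the open question in this comment.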
We might want to make this a separate issue to keep changes smaller and
reviews easier.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]