pvary opened a new issue, #5339:
URL: https://github.com/apache/iceberg/issues/5339

   While reviewing #4904, I found the following behaviour with a slightly modified 
`TestIcebergInputFormats.testFilterExp` test:
   ```java
   [..]
       helper = new TestHelper(conf, tables, location.toString(), SCHEMA, SPEC, fileFormat, temp);
   [..]
       helper.createTable();
   
       List<Record> expectedRecords = helper.generateRandomRecords(2, 0L);
       expectedRecords.get(0).set(2, "2020-03-20");
       expectedRecords.get(1).set(2, "2020-03-20");
   
       DataFile dataFile1 = helper.writeFile(Row.of("2020-03-20", 0), expectedRecords);
       DataFile dataFile2 = helper.writeFile(Row.of("2020-03-21", 0), helper.generateRandomRecords(2, 0L));
       // This creates a transaction and adds the data files to it using 'table.newAppend()'
       helper.appendToTable(dataFile1, dataFile2);
   
       // Adding the same files again to the same table
       helper.appendToTable(dataFile1, dataFile2);
   ```
   
   The test adds the same two data files to the Iceberg table twice.
   As a result, the table contains duplicate rows, which is what I would 
expect unless we prevent this situation in the first place.
   
   I have not tested this yet, but based on the specification it is not possible 
to deduplicate the data using any of the V2 delete formats. Deduplication is only 
possible with knowledge of the data and the data files of the Iceberg table.
   
   Questions for the community:
   - Is this the expected behaviour?
   - Do we want to prevent this situation by checking the uniqueness of the 
file names when adding new data files to a table? If so, what should we do when 
a duplicate is found?
       - Throw an exception?
       - Log a warning message, and skip adding the file?
   
   My first instinct would be to prevent adding the same file to the same table 
again and throw an exception, but I would like to hear what others think about 
this issue.
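
To make the "throw an exception" option concrete, here is a minimal sketch of a uniqueness check on file paths. It is a plain in-memory stand-in, not Iceberg code: the class `DuplicateFileGuard`, its methods, and the example paths are all hypothetical, and a real check would have to look at the files already tracked in the table metadata.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a table's append path; not part of the Iceberg API.
class DuplicateFileGuard {
  // Paths of the data files already committed to the simulated table.
  private final Set<String> committedPaths = new HashSet<>();

  // Appends the whole batch or nothing: if any path is already committed,
  // or repeated within the batch, an exception is thrown and no path is added.
  void append(List<String> newPaths) {
    Set<String> batch = new HashSet<>();
    for (String path : newPaths) {
      if (committedPaths.contains(path) || !batch.add(path)) {
        throw new IllegalStateException("Data file already added to table: " + path);
      }
    }
    committedPaths.addAll(batch);
  }

  int fileCount() {
    return committedPaths.size();
  }

  public static void main(String[] args) {
    DuplicateFileGuard table = new DuplicateFileGuard();
    // Made-up paths, for illustration only.
    List<String> files = Arrays.asList("data/file1.parquet", "data/file2.parquet");

    table.append(files); // first append succeeds
    try {
      table.append(files); // second append of the same files is rejected
    } catch (IllegalStateException e) {
      System.out.println("Rejected: " + e.getMessage());
    }
    System.out.println("Files in table: " + table.fileCount());
  }
}
```

The "log a warning and skip" alternative would replace the `throw` with a log statement and leave the duplicate path out of the batch.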
   
   Thanks,
   Peter


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

