amogh-jahagirdar commented on PR #10784:
URL: https://github.com/apache/iceberg/pull/10784#issuecomment-2623396968
> For my edification, can someone please explain how duplicate file entries
> in manifests can arise? Can two entries for the same file occur in a single
> manifest? Can even two manifests be in the same manifest list if they overlap
> (have an entry for the same file in common)? I'd have thought both of these
> situations would be bugs. Or are there actual sequences of operations that lead
> to such outcomes, similar to how dangling deletes can occur?
Sure, so one example of an issue that happened in the past is that in the
Kafka Connect integration we ended up appending the same file multiple times.
We rectified that in Kafka Connect and the Iceberg library for duplicate
appends in the *same* snapshot, but it's still technically possible to append
the same file across different snapshots (at least in the reference
implementation and probably a few others).
Detecting overlapping files requires reading through the current manifest(s)
to deduplicate, which would be prohibitively expensive if performed on every
append.
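To make the cost concrete, here is a minimal sketch of that deduplication
pass. It is not Iceberg code: `find_duplicate_files` and the dict-of-lists
model of manifests are hypothetical stand-ins, assuming each manifest can be
reduced to the list of data file paths it references.

```python
from collections import defaultdict


def find_duplicate_files(manifests):
    """Map each data file path listed more than once to the manifests listing it.

    `manifests` is a hypothetical stand-in for real manifest contents:
    {manifest_path: [data_file_path, ...]}.
    """
    seen = defaultdict(list)
    # Every entry of every current manifest has to be visited; this full pass
    # is the cost that would be paid on each append.
    for manifest_path, file_paths in manifests.items():
        for file_path in file_paths:
            seen[file_path].append(manifest_path)
    return {path: owners for path, owners in seen.items() if len(owners) > 1}


manifests = {
    "m1.avro": ["data/a.parquet", "data/b.parquet"],
    "m2.avro": ["data/b.parquet", "data/c.parquet"],
}
duplicates = find_duplicate_files(manifests)
```

The work grows with the total number of live entries in the table, not with
the size of the append, which is why it fits better in an on-demand repair
step than in the commit path.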
Overlapping files across manifests do imply a bug in whatever integration
is writing to the table. However, even after those bugs are fixed, it still
makes sense for Iceberg to expose repair procedures to correct those tables to
unblock users from using their tables.
More generally, imo it makes sense to offer a general `RepairTable`
procedure with options that let users correct their tables as best as
possible in case a bad implementation ended up writing to them.
> Also, I understand that there was an old bug where data file size was
> written incorrectly and this actually caused reads to fail, and this is the
> motivation for correcting the statistics in metadata. However, that bug was
> long fixed, so I wonder if there are still known situations where these
> statistics need to be corrected.
Yeah, I think the same reasoning I mentioned above applies: some random
implementation may end up writing incorrect statistics for whatever reason, and
it'd be good for repair table to correct that, since stats are something that
can be deterministically corrected.
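As a rough illustration of "deterministically corrected", the sketch below
resets a recorded file size to whatever the storage layer reports. The field
names `file_path` and `file_size_in_bytes` follow the Iceberg spec's data file
struct; `corrected_file_sizes`, the plain-dict entries, and the `size_of`
callback are assumptions for illustration, not the procedure proposed in this
PR.

```python
import os


def corrected_file_sizes(entries, size_of=os.path.getsize):
    """Return entries with file_size_in_bytes reset to the actual file size.

    `entries` are plain dicts standing in for manifest entries (hypothetical).
    `size_of` maps a file path to its real size; the default only works for
    local paths, so object stores would need their own lookup.
    """
    fixed = []
    for entry in entries:
        actual = size_of(entry["file_path"])
        if entry["file_size_in_bytes"] != actual:
            # Copy rather than mutate so the original metadata stays intact.
            entry = {**entry, "file_size_in_bytes": actual}
        fixed.append(entry)
    return fixed


# Usage with an in-memory size lookup instead of a real filesystem.
actual_sizes = {"data/a.parquet": 10, "data/b.parquet": 7}
entries = [
    {"file_path": "data/a.parquet", "file_size_in_bytes": 10},
    {"file_path": "data/b.parquet", "file_size_in_bytes": 5},  # wrong size
]
repaired = corrected_file_sizes(entries, size_of=actual_sizes.__getitem__)
```

The correction is deterministic because the true value can always be
recomputed from the file itself; nothing has to be guessed.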
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]