> Actually, the above is one of the reasons our customers overwrite Parquet files. They discover that the Parquet file contains incorrect data - they fix it by recreating a new Parquet file with the corrected data, and replace the old version of the file with the new version of the file, bypassing Iceberg completely.

If that is the case, I am a bit confused; isn't the current Iceberg behavior what they need? Do you want to prevent this from happening?

> The manifest files do not have a provision for specifying the version of a data file. So, it is not really possible to read "s3 path + version" because the manifest files do not have the version information for the data file.

What I mean is that you can define your file location as s3 path + version, for example: s3://my-bucket/my/path/to/file?version=123456. If this value is recorded as the location in DataFile and your FileIO can parse this location, then there is no way to replace the file at that given location.

> Part of the encryption proposal does include adding a data validation checksum or alike.

Yes, Russell made a good point: the proposed encryption support would make the process of replacing a file very hard if authenticated encryption is enabled (although it is still possible to decrypt and reuse the same AAD if the user is a super user). You can read more about it here:
https://docs.google.com/document/d/1kkcjr9KrlB9QagRX3ToulG_Rf-65NMSlVANheDNzJq4

-Jack

On Sun, May 16, 2021 at 9:38 PM <[email protected]> wrote:

> Part of the encryption proposal does include adding a data validation checksum or alike. That could potentially work to prevent corrupted requests, since you would get an error that the file does not match the checksum, but you would still end up with lots of weird errors if users are replacing files and deleting things outside of the Iceberg API.
>
> Sent from my iPhone
>
> On May 16, 2021, at 11:23 PM, Vivekanand Vellanki <[email protected]> wrote:
>
> The manifest files do not have a provision for specifying the version of a data file. So, it is not really possible to read "s3 path + version" because the manifest files do not have the version information for the data file.
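The "s3 path + version" location described above could be parsed along these lines. This is a minimal sketch; the function name and the `?version=` query-string convention are assumptions for illustration, not Iceberg's actual FileIO behavior:

```python
from urllib.parse import urlparse, parse_qs

def parse_versioned_location(location: str):
    """Split a hypothetical "path + version" location into (path, version).

    Returns (path, None) when no version is encoded, so plain
    locations keep working unchanged.
    """
    parsed = urlparse(location)
    qs = parse_qs(parsed.query)
    path = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    version = qs.get("version", [None])[0]
    return path, version

path, version = parse_versioned_location(
    "s3://my-bucket/my/path/to/file?version=123456")
# path    -> "s3://my-bucket/my/path/to/file"
# version -> "123456"
```

A FileIO that resolved locations this way would issue versioned reads against the storage layer, so a later overwrite of the same key could never be picked up by an existing snapshot.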
> https://iceberg.apache.org/spec/#manifests
>
> On Mon, May 17, 2021 at 9:39 AM Jack Ye <[email protected]> wrote:
>
>> I actually think there is an argument against that use case of returning an error after time t3. Maybe the user does want to change a row in a file directly and replace the file to get an updated result quickly, bypassing the Iceberg API. In that case, failing that query after t3 would block that use case. The statistics in the manifest might be wrong, but we can further argue that the user can directly modify statistics and replace files all the way up to the snapshot to make sure everything continues to work.
>>
>> In general, if a user decides to bypass the contract set by Iceberg, I believe that we should not predict the behavior and compensate the system for that behavior, because users can bypass the contract in all different ways; it would open the door to satisfying many awkward use cases and, in the end, break fundamental assumptions.
>>
>> In the case you described, I think the existing Iceberg behavior makes total sense. If you would like to achieve what you described later, you can potentially update your FileIO and leverage the versioning feature of the underlying storage to make sure that the file uploaded never has the same identifier, so that users cannot replace a file at t3. For example, if you are running on S3, you can enable S3 versioning and extend the S3FileIO so that each file path is not just the s3 path, but the s3 path + version.
>>
>> But this is just what I think; let's see how others reply.
>>
>> -Jack
>>
>> On Sun, May 16, 2021 at 8:52 PM Vivekanand Vellanki <[email protected]> wrote:
>>
>>> From an Iceberg perspective, I understand what you are saying.
>>>
>>> A lot of our customers add/remove files to the table using scripts.
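The S3-versioning behavior leaned on above can be modeled with a toy in-memory store. This is purely illustrative; real S3 versioning works through server-assigned object VersionIds, and this class is a hypothetical stand-in showing why a read pinned to a version survives an out-of-band overwrite:

```python
import itertools

class VersionedStore:
    """Toy model of S3-style versioning: every put gets a fresh version
    id, and reads pinned to a version are immune to later overwrites."""

    def __init__(self):
        self._versions = {}   # (path, version_id) -> bytes
        self._latest = {}     # path -> newest version_id
        self._counter = itertools.count(1)

    def put(self, path, data):
        vid = str(next(self._counter))
        self._versions[(path, vid)] = data
        self._latest[path] = vid
        return vid

    def get(self, path, version_id=None):
        vid = version_id or self._latest[path]
        return self._versions[(path, vid)]

store = VersionedStore()
v1 = store.put("s3://bucket/f1.parquet", b"original rows")
store.put("s3://bucket/f1.parquet", b"rewritten rows")  # bypassing Iceberg
# A manifest that pinned "path + version" still reads the original bytes:
assert store.get("s3://bucket/f1.parquet", v1) == b"original rows"
```

The point of the sketch is only the contract: once a snapshot records a (path, version) pair, replacing the object at that path cannot change what the snapshot reads.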
>>> The typical workflow would be:
>>> - Create Parquet files using other tools
>>> - Add these files to the Iceberg table
>>>
>>> Similarly for removing Parquet files from the table. I understand that Iceberg doesn't delete the data file until all snapshots that refer to that data file expire. However, the customer can delete the file directly - they might understand that a query on a snapshot will fail.
>>>
>>> I am concerned that an unintentional mistake in updating the Iceberg table results in incorrect results while querying an Iceberg snapshot. It is ok to return an error when a file referred to by a snapshot does not exist.
>>>
>>> This issue can be addressed by adding a version identifier (e.g. mtime) in the DataFile object and including this information in the manifest file. This ensures that snapshot reads are correct even when users make mistakes while adding/removing files to the table.
>>>
>>> We can work on this, if there is sufficient interest.
>>>
>>> On Sun, May 16, 2021 at 8:34 PM <[email protected]> wrote:
>>>
>>>> In the real system, each file would have a unique universal identifier. When Iceberg does a delete, it doesn't actually remove the file; it creates a new metadata file which no longer includes that file. When you attempt to access the table at time one, you are actually just reading the first metadata file, not the new metadata file which is missing the entry for the deleted file.
>>>>
>>>> The only way to end up in the scenario you describe is if you were manually deleting files and adding files using the Iceberg internal API and not something like Spark or Flink.
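The proposal above, recording a version identifier such as mtime alongside each data file, might look like the following sketch. The helper names are hypothetical, and Iceberg's DataFile carries no such field today; this only illustrates the validation the proposal would enable:

```python
import os
import tempfile

def record_entry(path):
    """Hypothetical manifest entry: file path plus its mtime at commit time."""
    return {"path": path, "mtime_ns": os.stat(path).st_mtime_ns}

def check_entry(entry):
    """Fail the read if the file on disk no longer matches the snapshot."""
    if os.stat(entry["path"]).st_mtime_ns != entry["mtime_ns"]:
        raise ValueError(f"{entry['path']} was replaced after the snapshot")

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"rows")
entry = record_entry(f.name)
check_entry(entry)            # unchanged file: read proceeds
os.utime(f.name, ns=(0, 0))   # simulate an out-of-band file replacement
# check_entry(entry) would now raise ValueError instead of silently
# returning data from the replaced file.
```

With this check, the t3 scenario in the thread fails loudly (stale entry) rather than returning potentially incorrect results.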
>>>> What actually happens is something like:
>>>>
>>>> T1: metadata says f1_uuid exists
>>>>
>>>> The data is deleted.
>>>> T2: metadata no longer lists f1
>>>>
>>>> New data is written.
>>>> T3: metadata says f3_uuid now exists
>>>>
>>>> Data files are only physically deleted by Iceberg through the expire snapshots command. This removes the snapshot metadata as well as any data files which are only referred to by the snapshots that are expired.
>>>>
>>>> If you are using the internal API (org.apache.iceberg.Table), then it is your responsibility not to perform operations or delete files that would violate the uniqueness of each snapshot. In this case, you would similarly solve the problem by just not physically deleting the file when you remove it, although usually having unique names every time you add data is a good safety measure.
>>>>
>>>> On May 16, 2021, at 4:53 AM, Vivekanand Vellanki <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I would like to understand if Iceberg supports the following scenario:
>>>>
>>>> - At time t1, there's a table with a file f1.parquet
>>>> - At time t2, f1.parquet is removed from the table. f1.parquet is also deleted from the filesystem
>>>> - Querying table@t1 results in errors since f1.parquet is no longer available in the filesystem
>>>> - At time t3, f1.parquet is recreated and added back to the table
>>>> - Querying table@t1 now results in potentially incorrect results since f1.parquet is now present in the filesystem
>>>>
>>>> Should there be a version identifier for each data-file in the manifest file to handle such scenarios?
>>>>
>>>> Thanks
>>>> Vivek
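The "unique names every time you add data" safety measure mentioned in the thread can be sketched as below. The helper is hypothetical, but embedding a UUID in each data file name is the general idea: a recreated f1.parquet would land at a new path and could never shadow the file an old snapshot points to:

```python
import uuid

def unique_data_file_name(prefix="data", ext="parquet"):
    """Embed a random UUID so a rewritten file can never reuse an old path."""
    return f"{prefix}-{uuid.uuid4()}.{ext}"

a = unique_data_file_name()
b = unique_data_file_name()
# Two writes of "the same" data never collide on path:
assert a != b
```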
