On Mon, May 17, 2021 at 10:20 AM Jack Ye <[email protected]> wrote:
> > Actually, the above is one of the reasons our customers overwrite
> > Parquet files. They discover that the Parquet file contains incorrect data
> > - they fix it by recreating a new Parquet file with the corrected data; and
> > replace the old version of the file with the new version of the file
> > bypassing Iceberg completely.
>
> If that is the case, I am a bit confused, isn't the current Iceberg
> behavior what they need? Do you want to prevent this from happening?
> Changing files without informing Iceberg will cause issues.

I am assuming that the Parquet/ORC readers will assume that the fileSize is
accurate and read the footer/tail assuming that to be true. If files are
modified, Iceberg should be informed.

My concern is that even when Iceberg is informed, snapshot reads will return
incorrect results. I believe this should result in an error, since snapshots
should be immutable and queries on snapshots should return the same results.

> > The manifest files do not have a provision for specifying the version of
> > a data file. So, it is not really possible to read "s3 path + version"
> > because the manifest files do not have the version information for the
> > data file.
>
> What I mean is that you can define your file location as s3 path +
> version, for example: s3://my-bucket/my/path/to/file?version=123456. If
> this value is recorded as the location in DataFile and your FileIO can
> parse this location, then there is no way to replace the file in that given
> location.
>
> > Part of the encryption proposal does include adding a data validation
> > checksum or the like.
>
> Yes, Russell made a good point, the proposed encryption support would make
> the process of replacing a file very hard if authenticated encryption is
> enabled (although still possible to decrypt and reuse the same AAD if the
> user is a super user).
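[Editor's note: Jack's suggestion of encoding the object version into the recorded file location could be sketched roughly as follows. This is only an illustration of the idea; the `?version=` convention and the helper name are hypothetical, not part of Iceberg's S3FileIO.]

```python
from urllib.parse import urlparse, parse_qs

def parse_versioned_location(location):
    """Split a hypothetical versioned location such as
    s3://my-bucket/my/path/to/file?version=123456 into
    (bucket, key, version_id). A FileIO that understands this form
    could always fetch that exact object version, so a file replaced
    at the same path would never be seen by older snapshots."""
    parsed = urlparse(location)
    version = parse_qs(parsed.query).get("version", [None])[0]
    return parsed.netloc, parsed.path.lstrip("/"), version
```

[A custom FileIO could then pass the parsed version as the object version of an S3 GET (S3 versioning supports reading a specific version), so the same snapshot always resolves to the same bytes.]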
> You can read more about it here
> <https://docs.google.com/document/d/1kkcjr9KrlB9QagRX3ToulG_Rf-65NMSlVANheDNzJq4>

Thanks, I will take a look at the proposal. I am looking for something that
is supported by Iceberg natively.

> -Jack
>
> On Sun, May 16, 2021 at 9:38 PM <[email protected]> wrote:
>
>> Part of the encryption proposal does include adding a data validation
>> checksum or the like. That could potentially work to prevent corrupted
>> requests, since you would get an error that the file does not match the
>> checksum, but would still end up with lots of weird errors if users are
>> replacing files and deleting things outside of the Iceberg API.
>>
>> Sent from my iPhone
>>
>> On May 16, 2021, at 11:23 PM, Vivekanand Vellanki <[email protected]>
>> wrote:
>>
>> The manifest files do not have a provision for specifying the version of
>> a data file. So, it is not really possible to read "s3 path + version"
>> because the manifest files do not have the version information for the
>> data file.
>>
>> https://iceberg.apache.org/spec/#manifests
>>
>> On Mon, May 17, 2021 at 9:39 AM Jack Ye <[email protected]> wrote:
>>
>>> I actually think there is an argument against that use case of returning
>>> an error after time t3. Maybe the user does want to change a row in a file
>>> directly and replace the file to get an updated result quickly, bypassing
>>> the Iceberg API. In that case, failing that query after t3 would block that
>>> use case. The statistics in the manifest might be wrong, but we can further
>>> argue that the user can directly modify statistics and replace files all
>>> the way up to the snapshot to make sure everything continues to work.
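[Editor's note: the data-validation checksum Russell mentions could behave roughly like this sketch; this is my own illustration, not the encryption proposal's actual design. The idea: record a digest of the file at commit time and refuse to read a file whose bytes no longer match.]

```python
import hashlib

def file_sha256(path):
    """Digest of the file's current bytes, streamed in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def checked_open(path, recorded_digest):
    """Error out, instead of silently returning different rows, when
    the file at this path no longer matches the committed digest."""
    if file_sha256(path) != recorded_digest:
        raise ValueError(f"{path} does not match its recorded checksum")
    return open(path, "rb")
```

[As Russell notes, this turns a replaced file into a hard error rather than a wrong answer, but users replacing files outside the Iceberg API would then see checksum failures everywhere.]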
>>>
>>> In general, if a user decides to bypass the contract set by Iceberg, I
>>> believe that we should not predict the behavior and compensate the system
>>> for that behavior, because users can bypass the contract in all different
>>> ways, and it would open the door to satisfying many awkward use cases and,
>>> in the end, break fundamental assumptions.
>>>
>>> In the case you described, I think the existing Iceberg behavior makes
>>> total sense. If you would like to achieve what you described later, you can
>>> potentially update your FileIO and leverage the versioning feature of the
>>> underlying storage to make sure that the file uploaded never has the same
>>> identifier, so that users cannot replace a file at t3. For example, if you
>>> are running on S3, you can enable S3 versioning and extend S3FileIO so
>>> that each file path is not just the s3 path, but the s3 path + version.
>>>
>>> But this is just what I think, let's see how others reply.
>>>
>>> -Jack
>>>
>>> On Sun, May 16, 2021 at 8:52 PM Vivekanand Vellanki <[email protected]>
>>> wrote:
>>>
>>>> From an Iceberg perspective, I understand what you are saying.
>>>>
>>>> A lot of our customers add/remove files to the table using scripts. The
>>>> typical workflow would be:
>>>> - Create Parquet files using other tools
>>>> - Add these files to the Iceberg table
>>>>
>>>> Similarly, for removing Parquet files from the table. I understand that
>>>> Iceberg doesn't delete the data file until all snapshots that refer to that
>>>> data file expire. However, the customer can delete the file directly - they
>>>> might understand that a query on a snapshot will fail.
>>>>
>>>> I am concerned that an unintentional mistake in updating the Iceberg
>>>> table results in incorrect results while querying an Iceberg snapshot. It
>>>> is ok to return an error when a file referred to by a snapshot does not
>>>> exist.
>>>>
>>>> This issue can be addressed by adding a version identifier (e.g. mtime)
>>>> in the DataFile object and including this information in the manifest file.
>>>> This ensures that snapshot reads are correct even when users make mistakes
>>>> while adding/removing files to the table.
>>>>
>>>> We can work on this, if there is sufficient interest.
>>>>
>>>> On Sun, May 16, 2021 at 8:34 PM <[email protected]> wrote:
>>>>
>>>>> In the real system, each file would have a unique universal identifier.
>>>>> When Iceberg does a delete, it doesn't actually remove the file; it
>>>>> creates a new meta-data file which no longer includes that file. When you
>>>>> attempt to access the table at time t1, you are actually just reading the
>>>>> first meta-data file, not the new meta-data file, which is missing the
>>>>> entry for the deleted file.
>>>>>
>>>>> The only way to end up in the scenario you describe is if you were
>>>>> manually deleting files and adding files using the Iceberg internal API
>>>>> and not something like Spark or Flink.
>>>>>
>>>>> What actually happens is something like:
>>>>> T1: metadata says f1_uuid exists
>>>>>
>>>>> The data is deleted
>>>>> T2: metadata no longer lists f1
>>>>>
>>>>> New data is written
>>>>> T3: metadata says f3_uuid now exists
>>>>>
>>>>> Data files are only physically deleted by Iceberg through the expire
>>>>> snapshots command. This removes the snapshot meta-data as well as any data
>>>>> files which are only referred to by those snapshots that are expired.
>>>>>
>>>>> If you are using the internal API (org.apache.iceberg.Table), then it
>>>>> is your responsibility to not perform operations or delete files that
>>>>> would violate the uniqueness of each snapshot. In this case, you would
>>>>> similarly solve the problem by just not physically deleting the file when
>>>>> you remove it. Although usually, having unique names every time you add
>>>>> data is a good safety measure.
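[Editor's note: Vivekanand's proposal above, recording a version identifier such as mtime per DataFile in the manifest, could be sketched as below. The function names are hypothetical; a real change would extend the manifest schema rather than store values on the side.]

```python
import os

def record_file_version(path):
    """What a commit would store alongside the DataFile entry: a cheap
    version identifier, here the modification time in nanoseconds."""
    return os.stat(path).st_mtime_ns

def check_file_version(path, recorded_mtime_ns):
    """What a snapshot read would verify before trusting the bytes:
    raise if the file was replaced after the snapshot was committed."""
    if os.stat(path).st_mtime_ns != recorded_mtime_ns:
        raise RuntimeError(f"{path} changed since the snapshot was written")
```

[Unlike a checksum, mtime is cheap to capture at commit time, though it can be forged or reset; an object-store ETag or version ID would be a stronger identifier with the same shape.]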
>>>>>
>>>>> On May 16, 2021, at 4:53 AM, Vivekanand Vellanki <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I would like to understand if Iceberg supports the following scenario:
>>>>>
>>>>> - At time t1, there's a table with a file f1.parquet
>>>>> - At time t2, f1.parquet is removed from the table. f1.parquet is
>>>>> also deleted from the filesystem
>>>>> - Querying table@t1 results in errors since f1.parquet is no
>>>>> longer available in the filesystem
>>>>> - At time t3, f1.parquet is recreated and added back to the table
>>>>> - Querying table@t1 now results in potentially incorrect results
>>>>> since f1.parquet is now present in the filesystem
>>>>>
>>>>> Should there be a version identifier for each data-file in the
>>>>> manifest file to handle such scenarios?
>>>>>
>>>>> Thanks
>>>>> Vivek
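[Editor's note: the t1-t3 timeline above can be reproduced with a minimal in-memory model, purely illustrative, that shows why the re-created file is silently picked up: the t1 snapshot metadata still lists the path, and a read just resolves paths against whatever storage holds now.]

```python
filesystem = {}   # path -> bytes, standing in for S3/HDFS
snapshots = {}    # snapshot id -> frozen set of data file paths

# t1: the table contains f1.parquet
filesystem["f1.parquet"] = b"original rows"
snapshots["t1"] = frozenset({"f1.parquet"})

# t2: f1.parquet is removed from the table AND deleted from storage;
# reading snapshot t1 at this point would fail: the path no longer resolves
snapshots["t2"] = frozenset()
del filesystem["f1.parquet"]

# t3: a different f1.parquet is recreated at the same path
filesystem["f1.parquet"] = b"different rows"
snapshots["t3"] = frozenset({"f1.parquet"})

def read_snapshot(snapshot_id):
    """A snapshot read pins paths, not contents: it resolves the paths
    listed in the snapshot's metadata against current storage."""
    return {path: filesystem[path] for path in snapshots[snapshot_id]}
```

[At t3, reading snapshot t1 succeeds but returns the new bytes, which is exactly the silent wrong answer the question describes; a per-file version identifier or checksum in the manifest would turn it into an error instead.]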
