Hey Vivek,

I think, as you can see throughout this discussion, there are a number of
issues with modifying the data files outside of the Iceberg APIs.  To
maintain data integrity, it's advised to operate on the data only through
Iceberg.

In many ways this is similar to trying to change the history of a single
commit within git without changing anything before or after the commit,
which isn't a good idea.

Beyond the issue of file size (which is used by some engines for split
calculation and may result in failures), Iceberg also tracks stats at the
file level, which are used for filtering and metadata operations.  I'd be
more concerned that if the stats don't align with the new data file, you
would get incorrect results (in addition to Russell's comments).

I feel like the correct approach for this would be to use the APIs to
perform a replace operation of the file(s) in question.  This doesn't fix
your original issue of time travel, but the only alternative for that would
be to reset back to the original snapshot and then replay all of the
commits after.  That might be a better approach if you want to rewrite
history, but I don't think there's an atomic way to do that at the moment.
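
As a rough, untested sketch of what that replace could look like through
the core API (the table handle and the DataFile objects for the old and
corrected files are placeholders you'd have to construct yourself, e.g. via
DataFiles.builder()):

import java.util.Collections;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;

class ReplaceExample {
  // Swap the corrected file in for the bad one as a single atomic commit,
  // so the table never references a file whose bytes were edited in place.
  static void replaceDataFile(Table table, DataFile badFile, DataFile fixedFile) {
    table.newRewrite()
        .rewriteFiles(Collections.singleton(badFile),    // files to remove
                      Collections.singleton(fixedFile))  // files to add
        .commit();
  }
}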

Just a few things to consider,
-Dan






On Sun, May 16, 2021 at 10:32 PM Vivekanand Vellanki <[email protected]>
wrote:

> Thinking about this more, if a Parquet file is deleted from the table and
> added back (both transactions using Iceberg APIs), this will lead to issues
> with snapshot reads on the table.
>
> If the fileSize changes, I believe snapshot reads will fail (depending on
> how the reader handles the mismatch).
>
> If the fileSize remains unchanged but the data has changed, the snapshot
> query now returns different data.
>
> On Mon, May 17, 2021 at 10:41 AM Vivekanand Vellanki <[email protected]>
> wrote:
>
>>
>>
>> On Mon, May 17, 2021 at 10:20 AM Jack Ye <[email protected]> wrote:
>>
>>> > Actually, the above is one of the reasons our customers overwrite
>>> Parquet files. They discover that a Parquet file contains incorrect data,
>>> fix it by creating a new Parquet file with the corrected data, and then
>>> replace the old version of the file with the new one, bypassing Iceberg
>>> completely.
>>>
>>> If that is the case, I am a bit confused: isn't the current Iceberg
>>> behavior what they need? Do you want to prevent this from happening?
>>>
>>
>> Changing files without informing Iceberg will cause issues. I expect the
>> Parquet/ORC readers to treat the recorded fileSize as accurate and to read
>> the footer/tail on that basis.
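>>
>> For example, a Parquet reader locates the footer relative to the end of
>> the file, so a reader that trusts a stale length seeks to the wrong offset
>> and fails (or reads garbage). A simplified local-file sketch, not how
>> Iceberg actually wires this up:
>>
>> import java.io.RandomAccessFile;
>> import java.nio.ByteBuffer;
>> import java.nio.ByteOrder;
>>
>> class FooterExample {
>>   // Read the Parquet footer using a length taken from table metadata
>>   // rather than from the filesystem. If the file was replaced and its
>>   // size changed, the seeks below land in the wrong place.
>>   static ByteBuffer readFooter(String path, long recordedLength) throws Exception {
>>     try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
>>       byte[] tail = new byte[8];  // 4-byte footer length + "PAR1" magic
>>       f.seek(recordedLength - 8);
>>       f.readFully(tail);
>>       int footerLen = ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
>>       byte[] footer = new byte[footerLen];
>>       f.seek(recordedLength - 8 - footerLen);
>>       f.readFully(footer);
>>       return ByteBuffer.wrap(footer);
>>     }
>>   }
>> }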
>>
>> If files are modified, Iceberg should be informed. My concern is that
>> even when Iceberg is informed, snapshot reads will return incorrect
>> results. I believe this should result in an error instead, since snapshots
>> should be immutable and queries on a snapshot should always return the
>> same results.
>>
>> > The manifest files do not have a provision for specifying the version
>>> of a data file. So, it is not really possible to read "s3 path + version"
>>> because the manifest files do not have the version information for the data
>>> file.
>>>
>>> What I mean is that you can define your file location as s3 path +
>>> version, for example: s3://my-bucket/my/path/to/file?version=123456. If
>>> this value is recorded as the location in DataFile and your FileIO can
>>> parse this location, then there is no way to replace the file at that
>>> given location.
>>>
>>> > Part of the encryption proposal does include adding a data validation
>>> checksum or similar.
>>>
>>> Yes, Russell made a good point: the proposed encryption support would
>>> make the process of replacing a file very hard if authenticated encryption
>>> is enabled (although it is still possible to decrypt and reuse the same AAD
>>> if the user is a super user). You can read more about it here
>>> <https://docs.google.com/document/d/1kkcjr9KrlB9QagRX3ToulG_Rf-65NMSlVANheDNzJq4>
>>>
>>
>> Thanks, I will take a look at the proposal.
>>
>> I am looking for something that is supported by Iceberg natively.
>>
>>> -Jack
>>>
>>>
>>> On Sun, May 16, 2021 at 9:38 PM <[email protected]> wrote:
>>>
>>>> Part of the encryption proposal does include adding a data validation
>>>> checksum or similar. That could potentially work to prevent corrupted
>>>> reads, since you would get an error that the file does not match the
>>>> checksum, but you would still end up with lots of weird errors if users
>>>> are replacing files and deleting things outside of the Iceberg API.
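>>>>
>>>> Conceptually the check would be something like this (illustration only,
>>>> not the actual proposal; the expectedSha256 recorded alongside the file
>>>> is a hypothetical field):
>>>>
>>>> import java.io.InputStream;
>>>> import java.security.MessageDigest;
>>>> import java.util.Arrays;
>>>> import org.apache.iceberg.io.InputFile;
>>>>
>>>> class ChecksumExample {
>>>>   // Reject a data file whose bytes no longer match the checksum that was
>>>>   // recorded when the file was committed (hypothetical metadata field).
>>>>   static void verify(InputFile file, byte[] expectedSha256) throws Exception {
>>>>     MessageDigest digest = MessageDigest.getInstance("SHA-256");
>>>>     try (InputStream in = file.newStream()) {
>>>>       byte[] buf = new byte[8192];
>>>>       int n;
>>>>       while ((n = in.read(buf)) != -1) {
>>>>         digest.update(buf, 0, n);
>>>>       }
>>>>     }
>>>>     if (!Arrays.equals(digest.digest(), expectedSha256)) {
>>>>       throw new RuntimeException("Data file changed outside Iceberg: " + file.location());
>>>>     }
>>>>   }
>>>> }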
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On May 16, 2021, at 11:23 PM, Vivekanand Vellanki <[email protected]>
>>>> wrote:
>>>>
>>>>
>>>> The manifest files do not have a provision for specifying the version
>>>> of a data file. So, it is not really possible to read "s3 path + version"
>>>> because the manifest files do not have the version information for the data
>>>> file.
>>>>
>>>> https://iceberg.apache.org/spec/#manifests
>>>>
>>>> On Mon, May 17, 2021 at 9:39 AM Jack Ye <[email protected]> wrote:
>>>>
>>>>> I actually think there is an argument against that use case of
>>>>> returning an error after time t3. Maybe the user does want to change a row
>>>>> in a file directly and replace the file to get an updated result quickly,
>>>>> bypassing the Iceberg API. In that case, failing the query after t3 would
>>>>> block that use case. The statistics in the manifest might be wrong, but we
>>>>> can further argue that the user can directly modify statistics and replace
>>>>> files all the way up to the snapshot to make sure everything continues to
>>>>> work.
>>>>>
>>>>> In general, if a user decides to bypass the contract set by Iceberg, I
>>>>> believe that we should not try to predict that behavior and compensate for
>>>>> it in the system, because users can bypass the contract in many different
>>>>> ways; doing so would open the door to many awkward use cases and, in the
>>>>> end, break fundamental assumptions.
>>>>>
>>>>> In the case you described, I think the existing Iceberg behavior
>>>>> makes total sense. If you would like to achieve what you described later,
>>>>> you can potentially update your FileIO and leverage the versioning feature
>>>>> of the underlying storage to make sure that an uploaded file never has the
>>>>> same identifier, so that users cannot replace a file at t3. For example, if
>>>>> you are running on S3, you can enable S3 versioning and extend the
>>>>> S3FileIO so that each file path is not just the s3 path, but the s3 path +
>>>>> version.
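>>>>>
>>>>> Very roughly, the idea is to capture the version id when the object is
>>>>> written and pin every read to it. A sketch with the AWS SDK v2 (not real
>>>>> S3FileIO code; the "?versionId=" location convention here is made up for
>>>>> illustration):
>>>>>
>>>>> import java.nio.file.Path;
>>>>> import software.amazon.awssdk.core.ResponseInputStream;
>>>>> import software.amazon.awssdk.core.sync.RequestBody;
>>>>> import software.amazon.awssdk.services.s3.S3Client;
>>>>> import software.amazon.awssdk.services.s3.model.GetObjectRequest;
>>>>> import software.amazon.awssdk.services.s3.model.GetObjectResponse;
>>>>> import software.amazon.awssdk.services.s3.model.PutObjectRequest;
>>>>>
>>>>> class VersionedS3Example {
>>>>>   // On write: upload the data file and record the returned version id
>>>>>   // (requires versioning to be enabled on the bucket) as part of the
>>>>>   // location stored in the DataFile.
>>>>>   static String writeVersioned(S3Client s3, Path localFile) {
>>>>>     String versionId = s3.putObject(
>>>>>         PutObjectRequest.builder().bucket("my-bucket").key("my/path/to/file").build(),
>>>>>         RequestBody.fromFile(localFile)).versionId();
>>>>>     return "s3://my-bucket/my/path/to/file?versionId=" + versionId;
>>>>>   }
>>>>>
>>>>>   // On read: parse the version id back out of the location and pin the GET
>>>>>   // to that exact version, so a later overwrite of the key is never seen.
>>>>>   static ResponseInputStream<GetObjectResponse> readVersioned(S3Client s3, String location) {
>>>>>     String marker = "?versionId=";
>>>>>     String versionId = location.substring(location.indexOf(marker) + marker.length());
>>>>>     return s3.getObject(GetObjectRequest.builder()
>>>>>         .bucket("my-bucket").key("my/path/to/file")
>>>>>         .versionId(versionId).build());
>>>>>   }
>>>>> }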
>>>>>
>>>>> But this is just what I think, let's see how others reply.
>>>>>
>>>>> -Jack
>>>>>
>>>>> On Sun, May 16, 2021 at 8:52 PM Vivekanand Vellanki <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> From an Iceberg perspective, I understand what you are saying.
>>>>>>
>>>>>> A lot of our customers add/remove files to the table using scripts.
>>>>>> The typical workflow would be:
>>>>>> - Create Parquet files using other tools
>>>>>> - Add these files to the Iceberg table
>>>>>>
>>>>>> Similarly, for removing Parquet files from the table. I understand
>>>>>> that Iceberg doesn't delete the data file until all snapshots that refer 
>>>>>> to
>>>>>> that data file expire. However, the customer can delete the file 
>>>>>> directly -
>>>>>> they might understand that a query on a snapshot will fail.
>>>>>>
>>>>>> I am concerned that an unintentional mistake in updating the Iceberg
>>>>>> table leads to incorrect results when querying an Iceberg snapshot. It
>>>>>> is ok to return an error when a file referred to by a snapshot does not
>>>>>> exist.
>>>>>>
>>>>>> This issue can be addressed by adding a version identifier (e.g.
>>>>>> mtime) to the DataFile object and including this information in the
>>>>>> manifest file. This ensures that snapshot reads are correct even when
>>>>>> users make mistakes while adding files to or removing files from the
>>>>>> table.
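>>>>>>
>>>>>> The check at read/planning time would be something along these lines
>>>>>> (illustration only; DataFile has no such field today, so recordedMtime
>>>>>> is a hypothetical value carried in the manifest):
>>>>>>
>>>>>> import java.nio.file.Files;
>>>>>> import java.nio.file.Paths;
>>>>>>
>>>>>> class VersionCheckExample {
>>>>>>   // Fail fast if the data file on disk is not the same version that the
>>>>>>   // manifest referenced (local-filesystem sketch for illustration).
>>>>>>   static void checkVersion(String path, long recordedMtime) throws Exception {
>>>>>>     long actualMtime = Files.getLastModifiedTime(Paths.get(path)).toMillis();
>>>>>>     if (actualMtime != recordedMtime) {
>>>>>>       throw new IllegalStateException(
>>>>>>           "Data file " + path + " was replaced after it was committed");
>>>>>>     }
>>>>>>   }
>>>>>> }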
>>>>>>
>>>>>> We can work on this, if there is sufficient interest.
>>>>>>
>>>>>> On Sun, May 16, 2021 at 8:34 PM <[email protected]> wrote:
>>>>>>
>>>>>>> In the real system each file would have a unique universal
>>>>>>> identifier. When Iceberg does a delete it doesn't actually remove the
>>>>>>> file; it creates a new meta-data file which no longer includes that file.
>>>>>>> When you attempt to access the table at time one, you are actually just
>>>>>>> reading the first meta-data file and not the new meta-data file, which is
>>>>>>> missing the entry for the deleted file.
>>>>>>>
>>>>>>> The only way to end up in the scenario you describe is if you were
>>>>>>> manually deleting files and adding files using the Iceberg internal API
>>>>>>> and not something like Spark or Flink.
>>>>>>>
>>>>>>> What actually happens is something like:
>>>>>>> T1: metadata says f1_uuid exists
>>>>>>>
>>>>>>> The data is deleted
>>>>>>> T2: metadata no longer lists f1_uuid
>>>>>>>
>>>>>>> New data is written
>>>>>>> T3: metadata says f3_uuid now exists
>>>>>>>
>>>>>>> Data files are only physically deleted by Iceberg through the expire
>>>>>>> snapshots command. This removes the snapshot meta-data as well as any data
>>>>>>> files which are only referred to by the snapshots that are expired.
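>>>>>>>
>>>>>>> Roughly, through the API (the table handle is a placeholder):
>>>>>>>
>>>>>>> import org.apache.iceberg.Table;
>>>>>>>
>>>>>>> class ExpireExample {
>>>>>>>   // Expire snapshots older than 7 days; only data files that are no longer
>>>>>>>   // referenced by any remaining snapshot get physically deleted.
>>>>>>>   static void expireOld(Table table) {
>>>>>>>     long cutoffMillis = System.currentTimeMillis() - 7 * 24 * 60 * 60 * 1000L;
>>>>>>>     table.expireSnapshots()
>>>>>>>         .expireOlderThan(cutoffMillis)
>>>>>>>         .commit();
>>>>>>>   }
>>>>>>> }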
>>>>>>>
>>>>>>> If you are using the internal API (org.apache.iceberg.Table) then it
>>>>>>> is your responsibility not to perform operations or delete files that would
>>>>>>> violate the uniqueness of each snapshot. In this case you would similarly
>>>>>>> solve the problem by just not physically deleting the file when you remove
>>>>>>> it, although usually having unique names every time you add data is a good
>>>>>>> safety measure.
>>>>>>>
>>>>>>> On May 16, 2021, at 4:53 AM, Vivekanand Vellanki <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I would like to understand if Iceberg supports the following
>>>>>>> scenario:
>>>>>>>
>>>>>>>    - At time t1, there's a table with a file f1.parquet
>>>>>>>    - At time t2, f1.parquet is removed from the table. f1.parquet
>>>>>>>    is also deleted from the filesystem
>>>>>>>    - Querying table@t1 results in errors since f1.parquet is no
>>>>>>>    longer available in the filesystem
>>>>>>>    - At time t3, f1.parquet is recreated and added back to the table
>>>>>>>    - Querying table@t1 now potentially returns incorrect results
>>>>>>>    since f1.parquet is present in the filesystem again
>>>>>>>
>>>>>>> Should there be a version identifier for each data-file in the
>>>>>>> manifest file to handle such scenarios?
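>>>>>>>
>>>>>>> For concreteness, by "querying table@t1" I mean a time-travel read,
>>>>>>> something like the following (sketch against the Java API; the table
>>>>>>> handle and t1 are placeholders):
>>>>>>>
>>>>>>> import org.apache.iceberg.Table;
>>>>>>> import org.apache.iceberg.TableScan;
>>>>>>>
>>>>>>> class TimeTravelExample {
>>>>>>>   // Plan a scan against the table state as of timestamp t1 (millis);
>>>>>>>   // the files read are whatever the t1 snapshot's manifests list.
>>>>>>>   static TableScan scanAsOf(Table table, long t1Millis) {
>>>>>>>     return table.newScan().asOfTime(t1Millis);
>>>>>>>   }
>>>>>>> }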
>>>>>>>
>>>>>>> Thanks
>>>>>>> Vivek
>>>>>>>
>>>>>>>
