> Actually, the above is one of the reasons our customers overwrite Parquet files. They discover that the Parquet file contains incorrect data - they fix it by recreating a new Parquet file with the corrected data, and replace the old version of the file with the new version of the file, bypassing Iceberg completely.

If that is the case, I am a bit confused; isn't the current Iceberg behavior what they need? Do you want to prevent this from happening?

> The manifest files do not have a provision for specifying the version of a data file. So, it is not really possible to read "s3 path + version" because the manifest files do not have the version information for the data file.

What I mean is that you can define your file location as s3 path + version, for example: s3://my-bucket/my/path/to/file?version=123456. If this value is recorded as the location in DataFile and your FileIO can parse this location, then there is no way to replace the file at that given location.

> Part of the encryption proposal does include adding a data validation checksum or alike.

Yes, Russell made a good point: the proposed encryption support would make the process of replacing a file very hard if authenticated encryption is enabled (although it is still possible to decrypt and reuse the same AAD if the user is a super user). You can read more about it here:
https://docs.google.com/document/d/1kkcjr9KrlB9QagRX3ToulG_Rf-65NMSlVANheDNzJq4

-Jack

On Sun, May 16, 2021 at 9:38 PM <[email protected]> wrote:

> Part of the encryption proposal does include adding a data validation checksum or alike. That could potentially work to prevent corrupted requests, since you would get an error that the file does not match the checksum, but you would still end up with lots of weird errors if users are replacing files and deleting things outside of the Iceberg API.
>
> Sent from my iPhone
>
> On May 16, 2021, at 11:23 PM, Vivekanand Vellanki <[email protected]> wrote:
>
> The manifest files do not have a provision for specifying the version of a data file. So, it is not really possible to read "s3 path + version" because the manifest files do not have the version information for the data file.
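The "s3 path + version" location described above could be parsed along these lines. This is a minimal sketch; the function name and the `?version=` query-string convention are assumptions for illustration, not Iceberg's actual FileIO behavior:

```python
from urllib.parse import urlparse, parse_qs

def parse_versioned_location(location: str):
    """Split a hypothetical "path + version" location into (path, version).

    Returns (path, None) when no version is encoded, so plain
    locations keep working unchanged.
    """
    parsed = urlparse(location)
    qs = parse_qs(parsed.query)
    path = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    version = qs.get("version", [None])[0]
    return path, version

path, version = parse_versioned_location(
    "s3://my-bucket/my/path/to/file?version=123456")
# path    -> "s3://my-bucket/my/path/to/file"
# version -> "123456"
```

A FileIO that resolved locations this way would issue versioned reads against the storage layer, so a later overwrite of the same key could never be picked up by an existing snapshot.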
> https://iceberg.apache.org/spec/#manifests
>
> On Mon, May 17, 2021 at 9:39 AM Jack Ye <[email protected]> wrote:
>
>> I actually think there is an argument against that use case of returning an error after time t3. Maybe the user does want to change a row in a file directly and replace the file to get an updated result quickly, bypassing the Iceberg API. In that case, failing that query after t3 would block that use case. The statistics in the manifest might be wrong, but we can further argue that the user can directly modify statistics and replace files all the way up to the snapshot to make sure everything continues to work.
>>
>> In general, if a user decides to bypass the contract set by Iceberg, I believe that we should not predict the behavior and compensate the system for that behavior, because users can bypass the contract in all different ways; it would open the door to satisfying many awkward use cases and, in the end, break fundamental assumptions.
>>
>> In the case you described, I think the existing Iceberg behavior makes total sense. If you would like to achieve what you described later, you can potentially update your FileIO and leverage the versioning feature of the underlying storage to make sure that the file uploaded never has the same identifier, so that users cannot replace a file at t3. For example, if you are running on S3, you can enable S3 versioning and extend the S3FileIO so that each file path is not just the s3 path, but the s3 path + version.
>>
>> But this is just what I think; let's see how others reply.
>>
>> -Jack
>>
>> On Sun, May 16, 2021 at 8:52 PM Vivekanand Vellanki <[email protected]> wrote:
>>
>>> From an Iceberg perspective, I understand what you are saying.
>>>
>>> A lot of our customers add/remove files to the table using scripts.
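The S3-versioning behavior leaned on above can be modeled with a toy in-memory store. This is purely illustrative; real S3 versioning works through server-assigned object VersionIds, and this class is a hypothetical stand-in showing why a read pinned to a version survives an out-of-band overwrite:

```python
import itertools

class VersionedStore:
    """Toy model of S3-style versioning: every put gets a fresh version
    id, and reads pinned to a version are immune to later overwrites."""

    def __init__(self):
        self._versions = {}   # (path, version_id) -> bytes
        self._latest = {}     # path -> newest version_id
        self._counter = itertools.count(1)

    def put(self, path, data):
        vid = str(next(self._counter))
        self._versions[(path, vid)] = data
        self._latest[path] = vid
        return vid

    def get(self, path, version_id=None):
        vid = version_id or self._latest[path]
        return self._versions[(path, vid)]

store = VersionedStore()
v1 = store.put("s3://bucket/f1.parquet", b"original rows")
store.put("s3://bucket/f1.parquet", b"rewritten rows")  # bypassing Iceberg
# A manifest that pinned "path + version" still reads the original bytes:
assert store.get("s3://bucket/f1.parquet", v1) == b"original rows"
```

The point of the sketch is only the contract: once a snapshot records a (path, version) pair, replacing the object at that path cannot change what the snapshot reads.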
>>> The typical workflow would be:
>>> - Create Parquet files using other tools
>>> - Add these files to the Iceberg table
>>>
>>> Similarly for removing Parquet files from the table. I understand that Iceberg doesn't delete the data file until all snapshots that refer to that data file expire. However, the customer can delete the file directly - they might understand that a query on a snapshot will fail.
>>>
>>> I am concerned that an unintentional mistake in updating the Iceberg table results in incorrect results while querying an Iceberg snapshot. It is ok to return an error when a file referred to by a snapshot does not exist.
>>>
>>> This issue can be addressed by adding a version identifier (e.g. mtime) in the DataFile object and including this information in the manifest file. This ensures that snapshot reads are correct even when users make mistakes while adding/removing files to the table.
>>>
>>> We can work on this, if there is sufficient interest.
>>>
>>> On Sun, May 16, 2021 at 8:34 PM <[email protected]> wrote:
>>>
>>>> In the real system, each file would have a unique universal identifier. When Iceberg does a delete, it doesn't actually remove the file; it creates a new metadata file which no longer includes that file. When you attempt to access the table at time one, you are actually just reading the first metadata file, not the new metadata file which is missing the entry for the deleted file.
>>>>
>>>> The only way to end up in the scenario you describe is if you were manually deleting files and adding files using the Iceberg internal API and not something like Spark or Flink.
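The proposal above, recording a version identifier such as mtime alongside each data file, might look like the following sketch. The helper names are hypothetical, and Iceberg's DataFile carries no such field today; this only illustrates the validation the proposal would enable:

```python
import os
import tempfile

def record_entry(path):
    """Hypothetical manifest entry: file path plus its mtime at commit time."""
    return {"path": path, "mtime_ns": os.stat(path).st_mtime_ns}

def check_entry(entry):
    """Fail the read if the file on disk no longer matches the snapshot."""
    if os.stat(entry["path"]).st_mtime_ns != entry["mtime_ns"]:
        raise ValueError(f"{entry['path']} was replaced after the snapshot")

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"rows")
entry = record_entry(f.name)
check_entry(entry)            # unchanged file: read proceeds
os.utime(f.name, ns=(0, 0))   # simulate an out-of-band file replacement
# check_entry(entry) would now raise ValueError instead of silently
# returning data from the replaced file.
```

With this check, the t3 scenario in the thread fails loudly (stale entry) rather than returning potentially incorrect results.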
>>>> What actually happens is something like:
>>>>
>>>> T1: metadata says f1_uuid exists
>>>>
>>>> The data is deleted.
>>>> T2: metadata no longer lists f1
>>>>
>>>> New data is written.
>>>> T3: metadata says f3_uuid now exists
>>>>
>>>> Data files are only physically deleted by Iceberg through the expire snapshots command. This removes the snapshot metadata as well as any data files which are only referred to by the snapshots that are expired.
>>>>
>>>> If you are using the internal API (org.apache.iceberg.Table), then it is your responsibility not to perform operations or delete files that would violate the uniqueness of each snapshot. In this case, you would similarly solve the problem by just not physically deleting the file when you remove it, although usually having unique names every time you add data is a good safety measure.
>>>>
>>>> On May 16, 2021, at 4:53 AM, Vivekanand Vellanki <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I would like to understand if Iceberg supports the following scenario:
>>>>
>>>> - At time t1, there's a table with a file f1.parquet
>>>> - At time t2, f1.parquet is removed from the table. f1.parquet is also deleted from the filesystem
>>>> - Querying table@t1 results in errors since f1.parquet is no longer available in the filesystem
>>>> - At time t3, f1.parquet is recreated and added back to the table
>>>> - Querying table@t1 now results in potentially incorrect results since f1.parquet is now present in the filesystem
>>>>
>>>> Should there be a version identifier for each data-file in the manifest file to handle such scenarios?
>>>>
>>>> Thanks
>>>> Vivek
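The "unique names every time you add data" safety measure mentioned in the thread can be sketched as below. The helper is hypothetical, but embedding a UUID in each data file name is the general idea: a recreated f1.parquet would land at a new path and could never shadow the file an old snapshot points to:

```python
import uuid

def unique_data_file_name(prefix="data", ext="parquet"):
    """Embed a random UUID so a rewritten file can never reuse an old path."""
    return f"{prefix}-{uuid.uuid4()}.{ext}"

a = unique_data_file_name()
b = unique_data_file_name()
# Two writes of "the same" data never collide on path:
assert a != b
```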
