On Mon, May 17, 2021 at 10:20 AM Jack Ye <[email protected]> wrote:
> > Actually, the above is one of the reasons our customers overwrite
> > Parquet files. They discover that the Parquet file contains incorrect data
> > - they fix it by recreating a new Parquet file with the corrected data; and
> > replace the old version of the file with the new version of the file
> > bypassing Iceberg completely.
>
> If that is the case, I am a bit confused, isn't the current Iceberg
> behavior what they need? Do you want to prevent this from happening?
> Changing files without informing Iceberg will cause issues.

I am assuming that the Parquet/ORC readers will assume that the fileSize is
accurate and read the footer/tail assuming that to be true. If files are
modified, Iceberg should be informed.

My concern is that even when Iceberg is informed, snapshot reads will return
incorrect results. I believe this should result in an error, since snapshots
should be immutable and queries on snapshots should return the same results.

> > The manifest files do not have a provision for specifying the version of
> > a data file. So, it is not really possible to read "s3 path + version"
> > because the manifest files do not have the version information for the
> > data file.
>
> What I mean is that you can define your file location as s3 path +
> version, for example: s3://my-bucket/my/path/to/file?version=123456. If
> this value is recorded as the location in DataFile and your FileIO can
> parse this location, then there is no way to replace the file in that given
> location.
>
> > Part of the encryption proposal does include adding a data validation
> > checksum or the like.
>
> Yes, Russell made a good point, the proposed encryption support would make
> the process of replacing a file very hard if authenticated encryption is
> enabled (although still possible to decrypt and reuse the same AAD if the
> user is a super user).
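[Editor's note: Jack's suggestion of encoding the object version into the recorded file location could be sketched roughly as follows. This is only an illustration of the idea; the `?version=` convention and the helper name are hypothetical, not part of Iceberg's S3FileIO.]

```python
from urllib.parse import urlparse, parse_qs

def parse_versioned_location(location):
    """Split a hypothetical versioned location such as
    s3://my-bucket/my/path/to/file?version=123456 into
    (bucket, key, version_id). A FileIO that understands this form
    could always fetch that exact object version, so a file replaced
    at the same path would never be seen by older snapshots."""
    parsed = urlparse(location)
    version = parse_qs(parsed.query).get("version", [None])[0]
    return parsed.netloc, parsed.path.lstrip("/"), version
```

[A custom FileIO could then pass the parsed version as the object version of an S3 GET (S3 versioning supports reading a specific version), so the same snapshot always resolves to the same bytes.]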
> You can read more about it here
> <https://docs.google.com/document/d/1kkcjr9KrlB9QagRX3ToulG_Rf-65NMSlVANheDNzJq4>

Thanks, I will take a look at the proposal. I am looking for something that
is supported by Iceberg natively.

> -Jack
>
> On Sun, May 16, 2021 at 9:38 PM <[email protected]> wrote:
>
>> Part of the encryption proposal does include adding a data validation
>> checksum or the like. That could potentially work to prevent corrupted
>> requests, since you would get an error that the file does not match the
>> checksum, but would still end up with lots of weird errors if users are
>> replacing files and deleting things outside of the Iceberg API.
>>
>> Sent from my iPhone
>>
>> On May 16, 2021, at 11:23 PM, Vivekanand Vellanki <[email protected]>
>> wrote:
>>
>> The manifest files do not have a provision for specifying the version of
>> a data file. So, it is not really possible to read "s3 path + version"
>> because the manifest files do not have the version information for the
>> data file.
>>
>> https://iceberg.apache.org/spec/#manifests
>>
>> On Mon, May 17, 2021 at 9:39 AM Jack Ye <[email protected]> wrote:
>>
>>> I actually think there is an argument against that use case of returning
>>> an error after time t3. Maybe the user does want to change a row in a file
>>> directly and replace the file to get an updated result quickly, bypassing
>>> the Iceberg API. In that case, failing that query after t3 would block that
>>> use case. The statistics in the manifest might be wrong, but we can further
>>> argue that the user can directly modify statistics and replace files all
>>> the way up to the snapshot to make sure everything continues to work.
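[Editor's note: the data-validation checksum Russell mentions could behave roughly like this sketch; this is my own illustration, not the encryption proposal's actual design. The idea: record a digest of the file at commit time and refuse to read a file whose bytes no longer match.]

```python
import hashlib

def file_sha256(path):
    """Digest of the file's current bytes, streamed in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def checked_open(path, recorded_digest):
    """Error out, instead of silently returning different rows, when
    the file at this path no longer matches the committed digest."""
    if file_sha256(path) != recorded_digest:
        raise ValueError(f"{path} does not match its recorded checksum")
    return open(path, "rb")
```

[As Russell notes, this turns a replaced file into a hard error rather than a wrong answer, but users replacing files outside the Iceberg API would then see checksum failures everywhere.]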
>>>
>>> In general, if a user decides to bypass the contract set by Iceberg, I
>>> believe that we should not predict the behavior and compensate the system
>>> for that behavior, because users can bypass the contract in all different
>>> ways, and it would open the door to satisfying many awkward use cases and,
>>> in the end, break fundamental assumptions.
>>>
>>> In the case you described, I think the existing Iceberg behavior makes
>>> total sense. If you would like to achieve what you described later, you can
>>> potentially update your FileIO and leverage the versioning feature of the
>>> underlying storage to make sure that the file uploaded never has the same
>>> identifier, so that users cannot replace a file at t3. For example, if you
>>> are running on S3, you can enable S3 versioning and extend S3FileIO so
>>> that each file path is not just the s3 path, but the s3 path + version.
>>>
>>> But this is just what I think, let's see how others reply.
>>>
>>> -Jack
>>>
>>> On Sun, May 16, 2021 at 8:52 PM Vivekanand Vellanki <[email protected]>
>>> wrote:
>>>
>>>> From an Iceberg perspective, I understand what you are saying.
>>>>
>>>> A lot of our customers add/remove files to the table using scripts. The
>>>> typical workflow would be:
>>>> - Create Parquet files using other tools
>>>> - Add these files to the Iceberg table
>>>>
>>>> Similarly, for removing Parquet files from the table. I understand that
>>>> Iceberg doesn't delete the data file until all snapshots that refer to that
>>>> data file expire. However, the customer can delete the file directly - they
>>>> might understand that a query on a snapshot will fail.
>>>>
>>>> I am concerned that an unintentional mistake in updating the Iceberg
>>>> table results in incorrect results while querying an Iceberg snapshot. It
>>>> is ok to return an error when a file referred to by a snapshot does not
>>>> exist.
>>>>
>>>> This issue can be addressed by adding a version identifier (e.g. mtime)
>>>> in the DataFile object and including this information in the manifest file.
>>>> This ensures that snapshot reads are correct even when users make mistakes
>>>> while adding/removing files to the table.
>>>>
>>>> We can work on this, if there is sufficient interest.
>>>>
>>>> On Sun, May 16, 2021 at 8:34 PM <[email protected]> wrote:
>>>>
>>>>> In the real system, each file would have a unique universal identifier.
>>>>> When Iceberg does a delete, it doesn't actually remove the file; it
>>>>> creates a new meta-data file which no longer includes that file. When you
>>>>> attempt to access the table at time t1, you are actually just reading the
>>>>> first meta-data file, not the new meta-data file, which is missing the
>>>>> entry for the deleted file.
>>>>>
>>>>> The only way to end up in the scenario you describe is if you were
>>>>> manually deleting files and adding files using the Iceberg internal API
>>>>> and not something like Spark or Flink.
>>>>>
>>>>> What actually happens is something like:
>>>>> T1: metadata says f1_uuid exists
>>>>>
>>>>> The data is deleted
>>>>> T2: metadata no longer lists f1
>>>>>
>>>>> New data is written
>>>>> T3: metadata says f3_uuid now exists
>>>>>
>>>>> Data files are only physically deleted by Iceberg through the expire
>>>>> snapshots command. This removes the snapshot meta-data as well as any data
>>>>> files which are only referred to by those snapshots that are expired.
>>>>>
>>>>> If you are using the internal API (org.apache.iceberg.Table), then it
>>>>> is your responsibility to not perform operations or delete files that
>>>>> would violate the uniqueness of each snapshot. In this case, you would
>>>>> similarly solve the problem by just not physically deleting the file when
>>>>> you remove it. Although usually, having unique names every time you add
>>>>> data is a good safety measure.
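[Editor's note: Vivekanand's proposal above, recording a version identifier such as mtime per DataFile in the manifest, could be sketched as below. The function names are hypothetical; a real change would extend the manifest schema rather than store values on the side.]

```python
import os

def record_file_version(path):
    """What a commit would store alongside the DataFile entry: a cheap
    version identifier, here the modification time in nanoseconds."""
    return os.stat(path).st_mtime_ns

def check_file_version(path, recorded_mtime_ns):
    """What a snapshot read would verify before trusting the bytes:
    raise if the file was replaced after the snapshot was committed."""
    if os.stat(path).st_mtime_ns != recorded_mtime_ns:
        raise RuntimeError(f"{path} changed since the snapshot was written")
```

[Unlike a checksum, mtime is cheap to capture at commit time, though it can be forged or reset; an object-store ETag or version ID would be a stronger identifier with the same shape.]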
>>>>>
>>>>> On May 16, 2021, at 4:53 AM, Vivekanand Vellanki <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I would like to understand if Iceberg supports the following scenario:
>>>>>
>>>>> - At time t1, there's a table with a file f1.parquet
>>>>> - At time t2, f1.parquet is removed from the table. f1.parquet is
>>>>> also deleted from the filesystem
>>>>> - Querying table@t1 results in errors since f1.parquet is no
>>>>> longer available in the filesystem
>>>>> - At time t3, f1.parquet is recreated and added back to the table
>>>>> - Querying table@t1 now results in potentially incorrect results
>>>>> since f1.parquet is now present in the filesystem
>>>>>
>>>>> Should there be a version identifier for each data-file in the
>>>>> manifest file to handle such scenarios?
>>>>>
>>>>> Thanks
>>>>> Vivek
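[Editor's note: the t1-t3 timeline above can be reproduced with a minimal in-memory model, purely illustrative, that shows why the re-created file is silently picked up: the t1 snapshot metadata still lists the path, and a read just resolves paths against whatever storage holds now.]

```python
filesystem = {}   # path -> bytes, standing in for S3/HDFS
snapshots = {}    # snapshot id -> frozen set of data file paths

# t1: the table contains f1.parquet
filesystem["f1.parquet"] = b"original rows"
snapshots["t1"] = frozenset({"f1.parquet"})

# t2: f1.parquet is removed from the table AND deleted from storage;
# reading snapshot t1 at this point would fail: the path no longer resolves
snapshots["t2"] = frozenset()
del filesystem["f1.parquet"]

# t3: a different f1.parquet is recreated at the same path
filesystem["f1.parquet"] = b"different rows"
snapshots["t3"] = frozenset({"f1.parquet"})

def read_snapshot(snapshot_id):
    """A snapshot read pins paths, not contents: it resolves the paths
    listed in the snapshot's metadata against current storage."""
    return {path: filesystem[path] for path in snapshots[snapshot_id]}
```

[At t3, reading snapshot t1 succeeds but returns the new bytes, which is exactly the silent wrong answer the question describes; a per-file version identifier or checksum in the manifest would turn it into an error instead.]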
