Fokko commented on PR #363:
URL: https://github.com/apache/iceberg-python/pull/363#issuecomment-2209460864

   I did some testing with `avro-tools`, asserting the table state after 5 append operations with `"commit.manifest.min-count-to-merge": "2"`.
   
   # V1 Table
   
   ## Manifest-list
   
   ### 5th manifest-list
   
   ```json
   {
       "manifest_path": "/tmp/some.db/table/metadata/80ba9f84-99af-4af1-b8f5-4caa254645c2-m1.avro",
       "manifest_length": 6878,
       "partition_spec_id": 0,
       "content": 0,
       "sequence_number": 5,
       "min_sequence_number": 1,
       "added_snapshot_id": 6508090689697406000,
       "added_files_count": 1,
       "existing_files_count": 4,
       "deleted_files_count": 0,
       "added_rows_count": 3,
       "existing_rows_count": 12,
       "deleted_rows_count": 0,
       "partitions": {
           "array": []
       },
       "key_metadata": null
   }
   ```
   
   ### 4th manifest-list
   
   ```json
   {
       "manifest_path": "/tmp/some.db/table/metadata/88807344-0e23-413c-827e-2a9ec63c6233-m1.avro",
       "manifest_length": 6436,
       "partition_spec_id": 0,
       "content": 0,
       "sequence_number": 4,
       "min_sequence_number": 1,
       "added_snapshot_id": 3455109142449701000,
       "added_files_count": 1,
       "existing_files_count": 3,
       "deleted_files_count": 0,
       "added_rows_count": 3,
       "existing_rows_count": 9,
       "deleted_rows_count": 0,
       "partitions": {
           "array": []
       },
       "key_metadata": null
   }
   ```
   
   
   ## Manifests
   
   The manifest referenced by the 5th manifest-list contains 5 entries, as expected:
   
   ```
   avro-tools tojson /tmp/some.db/table/metadata/80ba9f84-99af-4af1-b8f5-4caa254645c2-m1.avro | wc -l
          5
   ```
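
The count can be cross-checked against the manifest-list entry shown earlier: `added_files_count + existing_files_count + deleted_files_count` should equal the number of records in the manifest. A minimal Python sketch, using values copied from the output above (the helper name `expected_entry_count` is made up for illustration and is not part of PyIceberg):

```python
# Values copied from the 5th manifest-list entry of the V1 table.
manifest_list_entry = {
    "added_files_count": 1,
    "existing_files_count": 4,
    "deleted_files_count": 0,
}


def expected_entry_count(entry: dict) -> int:
    """Number of records a manifest should hold according to its manifest-list entry."""
    return (
        entry["added_files_count"]
        + entry["existing_files_count"]
        + entry["deleted_files_count"]
    )


# `wc -l` on the avro-tools output reported 5 records for this manifest.
observed_entry_count = 5

assert expected_entry_count(manifest_list_entry) == observed_entry_count
print(expected_entry_count(manifest_list_entry))
```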
   
   ### Last one:
   
   ```json
   {
       "status": 1,
       "snapshot_id": {
           "long": 6508090689697406000
       },
       "data_sequence_number": null,
       "file_sequence_number": null,
       "data_file": {
           "content": 0,
           "file_path": "/tmp/some.db/table/data/00000-0-80ba9f84-99af-4af1-b8f5-4caa254645c2.parquet",
           "file_format": "PARQUET",
           "partition": {},
           "record_count": 3,
           "file_size_in_bytes": 5459,
           "column_sizes": { ... },
           "value_counts": { ... },
           "null_value_counts": { ... },
           "nan_value_counts": { ... },
           "lower_bounds": { ... },
           "upper_bounds": { ... },
           "key_metadata": null,
           "split_offsets": {
               "array": [
                   4
               ]
           },
           "equality_ids": null,
           "sort_order_id": null
       }
   }
   ```
   
   ### First one:
   
   ```json
   {
       "status": 0,
       "snapshot_id": {
           "long": 6508090689697406000
       },
       "data_sequence_number": {
           "long": 1
       },
       "file_sequence_number": {
           "long": 1
       },
       "data_file": {
           "content": 0,
           "file_path": "/tmp/some.db/table/data/00000-0-bbd4029c-510a-48e6-a905-ab5b69a832e8.parquet",
           "file_format": "PARQUET",
           "partition": {},
           "record_count": 3,
           "file_size_in_bytes": 5459,
           "column_sizes": { ... },
           "value_counts": { ... },
           "null_value_counts": { ... },
           "nan_value_counts": { ... },
           "lower_bounds": { ... },
           "upper_bounds": { ... },
           "key_metadata": null,
           "split_offsets": {
               "array": [
                   4
               ]
           },
           "equality_ids": null,
           "sort_order_id": null
       }
   }
   ```
   
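   This looks good, except for one thing: the `snapshot_id` of the existing entry (status `0`) is off. From the spec:
   
   > Snapshot id where the file was added, or deleted if status is 2. Inherited when null.
   
   For the first entry, this should be the ID of the snapshot created by the first append operation, not the ID of the latest snapshot (`6508090689697406000`).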
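
To make the expected behaviour concrete, here is a minimal sketch of how an entry might be carried over into a merged manifest. It is not the PyIceberg implementation: `ManifestEntry`, `carry_over` and the snapshot ID `1111` are hypothetical stand-ins. Only the status codes (0 = existing, 1 = added, 2 = deleted) and the "inherited when null" rule come from the spec.

```python
from dataclasses import dataclass, replace
from typing import Optional

EXISTING, ADDED, DELETED = 0, 1, 2  # manifest entry status codes from the spec


@dataclass(frozen=True)
class ManifestEntry:  # hypothetical, simplified stand-in for a manifest entry
    status: int
    snapshot_id: Optional[int]
    file_path: str


def carry_over(entry: ManifestEntry, inherited_snapshot_id: int) -> ManifestEntry:
    """Rewrite a live entry into a merged manifest as EXISTING.

    The snapshot_id keeps pointing at the snapshot that originally added the
    file. If it was null in the source manifest, it is first resolved to the
    value inherited from that manifest (its added_snapshot_id).
    """
    if entry.status == DELETED:
        raise ValueError("deleted entries are not carried over in this sketch")
    snapshot_id = entry.snapshot_id if entry.snapshot_id is not None else inherited_snapshot_id
    return replace(entry, status=EXISTING, snapshot_id=snapshot_id)


# 1111 is a made-up ID standing in for the snapshot of the first append.
first_append = ManifestEntry(
    status=ADDED,
    snapshot_id=1111,
    file_path="/tmp/some.db/table/data/00000-0-bbd4029c-510a-48e6-a905-ab5b69a832e8.parquet",
)
merged = carry_over(first_append, inherited_snapshot_id=1111)
print(merged.status, merged.snapshot_id)
```

Under these assumptions the merged entry has status `0` but still carries the first append's snapshot ID, rather than the ID of the snapshot that performed the merge.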
   
   # V2 Table
   
   ## Manifest-list
   
   ### 5th manifest-list
   
   ```json
   {
       "manifest_path": "/tmp/some.db/tablev2/metadata/93717a88-1cea-4e3d-a69a-00ce3d087822-m1.avro",
       "manifest_length": 6883,
       "partition_spec_id": 0,
       "content": 0,
       "sequence_number": 5,
       "min_sequence_number": 1,
       "added_snapshot_id": 898025966831056900,
       "added_files_count": 1,
       "existing_files_count": 4,
       "deleted_files_count": 0,
       "added_rows_count": 3,
       "existing_rows_count": 12,
       "deleted_rows_count": 0,
       "partitions": {
           "array": []
       },
       "key_metadata": null
   }
   ```
   
   ### 4th manifest-list
   
   ```json
   {
       "manifest_path": "/tmp/some.db/tablev2/metadata/5c64a07c-4b8a-4be1-a751-d4fd339560e2-m0.avro",
       "manifest_length": 5127,
       "partition_spec_id": 0,
       "content": 0,
       "sequence_number": 1,
       "min_sequence_number": 1,
       "added_snapshot_id": 1343032504684197000,
       "added_files_count": 1,
       "existing_files_count": 0,
       "deleted_files_count": 0,
       "added_rows_count": 3,
       "existing_rows_count": 0,
       "deleted_rows_count": 0,
       "partitions": {
           "array": []
       },
       "key_metadata": null
   }
   ```
   
   ## Manifests
   
   ### Last manifest in manifest-list
   
   ```json
   {
       "status": 1,
       "snapshot_id": {
           "long": 898025966831056900
       },
       "data_sequence_number": null,
       "file_sequence_number": null,
       "data_file": {
           "content": 0,
           "file_path": "/tmp/some.db/tablev2/data/00000-0-93717a88-1cea-4e3d-a69a-00ce3d087822.parquet",
           "file_format": "PARQUET",
           "partition": {},
           "record_count": 3,
           "file_size_in_bytes": 5459,
           "column_sizes": { ... },
           "value_counts": { ... },
           "null_value_counts": { ... },
           "nan_value_counts": { ... },
           "lower_bounds": { ... },
           "upper_bounds": { ... },
           "key_metadata": null,
           "split_offsets": {
               "array": [
                   4
               ]
           },
           "equality_ids": null,
           "sort_order_id": null
       }
   }
   ```
   
   ### First manifest in manifest-list
   
   ```json
   {
       "status": 0,
       "snapshot_id": {
           "long": 898025966831056900
       },
       "data_sequence_number": {
           "long": 1
       },
       "file_sequence_number": {
           "long": 1
       },
       "data_file": {
           "content": 0,
           "file_path": "/tmp/some.db/tablev2/data/00000-0-5c64a07c-4b8a-4be1-a751-d4fd339560e2.parquet",
           "file_format": "PARQUET",
           "partition": {},
           "record_count": 3,
           "file_size_in_bytes": 5459,
           "column_sizes": { ... },
           "value_counts": { ... },
           "null_value_counts": { ... },
           "nan_value_counts": { ... },
           "lower_bounds": { ... },
           "upper_bounds": { ... },
           "key_metadata": null,
           "split_offsets": {
               "array": [
                   4
               ]
           },
           "equality_ids": null,
           "sort_order_id": null
       }
   }
   ```
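
The `snapshot_id` problem described for the V1 table can be flagged mechanically: an existing entry (status `0`) was added by an earlier snapshot, so its `snapshot_id` would normally differ from the `added_snapshot_id` of the manifest that now holds it. A rough sketch on the V2 values shown above, where `suspicious_entries` is a hypothetical helper operating on plain dicts shaped like the `avro-tools` JSON (trimmed to the fields it needs):

```python
def suspicious_entries(added_snapshot_id: int, entries: list) -> list:
    """Return file paths of EXISTING entries stamped with the manifest's own snapshot ID."""
    flagged = []
    for entry in entries:
        snapshot = entry["snapshot_id"]
        # avro-tools renders the nullable long as {"long": <value>} or null.
        snapshot_id = None if snapshot is None else snapshot["long"]
        if entry["status"] == 0 and snapshot_id == added_snapshot_id:
            flagged.append(entry["data_file"]["file_path"])
    return flagged


# Trimmed copies of the two V2 manifest entries shown above.
entries = [
    {
        "status": 1,
        "snapshot_id": {"long": 898025966831056900},
        "data_file": {
            "file_path": "/tmp/some.db/tablev2/data/00000-0-93717a88-1cea-4e3d-a69a-00ce3d087822.parquet"
        },
    },
    {
        "status": 0,
        "snapshot_id": {"long": 898025966831056900},
        "data_file": {
            "file_path": "/tmp/some.db/tablev2/data/00000-0-5c64a07c-4b8a-4be1-a751-d4fd339560e2.parquet"
        },
    },
]

# 898025966831056900 is the added_snapshot_id of the 5th manifest-list entry.
flagged = suspicious_entries(898025966831056900, entries)
print(flagged)
```

With these inputs only the existing entry (the `5c64a07c` data file) is flagged; the added entry legitimately carries the current snapshot ID.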
   
   Except for the `snapshot_id` issue and https://github.com/apache/iceberg-python/issues/893, this looks great! 🥳
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

