Fokko commented on PR #363:
URL: https://github.com/apache/iceberg-python/pull/363#issuecomment-2209460864
I did some testing with `avro-tools`, checking the state after 5 append
operations with `"commit.manifest.min-count-to-merge": "2"`.
# V1 Table
## Manifest-list
### 5th manifest-list
```json
{
"manifest_path": "/tmp/some.db/table/metadata/80ba9f84-99af-4af1-b8f5-4caa254645c2-m1.avro",
"manifest_length": 6878,
"partition_spec_id": 0,
"content": 0,
"sequence_number": 5,
"min_sequence_number": 1,
"added_snapshot_id": 6508090689697406000,
"added_files_count": 1,
"existing_files_count": 4,
"deleted_files_count": 0,
"added_rows_count": 3,
"existing_rows_count": 12,
"deleted_rows_count": 0,
"partitions": {
"array": []
},
"key_metadata": null
}
```
### 4th manifest-list
```json
{
"manifest_path": "/tmp/some.db/table/metadata/88807344-0e23-413c-827e-2a9ec63c6233-m1.avro",
"manifest_length": 6436,
"partition_spec_id": 0,
"content": 0,
"sequence_number": 4,
"min_sequence_number": 1,
"added_snapshot_id": 3455109142449701000,
"added_files_count": 1,
"existing_files_count": 3,
"deleted_files_count": 0,
"added_rows_count": 3,
"existing_rows_count": 9,
"deleted_rows_count": 0,
"partitions": {
"array": []
},
"key_metadata": null
}
```
## Manifests
The merged manifest holds 5 entries, as expected:
```
avro-tools tojson /tmp/some.db/table/metadata/80ba9f84-99af-4af1-b8f5-4caa254645c2-m1.avro | wc -l
5
```
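The `wc -l` count can be taken one step further by tallying the entries per status and comparing them with the `*_files_count` fields of the manifest-list record. The following is only a rough sketch: the helper names `count_by_status` and `matches_manifest_list` are made up for illustration and are not part of PyIceberg, and it assumes the `avro-tools tojson` output has one JSON record per line.

```python
import json
from collections import Counter

# Status codes from the Iceberg spec: 0 = EXISTING, 1 = ADDED, 2 = DELETED.
STATUS_NAMES = {0: "existing", 1: "added", 2: "deleted"}


def count_by_status(json_lines):
    """Tally manifest entries per status (hypothetical helper).

    `json_lines` is an iterable of strings, one JSON record per line,
    as produced by `avro-tools tojson <manifest>.avro`.
    """
    counts = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        counts[STATUS_NAMES[entry["status"]]] += 1
    return counts


def matches_manifest_list(counts, manifest_file):
    """Check the tallies against a manifest-list record (hypothetical helper)."""
    return all(
        counts[name] == manifest_file[f"{name}_files_count"]
        for name in STATUS_NAMES.values()
    )


if __name__ == "__main__":
    # Stub records standing in for the real tojson output: 4 existing, 1 added.
    lines = ['{"status": 0}'] * 4 + ['{"status": 1}']
    counts = count_by_status(lines)
    manifest_file = {
        "added_files_count": 1,
        "existing_files_count": 4,
        "deleted_files_count": 0,
    }
    print(dict(counts), matches_manifest_list(counts, manifest_file))
```

With the real output, the lines could be piped in via `sys.stdin` instead of the stub list.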
### Last one:
```json
{
"status": 1,
"snapshot_id": {
"long": 6508090689697406000
},
"data_sequence_number": null,
"file_sequence_number": null,
"data_file": {
"content": 0,
"file_path": "/tmp/some.db/table/data/00000-0-80ba9f84-99af-4af1-b8f5-4caa254645c2.parquet",
"file_format": "PARQUET",
"partition": {},
"record_count": 3,
"file_size_in_bytes": 5459,
"column_sizes": { ... },
"value_counts": { ... },
"null_value_counts": { ... },
"nan_value_counts": { ... },
"lower_bounds": { ... },
"upper_bounds": { ... },
"key_metadata": null,
"split_offsets": {
"array": [
4
]
},
"equality_ids": null,
"sort_order_id": null
}
}
```
### First one:
```json
{
"status": 0,
"snapshot_id": {
"long": 6508090689697406000
},
"data_sequence_number": {
"long": 1
},
"file_sequence_number": {
"long": 1
},
"data_file": {
"content": 0,
"file_path": "/tmp/some.db/table/data/00000-0-bbd4029c-510a-48e6-a905-ab5b69a832e8.parquet",
"file_format": "PARQUET",
"partition": {},
"record_count": 3,
"file_size_in_bytes": 5459,
"column_sizes": { ... },
"value_counts": { ... },
"null_value_counts": { ... },
"nan_value_counts": { ... },
"lower_bounds": { ... },
"upper_bounds": { ... },
"key_metadata": null,
"split_offsets": {
"array": [
4
]
},
"equality_ids": null,
"sort_order_id": null
}
}
```
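This looks good, except for one thing: the `snapshot_id` of the existing entry
is off. From the spec:

> Snapshot id where the file was added, or deleted if status is 2. Inherited
> when null.

Since this file was added by the first append, its `snapshot_id` should be the
ID of that first append operation, not the ID of the latest snapshot.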
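The `snapshot_id` problem described above can also be spotted mechanically. This is a minimal sketch under one assumption: an entry with status 0 (existing) was added by an earlier snapshot, so it should not carry the `added_snapshot_id` of the manifest it was rewritten into. The function name `suspicious_existing_entries` is hypothetical and not part of PyIceberg.

```python
def suspicious_existing_entries(entries, added_snapshot_id):
    """Return file paths of EXISTING entries stamped with the manifest's own snapshot id.

    `entries` are manifest entries decoded from `avro-tools tojson` output;
    `added_snapshot_id` comes from the matching manifest-list record.
    """
    flagged = []
    for entry in entries:
        snapshot = entry.get("snapshot_id")
        # avro-tools renders a nullable long as {"long": value} or null.
        snapshot_id = snapshot["long"] if isinstance(snapshot, dict) else snapshot
        if entry["status"] == 0 and snapshot_id == added_snapshot_id:
            flagged.append(entry["data_file"]["file_path"])
    return flagged


if __name__ == "__main__":
    # Trimmed-down versions of the first and last V1 entries shown above.
    entries = [
        {
            "status": 0,
            "snapshot_id": {"long": 6508090689697406000},
            "data_file": {
                "file_path": "/tmp/some.db/table/data/00000-0-bbd4029c-510a-48e6-a905-ab5b69a832e8.parquet"
            },
        },
        {
            "status": 1,
            "snapshot_id": {"long": 6508090689697406000},
            "data_file": {
                "file_path": "/tmp/some.db/table/data/00000-0-80ba9f84-99af-4af1-b8f5-4caa254645c2.parquet"
            },
        },
    ]
    for path in suspicious_existing_entries(entries, 6508090689697406000):
        print("existing entry carries the new snapshot id:", path)
```

Only the existing entry is flagged here; the added entry legitimately carries the ID of the snapshot that wrote the manifest.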
# V2 Table
## Manifest-list
### 5th manifest-list
```json
{
"manifest_path": "/tmp/some.db/tablev2/metadata/93717a88-1cea-4e3d-a69a-00ce3d087822-m1.avro",
"manifest_length": 6883,
"partition_spec_id": 0,
"content": 0,
"sequence_number": 5,
"min_sequence_number": 1,
"added_snapshot_id": 898025966831056900,
"added_files_count": 1,
"existing_files_count": 4,
"deleted_files_count": 0,
"added_rows_count": 3,
"existing_rows_count": 12,
"deleted_rows_count": 0,
"partitions": {
"array": []
},
"key_metadata": null
}
```
### 4th manifest-list
```json
{
"manifest_path": "/tmp/some.db/tablev2/metadata/5c64a07c-4b8a-4be1-a751-d4fd339560e2-m0.avro",
"manifest_length": 5127,
"partition_spec_id": 0,
"content": 0,
"sequence_number": 1,
"min_sequence_number": 1,
"added_snapshot_id": 1343032504684197000,
"added_files_count": 1,
"existing_files_count": 0,
"deleted_files_count": 0,
"added_rows_count": 3,
"existing_rows_count": 0,
"deleted_rows_count": 0,
"partitions": {
"array": []
},
"key_metadata": null
}
```
## Manifests
### Last manifest file in manifest-list
```json
{
"status": 1,
"snapshot_id": {
"long": 898025966831056900
},
"data_sequence_number": null,
"file_sequence_number": null,
"data_file": {
"content": 0,
"file_path": "/tmp/some.db/tablev2/data/00000-0-93717a88-1cea-4e3d-a69a-00ce3d087822.parquet",
"file_format": "PARQUET",
"partition": {},
"record_count": 3,
"file_size_in_bytes": 5459,
"column_sizes": { ... },
"value_counts": { ... },
"null_value_counts": { ... },
"nan_value_counts": { ... },
"lower_bounds": { ... },
"upper_bounds": { ... },
"key_metadata": null,
"split_offsets": {
"array": [
4
]
},
"equality_ids": null,
"sort_order_id": null
}
}
```
### First manifest in manifest-list
```json
{
"status": 0,
"snapshot_id": {
"long": 898025966831056900
},
"data_sequence_number": {
"long": 1
},
"file_sequence_number": {
"long": 1
},
"data_file": {
"content": 0,
"file_path": "/tmp/some.db/tablev2/data/00000-0-5c64a07c-4b8a-4be1-a751-d4fd339560e2.parquet",
"file_format": "PARQUET",
"partition": {},
"record_count": 3,
"file_size_in_bytes": 5459,
"column_sizes": { ... },
"value_counts": { ... },
"null_value_counts": { ... },
"nan_value_counts": { ... },
"lower_bounds": { ... },
"upper_bounds": { ... },
"key_metadata": null,
"split_offsets": {
"array": [
4
]
},
"equality_ids": null,
"sort_order_id": null
}
}
```
Except for the snapshot-id and
https://github.com/apache/iceberg-python/issues/893, this looks great! 🥳
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]