Fokko opened a new issue, #7463: URL: https://github.com/apache/iceberg/issues/7463
### Apache Iceberg version

1.2.1 (latest release)

### Query engine

Spark

### Please describe the bug 🐞

First, set the number of shuffle partitions to one, to avoid ending up with single-record files (with merge-on-read, a file is simply dropped once all of its rows are marked as deleted):

```sql
SET spark.sql.shuffle.partitions=1;
```

Create a table:

```sql
create table default.t1 (
  foo string,
  bar int,
  dt timestamp
) TBLPROPERTIES (
  'format-version'='2',
  'write.delete.mode'='merge-on-read',
  'write.update.mode'='merge-on-read',
  'write.merge.mode'='merge-on-read'
) PARTITIONED BY (days(dt));
```

Insert some data:

```sql
INSERT INTO default.t1 VALUES ('a', 1, now()), ('b', 2, now()), ('c', 3, now());
```

A `SELECT` returns:

foo | bar | dt
-- | -- | --
a | 1 | 2023-04-28 13:51:28.522796
b | 2 | 2023-04-28 13:51:28.522817
c | 3 | 2023-04-28 13:51:28.522837

Delete one row:

```sql
DELETE FROM default.t1 WHERE foo = 'b'
```

Look at the files:

```sql
SELECT * FROM default.t1.files
```

We see one data file and one delete file:

content | file_path | file_format | spec_id | partition | record_count | file_size_in_bytes | column_sizes | value_counts | null_value_counts | nan_value_counts | lower_bounds | upper_bounds | key_metadata | split_offsets | equality_ids | sort_order_id | readable_metrics
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
0 | s3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2424-8d75df2c-4639-49a4-82e6-1772365a9355-00001.parquet | PARQUET | 0 | Row(dt_day=datetime.date(2023, 4, 28)) | 3 | 897 | {1: 54, 2: 52, 3: 56} | {1: 3, 2: 3, 3: 3} | {1: 0, 2: 0, 3: 0} | {} | {1: bytearray(b'a'), 2: bytearray(b'\x01\x00\x00\x00'), 3: bytearray(b',\x92\x12\xc5e\xfa\x05\x00')} | {1: bytearray(b'c'), 2: bytearray(b'\x03\x00\x00\x00'), 3: bytearray(b'U\x92\x12\xc5e\xfa\x05\x00')} | None | [4] | None | 0 | Row(bar=Row(column_size=52, value_count=3, null_value_count=0, nan_value_count=None, lower_bound=1, upper_bound=3), dt=Row(column_size=56, value_count=3, null_value_count=0, nan_value_count=None, lower_bound=datetime.datetime(2023, 4, 28, 13, 51, 28, 522796), upper_bound=datetime.datetime(2023, 4, 28, 13, 51, 28, 522837)), foo=Row(column_size=54, value_count=3, null_value_count=0, nan_value_count=None, lower_bound='a', upper_bound='c'))
1 | s3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2428-3743fe19-d841-4d1c-b940-bd73692b29c7-00001-deletes.parquet | PARQUET | 0 | Row(dt_day=datetime.date(2023, 4, 28)) | 1 | 1490 | {2147483546: 142, 2147483545: 46} | {2147483546: 1, 2147483545: 1} | {2147483546: 0, 2147483545: 0} | {} | {2147483546: bytearray(b's3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2424-8d75df2c-4639-49a4-82e6-1772365a9355-00001.parquet'), 2147483545: bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00')} | {2147483546: bytearray(b's3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2424-8d75df2c-4639-49a4-82e6-1772365a9355-00001.parquet'), 2147483545: bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00')} | None | None | None | None | Row(bar=Row(column_size=None, value_count=None, null_value_count=None, nan_value_count=None, lower_bound=None, upper_bound=None), dt=Row(column_size=None, value_count=None, null_value_count=None, nan_value_count=None, lower_bound=None, upper_bound=None), foo=Row(column_size=None, value_count=None, null_value_count=None, nan_value_count=None, lower_bound=None, upper_bound=None))

I don't like delete files, they slow down my queries, so let's clean them up:

```sql
CALL system.rewrite_data_files(
  table => 'default.t1',
  options => map('delete-file-threshold', '1')
)
```

Let's look at the files again:

```sql
SELECT * FROM default.t1.files
```

This is incorrect: we still get the delete file, even though the data file has been rewritten (the value counts show it now contains two rows):

content | file_path | file_format | spec_id | partition | record_count | file_size_in_bytes | column_sizes | value_counts | null_value_counts | nan_value_counts | lower_bounds | upper_bounds | key_metadata | split_offsets | equality_ids | sort_order_id | readable_metrics
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
0 | s3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2433-3108a2a5-b3f3-4c16-849b-4c54604584f5-00001.parquet | PARQUET | 0 | Row(dt_day=datetime.date(2023, 4, 28)) | 2 | 909 | {1: 55, 2: 53, 3: 60} | {1: 2, 2: 2, 3: 2} | {1: 0, 2: 0, 3: 0} | {} | {1: bytearray(b'a'), 2: bytearray(b'\x01\x00\x00\x00'), 3: bytearray(b',\x92\x12\xc5e\xfa\x05\x00')} | {1: bytearray(b'c'), 2: bytearray(b'\x03\x00\x00\x00'), 3: bytearray(b'U\x92\x12\xc5e\xfa\x05\x00')} | None | [4] | None | 0 | Row(bar=Row(column_size=53, value_count=2, null_value_count=0, nan_value_count=None, lower_bound=1, upper_bound=3), dt=Row(column_size=60, value_count=2, null_value_count=0, nan_value_count=None, lower_bound=datetime.datetime(2023, 4, 28, 13, 51, 28, 522796), upper_bound=datetime.datetime(2023, 4, 28, 13, 51, 28, 522837)), foo=Row(column_size=55, value_count=2, null_value_count=0, nan_value_count=None, lower_bound='a', upper_bound='c'))
1 | s3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2428-3743fe19-d841-4d1c-b940-bd73692b29c7-00001-deletes.parquet | PARQUET | 0 | Row(dt_day=datetime.date(2023, 4, 28)) | 1 | 1490 | {2147483546: 142, 2147483545: 46} | {2147483546: 1, 2147483545: 1} | {2147483546: 0, 2147483545: 0} | {} | {2147483546: bytearray(b's3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2424-8d75df2c-4639-49a4-82e6-1772365a9355-00001.parquet'), 2147483545: bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00')} | {2147483546: bytearray(b's3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2424-8d75df2c-4639-49a4-82e6-1772365a9355-00001.parquet'), 2147483545: bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00')} | None | None | None | None | Row(bar=Row(column_size=None, value_count=None, null_value_count=None, nan_value_count=None, lower_bound=None, upper_bound=None), dt=Row(column_size=None, value_count=None, null_value_count=None, nan_value_count=None, lower_bound=None, upper_bound=None), foo=Row(column_size=None, value_count=None, null_value_count=None, nan_value_count=None, lower_bound=None, upper_bound=None))
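As an aside, the opaque `lower_bounds`/`upper_bounds` byte arrays above can be decoded by hand: per the Iceberg spec's single-value serialization, `int` bounds are 4-byte little-endian, `timestamp` bounds are 8-byte little-endian microseconds since the Unix epoch, and the delete file's `pos` bound (reserved field id 2147483545) is a little-endian `long`. A quick plain-Python sketch, using values copied from the tables above:

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# `bar` (int, field id 2) bounds from the data file: 4-byte little-endian ints.
bar_lower = struct.unpack('<i', b'\x01\x00\x00\x00')[0]  # 1
bar_upper = struct.unpack('<i', b'\x03\x00\x00\x00')[0]  # 3

# `dt` (timestamp, field id 3) lower bound: 8-byte little-endian micros since epoch.
# Adding a timedelta avoids the precision loss of float division by 1e6.
micros = struct.unpack('<q', b',\x92\x12\xc5e\xfa\x05\x00')[0]
dt_lower = EPOCH + timedelta(microseconds=micros)  # 2023-04-28 13:51:28.522796 UTC

# The delete file's `pos` bound (field id 2147483545): the deleted row's ordinal
# within the referenced data file, i.e. row 1 = ('b', 2, ...).
pos = struct.unpack('<q', b'\x01\x00\x00\x00\x00\x00\x00\x00')[0]  # 1

print(bar_lower, bar_upper, dt_lower.isoformat(), pos)
```

The bounds confirm what the delete file encodes: "remove row 1 of the old `00000-2424-...` data file", which is exactly the file that the rewrite replaced.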
Double-checking, the table contents still look good:

```sql
SELECT * FROM default.t1
```

foo | bar | dt
-- | -- | --
a | 1 | 2023-04-28 13:51:28.522796
c | 3 | 2023-04-28 13:51:28.522837

Also in the snapshots:

```sql
SELECT * FROM default.t1.snapshots
```

The summary still mentions 3 rows:

committed_at | snapshot_id | parent_id | operation | manifest_list | summary
-- | -- | -- | -- | -- | --
2023-04-28 13:51:28.654000 | 1830440247885882398 | None | append | s3://warehouse/default/t1/metadata/snap-1830440247885882398-1-e17ec3b6-01a0-49e8-8be4-fffca9dbdaef.avro | {'spark.app.id': 'local-1682689536619', 'changed-partition-count': '1', 'added-data-files': '1', 'total-equality-deletes': '0', 'added-records': '3', 'total-position-deletes': '0', 'added-files-size': '897', 'total-delete-files': '0', 'total-files-size': '897', 'total-records': '3', 'total-data-files': '1'}
2023-04-28 13:51:34.738000 | 1705447835275510977 | 1830440247885882398 | overwrite | s3://warehouse/default/t1/metadata/snap-1705447835275510977-1-44a6bbb7-a1fc-4e6b-bed7-5f8a578a2a11.avro | {'spark.app.id': 'local-1682689536619', 'changed-partition-count': '1', 'added-position-deletes': '1', 'total-equality-deletes': '0', 'total-position-deletes': '1', 'added-position-delete-files': '1', 'added-files-size': '1490', 'total-delete-files': '1', 'added-delete-files': '1', 'total-files-size': '2387', 'total-records': '3', 'total-data-files': '1'}
2023-04-28 15:06:04.304000 | 1992524767636422274 | 1705447835275510977 | replace | s3://warehouse/default/t1/metadata/snap-1992524767636422274-1-8c696509-2fee-4dd8-ae4c-1729da3bfe17.avro | {'added-data-files': '1', 'total-equality-deletes': '0', 'added-records': '2', 'deleted-data-files': '1', 'deleted-records': '3', 'total-records': '2', 'removed-files-size': '897', 'changed-partition-count': '1', 'total-position-deletes': '1', 'added-files-size': '909', 'total-delete-files': '1', 'total-files-size': '2399', 'total-data-files': '1'}

I thought maybe we need to rewrite the manifests:

```sql
CALL system.rewrite_manifests('default.t1')
```

rewritten_manifests_count | added_manifests_count
-- | --
2 | 1

```sql
SELECT * FROM default.t1.snapshots
```

This adds a new snapshot, but the delete file is still being tracked:

committed_at | snapshot_id | parent_id | operation | manifest_list | summary
-- | -- | -- | -- | -- | --
2023-04-28 13:51:28.654000 | 1830440247885882398 | None | append | s3://warehouse/default/t1/metadata/snap-1830440247885882398-1-e17ec3b6-01a0-49e8-8be4-fffca9dbdaef.avro | {'spark.app.id': 'local-1682689536619', 'changed-partition-count': '1', 'added-data-files': '1', 'total-equality-deletes': '0', 'added-records': '3', 'total-position-deletes': '0', 'added-files-size': '897', 'total-delete-files': '0', 'total-files-size': '897', 'total-records': '3', 'total-data-files': '1'}
2023-04-28 13:51:34.738000 | 1705447835275510977 | 1830440247885882398 | overwrite | s3://warehouse/default/t1/metadata/snap-1705447835275510977-1-44a6bbb7-a1fc-4e6b-bed7-5f8a578a2a11.avro | {'spark.app.id': 'local-1682689536619', 'changed-partition-count': '1', 'added-position-deletes': '1', 'total-equality-deletes': '0', 'total-position-deletes': '1', 'added-position-delete-files': '1', 'added-files-size': '1490', 'total-delete-files': '1', 'added-delete-files': '1', 'total-files-size': '2387', 'total-records': '3', 'total-data-files': '1'}
2023-04-28 15:06:04.304000 | 1992524767636422274 | 1705447835275510977 | replace | s3://warehouse/default/t1/metadata/snap-1992524767636422274-1-8c696509-2fee-4dd8-ae4c-1729da3bfe17.avro | {'added-data-files': '1', 'total-equality-deletes': '0', 'added-records': '2', 'deleted-data-files': '1', 'deleted-records': '3', 'total-records': '2', 'removed-files-size': '897', 'changed-partition-count': '1', 'total-position-deletes': '1', 'added-files-size': '909', 'total-delete-files': '1', 'total-files-size': '2399', 'total-data-files': '1'}
2023-04-28 18:01:58.283000 | 5057249889609572796 | 1992524767636422274 | replace | s3://warehouse/default/t1/metadata/snap-5057249889609572796-1-31f5ecc8-a81f-40c6-8a9d-205a52f1c64c.avro | {'changed-partition-count': '0', 'total-equality-deletes': '0', 'manifests-created': '1', 'total-position-deletes': '1', 'total-delete-files': '1', 'manifests-kept': '0', 'manifests-replaced': '2', 'entries-processed': '0', 'total-files-size': '2399', 'total-records': '2', 'total-data-files': '1'}

Also, the files are still incorrect:

```sql
SELECT * FROM default.t1.files
```

content | file_path | file_format | spec_id | partition | record_count | file_size_in_bytes | column_sizes | value_counts | null_value_counts | nan_value_counts | lower_bounds | upper_bounds | key_metadata | split_offsets | equality_ids | sort_order_id | readable_metrics
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
0 | s3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2433-3108a2a5-b3f3-4c16-849b-4c54604584f5-00001.parquet | PARQUET | 0 | Row(dt_day=datetime.date(2023, 4, 28)) | 2 | 909 | {1: 55, 2: 53, 3: 60} | {1: 2, 2: 2, 3: 2} | {1: 0, 2: 0, 3: 0} | {} | {1: bytearray(b'a'), 2: bytearray(b'\x01\x00\x00\x00'), 3: bytearray(b',\x92\x12\xc5e\xfa\x05\x00')} | {1: bytearray(b'c'), 2: bytearray(b'\x03\x00\x00\x00'), 3: bytearray(b'U\x92\x12\xc5e\xfa\x05\x00')} | None | [4] | None | 0 | Row(bar=Row(column_size=53, value_count=2, null_value_count=0, nan_value_count=None, lower_bound=1, upper_bound=3), dt=Row(column_size=60, value_count=2, null_value_count=0, nan_value_count=None, lower_bound=datetime.datetime(2023, 4, 28, 13, 51, 28, 522796), upper_bound=datetime.datetime(2023, 4, 28, 13, 51, 28, 522837)), foo=Row(column_size=55, value_count=2, null_value_count=0, nan_value_count=None, lower_bound='a', upper_bound='c'))
1 | s3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2428-3743fe19-d841-4d1c-b940-bd73692b29c7-00001-deletes.parquet | PARQUET | 0 | Row(dt_day=datetime.date(2023, 4, 28)) | 1 | 1490 | {2147483546: 142, 2147483545: 46} | {2147483546: 1, 2147483545: 1} | {2147483546: 0, 2147483545: 0} | {} | {2147483546: bytearray(b's3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2424-8d75df2c-4639-49a4-82e6-1772365a9355-00001.parquet'), 2147483545: bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00')} | {2147483546: bytearray(b's3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2424-8d75df2c-4639-49a4-82e6-1772365a9355-00001.parquet'), 2147483545: bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00')} | None | None | None | None | Row(bar=Row(column_size=None, value_count=None, null_value_count=None, nan_value_count=None, lower_bound=None, upper_bound=None), dt=Row(column_size=None, value_count=None, null_value_count=None, nan_value_count=None, lower_bound=None, upper_bound=None), foo=Row(column_size=None, value_count=None, null_value_count=None, nan_value_count=None, lower_bound=None, upper_bound=None))

This looks like an inconsistency; can anyone confirm?

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
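For what it's worth, the dangling delete file is detectable from the `files` metadata table alone: a position delete file whose referenced data file path (reserved field id 2147483546 in `lower_bounds`/`upper_bounds`) no longer appears among the table's data files cannot mask any rows. A minimal plain-Python sketch over rows shaped like the output above; the dict-based row shape and the helper name are mine for illustration, not an Iceberg API:

```python
def dangling_position_deletes(files):
    """Return paths of position delete files referencing data files that
    are no longer part of the table.

    Only reliable when lower bound == upper bound for the referenced path,
    i.e. the delete file targets a single data file (as in this repro)."""
    # content codes in the files metadata table: 0 = data, 1 = position deletes
    data_paths = {f["file_path"] for f in files if f["content"] == 0}
    dangling = []
    for f in files:
        if f["content"] != 1:
            continue
        lo = f["lower_bounds"].get(2147483546)  # referenced data file path
        hi = f["upper_bounds"].get(2147483546)
        if lo is not None and lo == hi and lo.decode() not in data_paths:
            dangling.append(f["file_path"])
    return dangling

# Rows condensed from the final `files` output above: the rewritten data file
# and the delete file that still points at the replaced 00000-2424-... file.
files = [
    {"content": 0,
     "file_path": "s3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2433-3108a2a5-b3f3-4c16-849b-4c54604584f5-00001.parquet",
     "lower_bounds": {}, "upper_bounds": {}},
    {"content": 1,
     "file_path": "s3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2428-3743fe19-d841-4d1c-b940-bd73692b29c7-00001-deletes.parquet",
     "lower_bounds": {2147483546: b"s3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2424-8d75df2c-4639-49a4-82e6-1772365a9355-00001.parquet"},
     "upper_bounds": {2147483546: b"s3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2424-8d75df2c-4639-49a4-82e6-1772365a9355-00001.parquet"}},
]
print(dangling_position_deletes(files))  # the -deletes.parquet file is reported
```

This check confirms the delete file kept after `rewrite_data_files` is dangling: the only data file it references was removed by the rewrite.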