shardulm94 commented on a change in pull request #1499: URL: https://github.com/apache/iceberg/pull/1499#discussion_r494750352
##########
File path: site/docs/spec.md
##########
@@ -416,25 +530,91 @@ Notes:

 ### Delete Formats

-This section details how to encode row-level deletes in Iceberg metadata. Row-level deletes are not supported in the current format version 1. This part of the spec is not yet complete and will be completed as format version 2.
+This section details how to encode row-level deletes in Iceberg delete files. Row-level deletes are not supported in v1.
+
+Row-level delete files are valid Iceberg data files: files must use valid Iceberg formats, schemas, and column projection. It is recommended that delete files are written using the table's default file format.
+
+Row-level delete files are tracked by manifests, like data files. A separate set of manifests is used for delete files, but the manifest schemas are identical.

-#### Position-based Delete Files
+Both position and equality deletes allow encoding deleted row values with a delete. This can be used to reconstruct a stream of changes to a table.

-Position-based delete files identify rows in one or more data files that have been deleted.
+
+#### Position Delete Files
+
+Position-based delete files identify deleted rows by file and position in one or more data files, and may optionally contain the deleted row.
+
+A data row is deleted if there is an entry in a position delete file for the row's file and position in the data file, starting at 0.

 Position-based delete files store `file_position_delete`, a struct with the following fields:

-| Field id, name | Type | Description |
-|-------------------------|---------------------------------|--------------------------------------------------------------------------------------------------------------------------|
-| **`1 file_path`** | `required string` | The full URI of a data file with FS scheme. This must match the `file_path` of the target data file in a manifest entry. |
-| **`2 position`** | `required long` | The ordinal position of a deleted row in the target data file identified by `file_path`, starting at `0`. |
+| Field id, name | Type | Description |
+|-----------------------------|------------------------|-------------|
+| **`2147483546 file_path`** | `string` | Full URI or a data file with FS scheme. This must match the `file_path` of the target data file in a manifest entry |

Review comment:
Typo: Full URI `of`

##########
File path: site/docs/spec.md
##########
@@ -416,25 +530,91 @@ Notes:

 ### Delete Formats

-This section details how to encode row-level deletes in Iceberg metadata. Row-level deletes are not supported in the current format version 1. This part of the spec is not yet complete and will be completed as format version 2.
+This section details how to encode row-level deletes in Iceberg delete files. Row-level deletes are not supported in v1.
+
+Row-level delete files are valid Iceberg data files: files must use valid Iceberg formats, schemas, and column projection. It is recommended that delete files are written using the table's default file format.
+
+Row-level delete files are tracked by manifests, like data files. A separate set of manifests is used for delete files, but the manifest schemas are identical.

-#### Position-based Delete Files
+Both position and equality deletes allow encoding deleted row values with a delete. This can be used to reconstruct a stream of changes to a table.

-Position-based delete files identify rows in one or more data files that have been deleted.
+
+#### Position Delete Files
+
+Position-based delete files identify deleted rows by file and position in one or more data files, and may optionally contain the deleted row.
+
+A data row is deleted if there is an entry in a position delete file for the row's file and position in the data file, starting at 0.

 Position-based delete files store `file_position_delete`, a struct with the following fields:

-| Field id, name | Type | Description |
-|-------------------------|---------------------------------|--------------------------------------------------------------------------------------------------------------------------|
-| **`1 file_path`** | `required string` | The full URI of a data file with FS scheme. This must match the `file_path` of the target data file in a manifest entry. |
-| **`2 position`** | `required long` | The ordinal position of a deleted row in the target data file identified by `file_path`, starting at `0`. |
+| Field id, name | Type | Description |
+|-----------------------------|------------------------|-------------|
+| **`2147483546 file_path`** | `string` | Full URI or a data file with FS scheme. This must match the `file_path` of the target data file in a manifest entry |
+| **`2147483545 pos`** | `long` | Ordinal position of a deleted row in the target data file identified by `file_path`, starting at `0` |
+| **`2147483544 row`** | `required struct<...>` | Deleted row values. Omit the column when not storing deleted rows. |
+
+When the deleted row is present, its schema may be any subset of the table schema. When the column is present, the deleted row values for every delete must be stored.

 The rows in the delete file must be sorted by `file_path` then `position` to optimize filtering rows while scanning.

 * Sorting by `file_path` allows filter pushdown by file in columnar storage formats.
 * Sorting by `position` allows filtering rows while scanning, to avoid keeping deletes in memory.

-Though the delete files can be written using any supported data file format in Iceberg, it is recommended to write delete files with same file format as the table's file format.
+#### Equality Delete Files
+
+Equality delete files identify deleted rows in a collection of data files by one or more column values, and may optionally contain additional columns of the deleted row.
+
+Equality delete files store any subset of a table's columns and use the table's field ids. The _delete columns_ are the columns of the delete file used to match data rows. Delete columns are identified by id in the delete file [metadata column `equality_ids`](#manifests).
+
+A data row is deleted if its values are equal to all delete columns for any row in an equality delete file that applies to the row's data file (see [`Job Planning`](#job-planning)).

Review comment:
Should this be linked to `#scan-planning`?

##########
File path: site/docs/spec.md
##########
@@ -416,25 +530,91 @@ Notes:

 ### Delete Formats

-This section details how to encode row-level deletes in Iceberg metadata. Row-level deletes are not supported in the current format version 1. This part of the spec is not yet complete and will be completed as format version 2.
+This section details how to encode row-level deletes in Iceberg delete files. Row-level deletes are not supported in v1.
+
+Row-level delete files are valid Iceberg data files: files must use valid Iceberg formats, schemas, and column projection. It is recommended that delete files are written using the table's default file format.

Review comment:
What is the reason behind the recommendation for delete files and table data to have the same file format?
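
For readers following the quoted hunks, the two delete semantics described in the diff can be sketched roughly as below. This is only an illustration under stated assumptions, not Iceberg's implementation: the function names, the dict-based row representation, and matching delete columns by name (the spec text uses field ids via `equality_ids`) are all hypothetical, and the position deletes are held in an in-memory set rather than streamed in the sorted order the spec text recommends.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, Any]  # hypothetical row representation: column name -> value


def apply_position_deletes(
    file_path: str,
    rows: Iterable[Row],
    deletes: Iterable[Tuple[str, int]],
) -> Iterator[Row]:
    """Yield the rows of one data file that have no (file_path, pos) delete entry.

    Positions are ordinal and start at 0, as in the quoted spec text.
    """
    # Keep only the delete positions that target this data file.
    deleted = {pos for path, pos in deletes if path == file_path}
    for pos, row in enumerate(rows):
        if pos not in deleted:
            yield row


def apply_equality_deletes(
    rows: Iterable[Row],
    delete_rows: Iterable[Row],
    delete_columns: List[str],
) -> Iterator[Row]:
    """Yield rows whose delete-column values match no row of the delete file.

    A row is dropped only when it equals some delete row on ALL delete columns.
    """
    keys = {tuple(d[c] for c in delete_columns) for d in delete_rows}
    for row in rows:
        if tuple(row[c] for c in delete_columns) not in keys:
            yield row


if __name__ == "__main__":
    data = [{"id": 1}, {"id": 2}, {"id": 3}]
    # The delete entry for "b.parquet" does not apply to "a.parquet".
    print(list(apply_position_deletes("a.parquet", data, [("a.parquet", 0), ("b.parquet", 1)])))
    print(list(apply_equality_deletes(data, [{"id": 2}], ["id"])))
```

The sketch leaves out which delete files apply to which data files, which the quoted text defers to the planning section of the spec.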