rdsr commented on a change in pull request #1499:
URL: https://github.com/apache/iceberg/pull/1499#discussion_r494070093
##########
File path: site/docs/spec.md
##########
@@ -208,136 +256,202 @@ Notes:

 ### Manifests

-A manifest is an immutable Avro file that lists a set of data files, along with each file’s partition data tuple, metrics, and tracking information. One or more manifest files are used to store a snapshot, which tracks all of the files in a table at some point in time.
+A manifest is an immutable Avro file that lists data files or delete files, along with each file’s partition data tuple, metrics, and tracking information. One or more manifest files are used to store a [snapshot](#snapshots), which tracks all of the files in a table at some point in time. Manifests are tracked by a [manifest list](#manifest-lists) for each table snapshot.
+
+A manifest is a valid Iceberg data file: files must use valid Iceberg formats, schemas, and column projection.
+
+A manifest may store either data files or delete files, but not both because manifests that contain delete files are scanned first during job planning. Whether a manifest is a data manifest or a delete manifest is stored in manifest metadata.

-A manifest is a valid Iceberg data file. Files must use Iceberg schemas and column projection.
+A manifest stores files for a single partition spec. When a table’s partition spec changes, old files remain in the older manifest and newer files are written to a new manifest. This is required because a manifest file’s schema is based on its partition spec (see below). The partition spec of each manifest is also used to transform predicates on the table's data rows into predicates on partition values that are used during job planning to select files from a manifest.

-A manifest stores files for a single partition spec. When a table’s partition spec changes, old files remain in the older manifest and newer files are written to a new manifest. This is required because a manifest file’s schema is based on its partition spec (see below). This restriction also simplifies selecting files from a manifest because the same boolean expression can be used to select or filter all rows.
+A manifest file must store the partition spec and other metadata as properties in the Avro file's key-value metadata:

-The partition spec for a manifest and the current table schema must be stored in the key-value properties of the manifest file. The partition spec is stored as a JSON string under the key `partition-spec`. The table schema is stored as a JSON string under the key `schema`.
+
+| v1         | v2         | Key                 | Value |
+|------------|------------|---------------------|-------|
+| _required_ | _required_ | `schema`            | JSON representation of the table schema at the time the manifest was written |
+| _required_ | _required_ | `partition-spec`    | JSON fields representation of the partition spec used to write the manifest |
+| _optional_ | _required_ | `partition-spec-id` | Id of the partition spec used to write the manifest as a string |
+| _optional_ | _required_ | `format-version`    | Table format version number of the manifest as a string |
+|            | _required_ | `content`           | Type of content files tracked by the manifest: "data" or "deletes" |
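As a side note for implementers, the keys above are ordinary Avro file metadata. Below is a minimal sketch using the plain Avro API, not Iceberg's own manifest writer; the two-field `manifest_entry` schema, the file name, and the JSON strings are simplified placeholders.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.io.IOException;

public class ManifestMetadataSketch {
  public static void main(String[] args) throws IOException {
    // Stand-in for the real manifest_entry schema; only two fields for brevity.
    Schema manifestEntry = SchemaBuilder.record("manifest_entry").fields()
        .requiredInt("status")
        .requiredString("file_path")
        .endRecord();

    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(manifestEntry))) {
      // Key-value metadata must be set before create(); the JSON strings are placeholders.
      writer.setMeta("schema", "{\"type\": \"struct\", \"fields\": []}");
      writer.setMeta("partition-spec", "[]");
      writer.setMeta("partition-spec-id", "0");
      writer.setMeta("format-version", "2");
      writer.setMeta("content", "data"); // or "deletes" for a delete manifest
      writer.create(manifestEntry, new File("manifest.avro"));
      // ... append manifest_entry records here ...
    }
  }
}
```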
 The schema of a manifest file is a struct called `manifest_entry` with the following fields:

-| Field id, name      | Type                                                      | Description |
-|---------------------|-----------------------------------------------------------|-------------|
-| **`0 status`**      | `int` with meaning: `0: EXISTING` `1: ADDED` `2: DELETED` | Used to track additions and deletions |
-| **`1 snapshot_id`** | `long`                                                    | Snapshot id where the file was added, or deleted if status is 2 |
-| **`2 data_file`**   | `data_file` `struct` (see below)                          | File path, partition tuple, metrics, ... |
+| v1         | v2         | Field id, name          | Type                                                      | Description |
+| ---------- | ---------- |-------------------------|-----------------------------------------------------------|-------------|
+| _required_ | _required_ | **`0 status`**          | `int` with meaning: `0: EXISTING` `1: ADDED` `2: DELETED` | Used to track additions and deletions |
+| _required_ | _optional_ | **`1 snapshot_id`**     | `long`                                                    | Snapshot id where the file was added, or deleted if status is 2. Inherited when null. |
+|            | _optional_ | **`3 sequence_number`** | `long`                                                    | Sequence number when the file was added. Inherited when null. |
+| _required_ | _required_ | **`2 data_file`**       | `data_file` `struct` (see below)                          | File path, partition tuple, metrics, ... |

 `data_file` is a struct with the following fields:

-| Field id, name                    | Type                                  | Description |
-|-----------------------------------|---------------------------------------|-------------|
-| **`100 file_path`**               | `string`                              | Full URI for the file with FS scheme |
-| **`101 file_format`**             | `string`                              | String file format name, avro, orc or parquet |
-| **`102 partition`**               | `struct<...>`                         | Partition data tuple, schema based on the partition spec |
-| **`103 record_count`**            | `long`                                | Number of records in this file |
-| **`104 file_size_in_bytes`**      | `long`                                | Total file size in bytes |
-| ~~**`105 block_size_in_bytes`**~~ | `long`                                | **Deprecated. Always write a default value and do not read.** |
-| ~~**`106 file_ordinal`**~~        | `optional int`                        | **Deprecated. Do not use.** |
-| ~~**`107 sort_columns`**~~        | `optional list`                       | **Deprecated. Do not use.** |
-| **`108 column_sizes`**            | `optional map`                        | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro). |
-| **`109 value_counts`**            | `optional map`                        | Map from column id to number of values in the column (including null values) |
-| **`110 null_value_counts`**       | `optional map`                        | Map from column id to number of null values in the column |
-| ~~**`111 distinct_counts`**~~     | `optional map`                        | **Deprecated. Do not use.** |
-| **`125 lower_bounds`**            | `optional map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all values in the column for the file. |
-| **`128 upper_bounds`**            | `optional map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all values in the column for the file. |
-| **`131 key_metadata`**            | `optional binary`                     | Implementation-specific key metadata for encryption |
-| **`132 split_offsets`**           | `optional list`                       | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending. |
+| v1         | v2         | Field id, name                    | Type                         | Description |
+| ---------- | ---------- |-----------------------------------|------------------------------|-------------|
+|            | _required_ | **`134 content`**                 | `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) |
+| _required_ | _required_ | **`100 file_path`**               | `string`                     | Full URI for the file with FS scheme |
+| _required_ | _required_ | **`101 file_format`**             | `string`                     | String file format name, avro, orc or parquet |
+| _required_ | _required_ | **`102 partition`**               | `struct<...>`                | Partition data tuple, schema based on the partition spec |
+| _required_ | _required_ | **`103 record_count`**            | `long`                       | Number of records in this file |
+| _required_ | _required_ | **`104 file_size_in_bytes`**      | `long`                       | Total file size in bytes |
+| _required_ |            | ~~**`105 block_size_in_bytes`**~~ | `long`                       | **Deprecated. Always write a default in v1. Do not write in v2.** |
+| _optional_ |            | ~~**`106 file_ordinal`**~~        | `int`                        | **Deprecated. Do not write.** |
+| _optional_ |            | ~~**`107 sort_columns`**~~        | `list<112: int>`             | **Deprecated. Do not write.** |
+| _optional_ | _optional_ | **`108 column_sizes`**            | `map<117: int, 118: long>`   | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro) |
+| _optional_ | _optional_ | **`109 value_counts`**            | `map<119: int, 120: long>`   | Map from column id to number of values in the column (including null values) |
+| _optional_ | _optional_ | **`110 null_value_counts`**       | `map<121: int, 122: long>`   | Map from column id to number of null values in the column |
+| _optional_ |            | ~~**`111 distinct_counts`**~~     | `map<123: int, 124: long>`   | **Deprecated. Do not write.** |
+| _optional_ | _optional_ | **`125 lower_bounds`**            | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all values in the column for the file |
+| _optional_ | _optional_ | **`128 upper_bounds`**            | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all values in the column for the file |
+| _optional_ | _optional_ | **`131 key_metadata`**            | `binary`                     | Implementation-specific key metadata for encryption |
+| _optional_ | _optional_ | **`132 split_offsets`**           | `list<133: long>`            | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending |
+|            | _optional_ | **`135 equality_ids`**            | `list<136: int>`             | Field ids used to determine row equality in equality delete files. Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file |

 Notes:

 1. Single-value serialization for lower and upper bounds is detailed in Appendix D.

-The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec for the manifest file.
+The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec used to write the manifest file. In v2, the partition struct's field ids must match the ids from the partition spec.

-Each manifest file must store its partition spec and the current table schema in the Avro file’s key-value metadata. The partition spec is used to transform predicates on the table’s data rows into predicates on the manifest’s partition values during job planning.
+The column metrics maps are used when filtering to select both data and delete files. For delete files, the metrics must store bounds and counts for all deleted rows, or must be omitted. Storing metrics for deleted rows ensures that the values can be used during job planning to find delete files that must be merged during a scan.
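To make the use of column metrics concrete, here is an illustrative sketch, not the Iceberg evaluator, of filtering a data or delete file with an equality predicate. It assumes the binary `lower_bounds`/`upper_bounds` values have already been decoded to longs; the field id and bounds are made up.

```java
import java.util.Map;

public class MetricsFilterSketch {
  // Returns true if the file may contain a matching row for an equality predicate
  // on the given field id; false means the file can be skipped during planning.
  static boolean mayContain(Map<Integer, Long> lowerBounds, Map<Integer, Long> upperBounds,
                            int fieldId, long value) {
    Long lower = lowerBounds.get(fieldId);
    Long upper = upperBounds.get(fieldId);
    if (lower == null || upper == null) {
      return true;  // no bounds recorded: cannot prove the file has no matches
    }
    return lower <= value && value <= upper;
  }

  public static void main(String[] args) {
    Map<Integer, Long> lower = Map.of(1, 100L);  // field id 1: min value in the file
    Map<Integer, Long> upper = Map.of(1, 200L);  // field id 1: max value in the file
    System.out.println(mayContain(lower, upper, 1, 42L));   // false: skip this file
    System.out.println(mayContain(lower, upper, 1, 150L));  // true: must read this file
  }
}
```

The same check applies to a delete file whose metrics describe its deleted rows, which is how planning decides whether the delete file must be merged into a scan.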
 #### Manifest Entry Fields

 The manifest entry fields are used to keep track of the snapshot in which files were added or logically deleted. The `data_file` struct is nested inside of the manifest entry so that it can be easily passed to job planning without the manifest entry fields.

-When a data file is added to the dataset, it’s manifest entry should store the snapshot ID in which the file was added and set status to 1 (added).
+When a file is added to the dataset, its manifest entry should store the snapshot ID in which the file was added and set status to 1 (added).

-When a data file is replaced or deleted from the dataset, it’s manifest entry fields store the snapshot ID in which the file was deleted and status 2 (deleted). The file may be deleted from the file system when the snapshot in which it was deleted is garbage collected, assuming that older snapshots have also been garbage collected [1].
+When a file is replaced or deleted from the dataset, its manifest entry fields store the snapshot ID in which the file was deleted and status 2 (deleted). The file may be deleted from the file system when the snapshot in which it was deleted is garbage collected, assuming that older snapshots have also been garbage collected [1].
+
+Iceberg v2 adds a sequence number to the entry and makes the snapshot id optional. Both fields, `sequence_number` and `snapshot_id`, are inherited from manifest metadata when `null`. That is, if the field is `null` for an entry, then the entry must inherit its value from the manifest file's metadata, stored in the manifest list [2].

 Notes:

 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
+2. Manifest list files are required in v2, so that the `sequence_number` and `snapshot_id` to inherit are always available.
+
+#### Sequence Number Inheritance
+
+Manifests track the sequence number when a data or delete file was added to the table.
+
+When adding a new file, its sequence number is set to `null` because the snapshot's sequence number is not assigned until the snapshot is successfully committed. When reading, sequence numbers are inherited by replacing `null` with the manifest's sequence number from the manifest list.
+
+When writing an existing file to a new manifest, the sequence number must be non-null and set to the sequence number that was inherited.
+
+Inheriting sequence numbers through the metadata tree allows writing a new manifest without a known sequence number, so that a manifest can be written once and reused in commit retries. To change a sequence number for a retry, only the manifest list must be rewritten.
+
+When reading v1 manifests with no sequence number column, sequence numbers for all files must default to 0.
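A minimal sketch of the inheritance rule described above, in plain Java rather than Iceberg's reader classes, assuming already-decoded entry and manifest-list values:

```java
public class SequenceNumberInheritanceSketch {
  // Entry values may be null in the manifest; the manifest-level values come from
  // the manifest list (and default to 0 when reading v1 metadata).
  static long inheritSequenceNumber(Long entrySequenceNumber, long manifestSequenceNumber) {
    return entrySequenceNumber != null ? entrySequenceNumber : manifestSequenceNumber;
  }

  static long inheritSnapshotId(Long entrySnapshotId, long manifestSnapshotId) {
    return entrySnapshotId != null ? entrySnapshotId : manifestSnapshotId;
  }

  public static void main(String[] args) {
    // A newly added file is written with null and picks up the manifest's number.
    System.out.println(inheritSequenceNumber(null, 7L)); // 7
    // An existing file rewritten into a new manifest keeps its explicit number.
    System.out.println(inheritSequenceNumber(3L, 7L));   // 3
  }
}
```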
 ### Snapshots

 A snapshot consists of the following fields:

-* **`snapshot-id`** -- A unique long ID.
-* **`parent-snapshot-id`** -- (Optional) The snapshot ID of the snapshot’s parent. This field is not present for snapshots that have no parent snapshot, such as snapshots created before this field was added or the first snapshot of a table.
-* **`sequence-number`** -- A monotonically increasing long that tracks the order of snapshots in a table. (**v2 only**)
-* **`timestamp-ms`** -- A timestamp when the snapshot was created. This is used when garbage collecting snapshots.
-* **`manifests`** -- A list of manifest file locations. The data files in a snapshot are the union of all data files listed in these manifests. (Deprecated in favor of `manifest-list`)
-* **`manifest-list`** -- (Optional) The location of a manifest list file for this snapshot, which contains a list of manifest files with additional metadata. If present, the manifests field must be omitted.
-* **`summary`** -- (Optional) A summary that encodes the `operation` that produced the snapshot and other relevant information specific to that operation. This allows some operations like snapshot expiration to skip processing some snapshots. Possible values of `operation` are:
-  * `append` -- Data files were added and no files were removed.
-  * `replace` -- Data files were rewritten with the same data; i.e., compaction, changing the data file format, or relocating data files.
-  * `overwrite` -- Data files were deleted and added in a logical overwrite operation.
-  * `delete` -- Data files were removed and their contents logically deleted.
+| v1         | v2         | Field                    | Description |
+| ---------- | ---------- | ------------------------ | ----------- |
+| _required_ | _required_ | **`snapshot-id`**        | A unique long ID |
+| _optional_ | _optional_ | **`parent-snapshot-id`** | The snapshot ID of the snapshot's parent. Omitted for any snapshot with no parent |
+|            | _required_ | **`sequence-number`**    | A monotonically increasing long that tracks the order of changes to a table |
+| _required_ | _required_ | **`timestamp-ms`**       | A timestamp when the snapshot was created, used for garbage collection and table inspection |
+| _optional_ | _required_ | **`manifest-list`**      | The location of a manifest list for this snapshot that tracks manifest files with additional metadata |
+| _optional_ |            | **`manifests`**          | A list of manifest file locations. Must be omitted if `manifest-list` is present |
+| _optional_ | _required_ | **`summary`**            | A string map that summarizes the snapshot changes, including `operation` (see below) |

-Snapshots can be split across more than one manifest. This enables:
+The snapshot summary's `operation` field is used by some operations, like snapshot expiration, to skip processing certain snapshots. Possible `operation` values are:
+
+* `append` -- Only data files were added and no files were removed.
+* `replace` -- Data and delete files were added and removed without changing table data; i.e., compaction, changing the data file format, or relocating data files.
+* `overwrite` -- Data and delete files were added and removed in a logical overwrite operation.
+* `delete` -- Data files were removed and their contents logically deleted and/or delete files were added to delete rows.
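As an illustration of how the `operation` value can let maintenance skip work, here is a hypothetical sketch; `mayHaveRemovedFiles` is not an Iceberg API, just an example predicate an expiration routine might apply to the summary map.

```java
import java.util.Map;

public class SnapshotSummarySketch {
  // An "append" snapshot only adds data files, so expiration logic does not need
  // to look for logically deleted files in its manifests. Any other operation,
  // or a missing one, is handled conservatively.
  static boolean mayHaveRemovedFiles(Map<String, String> summary) {
    return !"append".equals(summary.get("operation"));
  }

  public static void main(String[] args) {
    System.out.println(mayHaveRemovedFiles(Map.of("operation", "append")));    // false
    System.out.println(mayHaveRemovedFiles(Map.of("operation", "overwrite"))); // true
  }
}
```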
+
+Data and delete files for a snapshot can be stored in more than one manifest. This enables:

 * Appends can add a new manifest to minimize the amount of data written, instead of adding new records by rewriting and appending to an existing manifest. (This is called a “fast append”.)
 * Tables can use multiple partition specs. A table’s partition configuration can evolve if, for example, its data volume changes. Each manifest uses a single partition spec, and queries do not need to change because partition filters are derived from data predicates.
 * Large tables can be split across multiple manifests so that implementations can parallelize job planning or reduce the cost of rewriting a manifest.

+Manifests for a snapshot are tracked by a manifest list.
+
 Valid snapshots are stored as a list in table metadata. For serialization, see Appendix C.

+#### Manifest Lists
+
+Snapshots are embedded in table metadata, but the list of manifests for a snapshot is stored in a separate manifest list file.
+
+A new manifest list is written for each attempt to commit a snapshot because the list of manifests always changes to produce a new snapshot. When a manifest list is written, the (optimistic) sequence number of the snapshot is written for all new manifest files tracked by the list.
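To illustrate this design choice, a hypothetical sketch of a retried commit: the manifest written for the first attempt is reused, and only the manifest list entries carry the new sequence number. The paths and numbers are made up.

```java
import java.util.List;

public class CommitRetrySketch {
  // New manifests are written once with inheritable (null) sequence numbers; each
  // commit attempt writes a manifest list that assigns that attempt's number.
  static void writeManifestList(List<String> newManifests, long attemptSequenceNumber) {
    for (String manifestPath : newManifests) {
      // In a real writer this would be a manifest_file entry; here we just print it.
      System.out.println(manifestPath + " -> sequence_number " + attemptSequenceNumber);
    }
  }

  public static void main(String[] args) {
    List<String> manifests = List.of("s3://bucket/metadata/new-manifest.avro");
    writeManifestList(manifests, 10);  // first attempt
    writeManifestList(manifests, 11);  // retry: same manifest file, new manifest list
  }
}
```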
+
+A manifest list includes summary metadata that can be used to avoid scanning all of the manifests in a snapshot when planning a table scan. This includes the number of added, existing, and deleted files, and a summary of values for each field of the partition spec used to write the manifest.
+
+A manifest list is a valid Iceberg data file: files must use valid Iceberg formats, schemas, and column projection.
+
+Manifest list files store `manifest_file`, a struct with the following fields:
+
+| v1         | v2         | Field id, name                 | Type                                        | Description |
+| ---------- | ---------- |--------------------------------|---------------------------------------------|-------------|
+| _required_ | _required_ | **`500 manifest_path`**        | `string`                                    | Location of the manifest file |
+| _required_ | _required_ | **`501 manifest_length`**      | `long`                                      | Length of the manifest file |
+| _required_ | _required_ | **`502 partition_spec_id`**    | `int`                                       | ID of a partition spec for the table; must be listed in table metadata `partition-specs` |
+|            | _required_ | **`517 content`**              | `int` with meaning: `0: data`, `1: deletes` | The type of files tracked by the manifest, either data or delete files; 0 for all v1 manifests |
+|            | _required_ | **`515 sequence_number`**      | `long`                                      | The sequence number when the manifest was added to the table; use 0 when reading v1 manifest lists |
+|            | _required_ | **`516 min_sequence_number`**  | `long`                                      | The minimum sequence number of all data or delete files in the manifest; use 0 when reading v1 manifest lists |
+| _required_ | _required_ | **`503 added_snapshot_id`**    | `long`                                      | ID of the snapshot where the manifest file was added |
+| _optional_ | _required_ | **`504 added_files_count`**    | `int`                                       | Number of entries in the manifest that have status `ADDED` (1), when `null` this is assumed to be non-zero |
+| _optional_ | _required_ | **`505 existing_files_count`** | `int`                                       | Number of entries in the manifest that have status `EXISTING` (0), when `null` this is assumed to be non-zero |
+| _optional_ | _required_ | **`506 deleted_files_count`**  | `int`                                       | Number of entries in the manifest that have status `DELETED` (2), when `null` this is assumed to be non-zero |
+| _optional_ | _required_ | **`512 added_rows_count`**     | `long`                                      | Number of rows in all of files in the manifest that have status `ADDED`, when `null` this is assumed to be non-zero |
+| _optional_ | _required_ | **`513 existing_rows_count`**  | `long`                                      | Number of rows in all of files in the manifest that have status `EXISTING`, when `null` this is assumed to be non-zero |
+| _optional_ | _required_ | **`514 deleted_rows_count`**   | `long`                                      | Number of rows in all of files in the manifest that have status `DELETED`, when `null` this is assumed to be non-zero |
+| _optional_ | _optional_ | **`507 partitions`**           | `list<508: field_summary>` (see below)      | A list of field summaries for each partition field in the spec. Each field in the list corresponds to a field in the manifest file’s partition spec. |
+
+`field_summary` is a struct with the following fields:
+
+| v1         | v2         | Field id, name          | Type        | Description |
+| ---------- | ---------- |-------------------------|-------------|-------------|
+| _required_ | _required_ | **`509 contains_null`** | `boolean`   | Whether the manifest contains at least one partition with a null value for the field |
+| _optional_ | _optional_ | **`510 lower_bound`**   | `bytes` [1] | Lower bound for the non-null values in the partition field, or null if all values are null |
+| _optional_ | _optional_ | **`511 upper_bound`**   | `bytes` [1] | Upper bound for the non-null values in the partition field, or null if all values are null |
+
+Notes:
+
+1. Lower and upper bounds are serialized to bytes using the single-object serialization in Appendix D. The type used to encode the value is the type of the partition field data.
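Here is an illustrative sketch, not the Iceberg planner, of skipping a manifest using the entry counts and one partition `field_summary` for a predicate like `ts_day >= X`; it assumes decoded integer bounds, and all values are made up.

```java
public class ManifestFilterSketch {
  // Decide whether a manifest must be read for a predicate ts_day >= predicateLower,
  // given manifest_file counts and the decoded field_summary upper bound for ts_day.
  static boolean scanManifest(Integer addedFilesCount, Integer existingFilesCount,
                              Integer upperBound, int predicateLower) {
    // Null counts are assumed to be non-zero, so only skip when both are known to be 0.
    if (addedFilesCount != null && existingFilesCount != null
        && addedFilesCount == 0 && existingFilesCount == 0) {
      return false; // no live entries in this manifest
    }
    if (upperBound == null) {
      return false; // all partition values for the field are null; ts_day >= X cannot match
    }
    return upperBound >= predicateLower; // some partition in the manifest may match
  }

  public static void main(String[] args) {
    System.out.println(scanManifest(0, 0, 18600, 18550));    // false: only deleted entries
    System.out.println(scanManifest(3, null, 18600, 18650)); // false: bounds rule it out
    System.out.println(scanManifest(3, null, 18600, 18550)); // true: must read the manifest
  }
}
```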

 #### Scan Planning

-Scans are planned by reading the manifest files for the current snapshot listed in the table metadata. Deleted entries in a manifest are not included in the scan.
+Scans are planned by reading the manifest files for the current snapshot. Deleted entries in data and delete manifests are not used in a scan.
+
+Manifests that contain no matching files, determined using either file counts or partition summaries, may be skipped.

-For each manifest, scan predicates, which filter data rows, are converted to partition predicates, which filter data files. These partition predicates are used to select the data files in the manifest. This conversion uses the partition spec used to write the manifest file.
+For each manifest, scan predicates, which filter data rows, are converted to partition predicates, which filter data and delete files. These partition predicates are used to select the data and delete files in the manifest. This conversion uses the partition spec used to write the manifest file.

-Scan predicates are converted to partition predicates using an inclusive projection: if a scan predicate matches a row, then the partition predicate must match that row’s partition. This is an _inclusive projection_ [1] because rows that do not match the scan predicate may be included in the scan by the partition predicate.
+Scan predicates are converted to partition predicates using an _inclusive projection_: if a scan predicate matches a row, then the partition predicate must match that row’s partition. This is called _inclusive_ [1] because rows that do not match the scan predicate may be included in the scan by the partition predicate.

 For example, an `events` table with a timestamp column named `ts` that is partitioned by `ts_day=day(ts)` is queried by users with ranges over the timestamp column: `ts > X`. The inclusive projection is `ts_day >= day(X)`, which is used to select files that may have matching rows. Note that, in most cases, timestamps just before `X` will be included in the scan because the file contains rows that match the predicate and rows that do not match the predicate.
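A small sketch of the projection in this example, using illustrative names rather than the Iceberg expression API:

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

public class InclusiveProjectionSketch {
  // day(ts): days from the unix epoch, the transform behind the ts_day partition field.
  static long day(Instant ts) {
    return LocalDate.ofInstant(ts, ZoneOffset.UTC).toEpochDay();
  }

  public static void main(String[] args) {
    Instant x = Instant.parse("2020-09-23T10:15:30Z");
    // The row filter ts > X projects to the partition filter ts_day >= day(X).
    // Rows with ts just before X can share a day (and a file) with matching rows,
    // which is why the projection is inclusive rather than exact.
    System.out.println("ts > " + x + "  =>  ts_day >= " + day(x));
  }
}
```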
-Notes:
+Scan predicates are also used to filter data and delete files using column bounds and counts that are stored by field id in manifests. The same filter logic can be used for both data and delete files because both stored metrics of the rows either inserted or deleted. If metrics show that a delete file has no rows that match a scan predicate, it may be ignored just as a data file would be ignored [2].

Review comment:
   both stored -> both store ?