rdsr commented on a change in pull request #1499:
URL: https://github.com/apache/iceberg/pull/1499#discussion_r494070093
##########
File path: site/docs/spec.md
##########
@@ -208,136 +256,202 @@ Notes:

 ### Manifests

-A manifest is an immutable Avro file that lists a set of data files, along with each file’s partition data tuple, metrics, and tracking information. One or more manifest files are used to store a snapshot, which tracks all of the files in a table at some point in time.
+A manifest is an immutable Avro file that lists data files or delete files, along with each file’s partition data tuple, metrics, and tracking information. One or more manifest files are used to store a [snapshot](#snapshots), which tracks all of the files in a table at some point in time. Manifests are tracked by a [manifest list](#manifest-lists) for each table snapshot.
+
+A manifest is a valid Iceberg data file: files must use valid Iceberg formats, schemas, and column projection.
+
+A manifest may store either data files or delete files, but not both because manifests that contain delete files are scanned first during job planning. Whether a manifest is a data manifest or a delete manifest is stored in manifest metadata.

-A manifest is a valid Iceberg data file. Files must use Iceberg schemas and column projection.
+A manifest stores files for a single partition spec. When a table’s partition spec changes, old files remain in the older manifest and newer files are written to a new manifest. This is required because a manifest file’s schema is based on its partition spec (see below). The partition spec of each manifest is also used to transform predicates on the table's data rows into predicates on partition values that are used during job planning to select files from a manifest.

-A manifest stores files for a single partition spec. When a table’s partition spec changes, old files remain in the older manifest and newer files are written to a new manifest. This is required because a manifest file’s schema is based on its partition spec (see below). This restriction also simplifies selecting files from a manifest because the same boolean expression can be used to select or filter all rows.
+A manifest file must store the partition spec and other metadata as properties in the Avro file's key-value metadata:

-The partition spec for a manifest and the current table schema must be stored in the key-value properties of the manifest file. The partition spec is stored as a JSON string under the key `partition-spec`. The table schema is stored as a JSON string under the key `schema`.
+
+| v1         | v2         | Key                 | Value |
+|------------|------------|---------------------|-------|
+| _required_ | _required_ | `schema`            | JSON representation of the table schema at the time the manifest was written |
+| _required_ | _required_ | `partition-spec`    | JSON fields representation of the partition spec used to write the manifest |
+| _optional_ | _required_ | `partition-spec-id` | Id of the partition spec used to write the manifest as a string |
+| _optional_ | _required_ | `format-version`    | Table format version number of the manifest as a string |
+|            | _required_ | `content`           | Type of content files tracked by the manifest: "data" or "deletes" |
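As a side note for implementers, the keys above are ordinary Avro file metadata. Below is a minimal sketch using the plain Avro API, not Iceberg's own manifest writer; the two-field `manifest_entry` schema, the file name, and the JSON strings are simplified placeholders.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.io.IOException;

public class ManifestMetadataSketch {
  public static void main(String[] args) throws IOException {
    // Stand-in for the real manifest_entry schema; only two fields for brevity.
    Schema manifestEntry = SchemaBuilder.record("manifest_entry").fields()
        .requiredInt("status")
        .requiredString("file_path")
        .endRecord();

    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(manifestEntry))) {
      // Key-value metadata must be set before create(); the JSON strings are placeholders.
      writer.setMeta("schema", "{\"type\": \"struct\", \"fields\": []}");
      writer.setMeta("partition-spec", "[]");
      writer.setMeta("partition-spec-id", "0");
      writer.setMeta("format-version", "2");
      writer.setMeta("content", "data"); // or "deletes" for a delete manifest
      writer.create(manifestEntry, new File("manifest.avro"));
      // ... append manifest_entry records here ...
    }
  }
}
```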
 The schema of a manifest file is a struct called `manifest_entry` with the following fields:

-| Field id, name      | Type                                                      | Description |
-|---------------------|-----------------------------------------------------------|-------------|
-| **`0 status`**      | `int` with meaning: `0: EXISTING` `1: ADDED` `2: DELETED` | Used to track additions and deletions |
-| **`1 snapshot_id`** | `long`                                                    | Snapshot id where the file was added, or deleted if status is 2 |
-| **`2 data_file`**   | `data_file` `struct` (see below)                          | File path, partition tuple, metrics, ... |
+| v1         | v2         | Field id, name          | Type                                                      | Description |
+| ---------- | ---------- |-------------------------|-----------------------------------------------------------|-------------|
+| _required_ | _required_ | **`0 status`**          | `int` with meaning: `0: EXISTING` `1: ADDED` `2: DELETED` | Used to track additions and deletions |
+| _required_ | _optional_ | **`1 snapshot_id`**     | `long`                                                    | Snapshot id where the file was added, or deleted if status is 2. Inherited when null. |
+|            | _optional_ | **`3 sequence_number`** | `long`                                                    | Sequence number when the file was added. Inherited when null. |
+| _required_ | _required_ | **`2 data_file`**       | `data_file` `struct` (see below)                          | File path, partition tuple, metrics, ... |

 `data_file` is a struct with the following fields:

-| Field id, name                    | Type                                  | Description |
-|-----------------------------------|---------------------------------------|-------------|
-| **`100 file_path`**               | `string`                              | Full URI for the file with FS scheme |
-| **`101 file_format`**             | `string`                              | String file format name, avro, orc or parquet |
-| **`102 partition`**               | `struct<...>`                         | Partition data tuple, schema based on the partition spec |
-| **`103 record_count`**            | `long`                                | Number of records in this file |
-| **`104 file_size_in_bytes`**      | `long`                                | Total file size in bytes |
-| ~~**`105 block_size_in_bytes`**~~ | `long`                                | **Deprecated. Always write a default value and do not read.** |
-| ~~**`106 file_ordinal`**~~        | `optional int`                        | **Deprecated. Do not use.** |
-| ~~**`107 sort_columns`**~~        | `optional list`                       | **Deprecated. Do not use.** |
-| **`108 column_sizes`**            | `optional map`                        | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro). |
-| **`109 value_counts`**            | `optional map`                        | Map from column id to number of values in the column (including null values) |
-| **`110 null_value_counts`**       | `optional map`                        | Map from column id to number of null values in the column |
-| ~~**`111 distinct_counts`**~~     | `optional map`                        | **Deprecated. Do not use.** |
-| **`125 lower_bounds`**            | `optional map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all values in the column for the file. |
-| **`128 upper_bounds`**            | `optional map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all values in the column for the file. |
-| **`131 key_metadata`**            | `optional binary`                     | Implementation-specific key metadata for encryption |
-| **`132 split_offsets`**           | `optional list`                       | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending. |
+| v1         | v2         | Field id, name                    | Type                         | Description |
+| ---------- | ---------- |-----------------------------------|------------------------------|-------------|
+|            | _required_ | **`134 content`**                 | `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) |
+| _required_ | _required_ | **`100 file_path`**               | `string`                     | Full URI for the file with FS scheme |
+| _required_ | _required_ | **`101 file_format`**             | `string`                     | String file format name, avro, orc or parquet |
+| _required_ | _required_ | **`102 partition`**               | `struct<...>`                | Partition data tuple, schema based on the partition spec |
+| _required_ | _required_ | **`103 record_count`**            | `long`                       | Number of records in this file |
+| _required_ | _required_ | **`104 file_size_in_bytes`**      | `long`                       | Total file size in bytes |
+| _required_ |            | ~~**`105 block_size_in_bytes`**~~ | `long`                       | **Deprecated. Always write a default in v1. Do not write in v2.** |
+| _optional_ |            | ~~**`106 file_ordinal`**~~        | `int`                        | **Deprecated. Do not write.** |
+| _optional_ |            | ~~**`107 sort_columns`**~~        | `list<112: int>`             | **Deprecated. Do not write.** |
+| _optional_ | _optional_ | **`108 column_sizes`**            | `map<117: int, 118: long>`   | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro) |
+| _optional_ | _optional_ | **`109 value_counts`**            | `map<119: int, 120: long>`   | Map from column id to number of values in the column (including null values) |
+| _optional_ | _optional_ | **`110 null_value_counts`**       | `map<121: int, 122: long>`   | Map from column id to number of null values in the column |
+| _optional_ |            | ~~**`111 distinct_counts`**~~     | `map<123: int, 124: long>`   | **Deprecated. Do not write.** |
+| _optional_ | _optional_ | **`125 lower_bounds`**            | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all values in the column for the file |
+| _optional_ | _optional_ | **`128 upper_bounds`**            | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all values in the column for the file |
+| _optional_ | _optional_ | **`131 key_metadata`**            | `binary`                     | Implementation-specific key metadata for encryption |
+| _optional_ | _optional_ | **`132 split_offsets`**           | `list<133: long>`            | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending |
+|            | _optional_ | **`135 equality_ids`**            | `list<136: int>`             | Field ids used to determine row equality in equality delete files. Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file |

 Notes:

 1. Single-value serialization for lower and upper bounds is detailed in Appendix D.

-The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec for the manifest file.
+The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec used to write the manifest file. In v2, the partition struct's field ids must match the ids from the partition spec.

-Each manifest file must store its partition spec and the current table schema in the Avro file’s key-value metadata. The partition spec is used to transform predicates on the table’s data rows into predicates on the manifest’s partition values during job planning.
+The column metrics maps are used when filtering to select both data and delete files. For delete files, the metrics must store bounds and counts for all deleted rows, or must be omitted. Storing metrics for deleted rows ensures that the values can be used during job planning to find delete files that must be merged during a scan.
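To make the use of column metrics concrete, here is an illustrative sketch, not the Iceberg evaluator, of filtering a data or delete file with an equality predicate. It assumes the binary `lower_bounds`/`upper_bounds` values have already been decoded to longs; the field id and bounds are made up.

```java
import java.util.Map;

public class MetricsFilterSketch {
  // Returns true if the file may contain a matching row for an equality predicate
  // on the given field id; false means the file can be skipped during planning.
  static boolean mayContain(Map<Integer, Long> lowerBounds, Map<Integer, Long> upperBounds,
                            int fieldId, long value) {
    Long lower = lowerBounds.get(fieldId);
    Long upper = upperBounds.get(fieldId);
    if (lower == null || upper == null) {
      return true;  // no bounds recorded: cannot prove the file has no matches
    }
    return lower <= value && value <= upper;
  }

  public static void main(String[] args) {
    Map<Integer, Long> lower = Map.of(1, 100L);  // field id 1: min value in the file
    Map<Integer, Long> upper = Map.of(1, 200L);  // field id 1: max value in the file
    System.out.println(mayContain(lower, upper, 1, 42L));   // false: skip this file
    System.out.println(mayContain(lower, upper, 1, 150L));  // true: must read this file
  }
}
```

The same check applies to a delete file whose metrics describe its deleted rows, which is how planning decides whether the delete file must be merged into a scan.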
 #### Manifest Entry Fields

 The manifest entry fields are used to keep track of the snapshot in which files were added or logically deleted. The `data_file` struct is nested inside of the manifest entry so that it can be easily passed to job planning without the manifest entry fields.

-When a data file is added to the dataset, it’s manifest entry should store the snapshot ID in which the file was added and set status to 1 (added).
+When a file is added to the dataset, its manifest entry should store the snapshot ID in which the file was added and set status to 1 (added).

-When a data file is replaced or deleted from the dataset, it’s manifest entry fields store the snapshot ID in which the file was deleted and status 2 (deleted). The file may be deleted from the file system when the snapshot in which it was deleted is garbage collected, assuming that older snapshots have also been garbage collected [1].
+When a file is replaced or deleted from the dataset, its manifest entry fields store the snapshot ID in which the file was deleted and status 2 (deleted). The file may be deleted from the file system when the snapshot in which it was deleted is garbage collected, assuming that older snapshots have also been garbage collected [1].
+
+Iceberg v2 adds a sequence number to the entry and makes the snapshot id optional. Both fields, `sequence_number` and `snapshot_id`, are inherited from manifest metadata when `null`. That is, if the field is `null` for an entry, then the entry must inherit its value from the manifest file's metadata, stored in the manifest list [2].

 Notes:

 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
+2. Manifest list files are required in v2, so that the `sequence_number` and `snapshot_id` to inherit are always available.
+
+#### Sequence Number Inheritance
+
+Manifests track the sequence number when a data or delete file was added to the table.
+
+When adding a new file, its sequence number is set to `null` because the snapshot's sequence number is not assigned until the snapshot is successfully committed. When reading, sequence numbers are inherited by replacing `null` with the manifest's sequence number from the manifest list.
+
+When writing an existing file to a new manifest, the sequence number must be non-null and set to the sequence number that was inherited.
+
+Inheriting sequence numbers through the metadata tree allows writing a new manifest without a known sequence number, so that a manifest can be written once and reused in commit retries. To change a sequence number for a retry, only the manifest list must be rewritten.
+
+When reading v1 manifests with no sequence number column, sequence numbers for all files must default to 0.
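A minimal sketch of the inheritance rule described above, in plain Java rather than Iceberg's reader classes, assuming already-decoded entry and manifest-list values:

```java
public class SequenceNumberInheritanceSketch {
  // Entry values may be null in the manifest; the manifest-level values come from
  // the manifest list (and default to 0 when reading v1 metadata).
  static long inheritSequenceNumber(Long entrySequenceNumber, long manifestSequenceNumber) {
    return entrySequenceNumber != null ? entrySequenceNumber : manifestSequenceNumber;
  }

  static long inheritSnapshotId(Long entrySnapshotId, long manifestSnapshotId) {
    return entrySnapshotId != null ? entrySnapshotId : manifestSnapshotId;
  }

  public static void main(String[] args) {
    // A newly added file is written with null and picks up the manifest's number.
    System.out.println(inheritSequenceNumber(null, 7L)); // 7
    // An existing file rewritten into a new manifest keeps its explicit number.
    System.out.println(inheritSequenceNumber(3L, 7L));   // 3
  }
}
```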
 ### Snapshots

 A snapshot consists of the following fields:

-* **`snapshot-id`** -- A unique long ID.
-* **`parent-snapshot-id`** -- (Optional) The snapshot ID of the snapshot’s parent. This field is not present for snapshots that have no parent snapshot, such as snapshots created before this field was added or the first snapshot of a table.
-* **`sequence-number`** -- A monotonically increasing long that tracks the order of snapshots in a table. (**v2 only**)
-* **`timestamp-ms`** -- A timestamp when the snapshot was created. This is used when garbage collecting snapshots.
-* **`manifests`** -- A list of manifest file locations. The data files in a snapshot are the union of all data files listed in these manifests. (Deprecated in favor of `manifest-list`)
-* **`manifest-list`** -- (Optional) The location of a manifest list file for this snapshot, which contains a list of manifest files with additional metadata. If present, the manifests field must be omitted.
-* **`summary`** -- (Optional) A summary that encodes the `operation` that produced the snapshot and other relevant information specific to that operation. This allows some operations like snapshot expiration to skip processing some snapshots. Possible values of `operation` are:
-  * `append` -- Data files were added and no files were removed.
-  * `replace` -- Data files were rewritten with the same data; i.e., compaction, changing the data file format, or relocating data files.
-  * `overwrite` -- Data files were deleted and added in a logical overwrite operation.
-  * `delete` -- Data files were removed and their contents logically deleted.
+| v1         | v2         | Field                    | Description |
+| ---------- | ---------- | ------------------------ | ----------- |
+| _required_ | _required_ | **`snapshot-id`**        | A unique long ID |
+| _optional_ | _optional_ | **`parent-snapshot-id`** | The snapshot ID of the snapshot's parent. Omitted for any snapshot with no parent |
+|            | _required_ | **`sequence-number`**    | A monotonically increasing long that tracks the order of changes to a table |
+| _required_ | _required_ | **`timestamp-ms`**       | A timestamp when the snapshot was created, used for garbage collection and table inspection |
+| _optional_ | _required_ | **`manifest-list`**      | The location of a manifest list for this snapshot that tracks manifest files with additional metadata |
+| _optional_ |            | **`manifests`**          | A list of manifest file locations. Must be omitted if `manifest-list` is present |
+| _optional_ | _required_ | **`summary`**            | A string map that summarizes the snapshot changes, including `operation` (see below) |

-Snapshots can be split across more than one manifest. This enables:
+The snapshot summary's `operation` field is used by some operations, like snapshot expiration, to skip processing certain snapshots. Possible `operation` values are:
+
+* `append` -- Only data files were added and no files were removed.
+* `replace` -- Data and delete files were added and removed without changing table data; i.e., compaction, changing the data file format, or relocating data files.
+* `overwrite` -- Data and delete files were added and removed in a logical overwrite operation.
+* `delete` -- Data files were removed and their contents logically deleted and/or delete files were added to delete rows.
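As an illustration of how the `operation` value can let maintenance skip work, here is a hypothetical sketch; `mayHaveRemovedFiles` is not an Iceberg API, just an example predicate an expiration routine might apply to the summary map.

```java
import java.util.Map;

public class SnapshotSummarySketch {
  // An "append" snapshot only adds data files, so expiration logic does not need
  // to look for logically deleted files in its manifests. Any other operation,
  // or a missing one, is handled conservatively.
  static boolean mayHaveRemovedFiles(Map<String, String> summary) {
    return !"append".equals(summary.get("operation"));
  }

  public static void main(String[] args) {
    System.out.println(mayHaveRemovedFiles(Map.of("operation", "append")));    // false
    System.out.println(mayHaveRemovedFiles(Map.of("operation", "overwrite"))); // true
  }
}
```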
+
+Data and delete files for a snapshot can be stored in more than one manifest. This enables:

 * Appends can add a new manifest to minimize the amount of data written, instead of adding new records by rewriting and appending to an existing manifest. (This is called a “fast append”.)
 * Tables can use multiple partition specs. A table’s partition configuration can evolve if, for example, its data volume changes. Each manifest uses a single partition spec, and queries do not need to change because partition filters are derived from data predicates.
 * Large tables can be split across multiple manifests so that implementations can parallelize job planning or reduce the cost of rewriting a manifest.

+Manifests for a snapshot are tracked by a manifest list.
+
 Valid snapshots are stored as a list in table metadata. For serialization, see Appendix C.

+#### Manifest Lists
+
+Snapshots are embedded in table metadata, but the list of manifests for a snapshot is stored in a separate manifest list file.
+
+A new manifest list is written for each attempt to commit a snapshot because the list of manifests always changes to produce a new snapshot. When a manifest list is written, the (optimistic) sequence number of the snapshot is written for all new manifest files tracked by the list.
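To illustrate this design choice, a hypothetical sketch of a retried commit: the manifest written for the first attempt is reused, and only the manifest list entries carry the new sequence number. The paths and numbers are made up.

```java
import java.util.List;

public class CommitRetrySketch {
  // New manifests are written once with inheritable (null) sequence numbers; each
  // commit attempt writes a manifest list that assigns that attempt's number.
  static void writeManifestList(List<String> newManifests, long attemptSequenceNumber) {
    for (String manifestPath : newManifests) {
      // In a real writer this would be a manifest_file entry; here we just print it.
      System.out.println(manifestPath + " -> sequence_number " + attemptSequenceNumber);
    }
  }

  public static void main(String[] args) {
    List<String> manifests = List.of("s3://bucket/metadata/new-manifest.avro");
    writeManifestList(manifests, 10);  // first attempt
    writeManifestList(manifests, 11);  // retry: same manifest file, new manifest list
  }
}
```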
+
+A manifest list includes summary metadata that can be used to avoid scanning all of the manifests in a snapshot when planning a table scan. This includes the number of added, existing, and deleted files, and a summary of values for each field of the partition spec used to write the manifest.
+
+A manifest list is a valid Iceberg data file: files must use valid Iceberg formats, schemas, and column projection.
+
+Manifest list files store `manifest_file`, a struct with the following fields:
+
+| v1         | v2         | Field id, name                 | Type                                        | Description |
+| ---------- | ---------- |--------------------------------|---------------------------------------------|-------------|
+| _required_ | _required_ | **`500 manifest_path`**        | `string`                                    | Location of the manifest file |
+| _required_ | _required_ | **`501 manifest_length`**      | `long`                                      | Length of the manifest file |
+| _required_ | _required_ | **`502 partition_spec_id`**    | `int`                                       | ID of a partition spec for the table; must be listed in table metadata `partition-specs` |
+|            | _required_ | **`517 content`**              | `int` with meaning: `0: data`, `1: deletes` | The type of files tracked by the manifest, either data or delete files; 0 for all v1 manifests |
+|            | _required_ | **`515 sequence_number`**      | `long`                                      | The sequence number when the manifest was added to the table; use 0 when reading v1 manifest lists |
+|            | _required_ | **`516 min_sequence_number`**  | `long`                                      | The minimum sequence number of all data or delete files in the manifest; use 0 when reading v1 manifest lists |
+| _required_ | _required_ | **`503 added_snapshot_id`**    | `long`                                      | ID of the snapshot where the manifest file was added |
+| _optional_ | _required_ | **`504 added_files_count`**    | `int`                                       | Number of entries in the manifest that have status `ADDED` (1), when `null` this is assumed to be non-zero |
+| _optional_ | _required_ | **`505 existing_files_count`** | `int`                                       | Number of entries in the manifest that have status `EXISTING` (0), when `null` this is assumed to be non-zero |
+| _optional_ | _required_ | **`506 deleted_files_count`**  | `int`                                       | Number of entries in the manifest that have status `DELETED` (2), when `null` this is assumed to be non-zero |
+| _optional_ | _required_ | **`512 added_rows_count`**     | `long`                                      | Number of rows in all of files in the manifest that have status `ADDED`, when `null` this is assumed to be non-zero |
+| _optional_ | _required_ | **`513 existing_rows_count`**  | `long`                                      | Number of rows in all of files in the manifest that have status `EXISTING`, when `null` this is assumed to be non-zero |
+| _optional_ | _required_ | **`514 deleted_rows_count`**   | `long`                                      | Number of rows in all of files in the manifest that have status `DELETED`, when `null` this is assumed to be non-zero |
+| _optional_ | _optional_ | **`507 partitions`**           | `list<508: field_summary>` (see below)      | A list of field summaries for each partition field in the spec. Each field in the list corresponds to a field in the manifest file’s partition spec. |
+
+`field_summary` is a struct with the following fields:
+
+| v1         | v2         | Field id, name          | Type        | Description |
+| ---------- | ---------- |-------------------------|-------------|-------------|
+| _required_ | _required_ | **`509 contains_null`** | `boolean`   | Whether the manifest contains at least one partition with a null value for the field |
+| _optional_ | _optional_ | **`510 lower_bound`**   | `bytes` [1] | Lower bound for the non-null values in the partition field, or null if all values are null |
+| _optional_ | _optional_ | **`511 upper_bound`**   | `bytes` [1] | Upper bound for the non-null values in the partition field, or null if all values are null |
+
+Notes:
+
+1. Lower and upper bounds are serialized to bytes using the single-object serialization in Appendix D. The type used to encode the value is the type of the partition field data.
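Here is an illustrative sketch, not the Iceberg planner, of skipping a manifest using the entry counts and one partition `field_summary` for a predicate like `ts_day >= X`; it assumes decoded integer bounds, and all values are made up.

```java
public class ManifestFilterSketch {
  // Decide whether a manifest must be read for a predicate ts_day >= predicateLower,
  // given manifest_file counts and the decoded field_summary upper bound for ts_day.
  static boolean scanManifest(Integer addedFilesCount, Integer existingFilesCount,
                              Integer upperBound, int predicateLower) {
    // Null counts are assumed to be non-zero, so only skip when both are known to be 0.
    if (addedFilesCount != null && existingFilesCount != null
        && addedFilesCount == 0 && existingFilesCount == 0) {
      return false; // no live entries in this manifest
    }
    if (upperBound == null) {
      return false; // all partition values for the field are null; ts_day >= X cannot match
    }
    return upperBound >= predicateLower; // some partition in the manifest may match
  }

  public static void main(String[] args) {
    System.out.println(scanManifest(0, 0, 18600, 18550));    // false: only deleted entries
    System.out.println(scanManifest(3, null, 18600, 18650)); // false: bounds rule it out
    System.out.println(scanManifest(3, null, 18600, 18550)); // true: must read the manifest
  }
}
```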

 #### Scan Planning

-Scans are planned by reading the manifest files for the current snapshot listed in the table metadata. Deleted entries in a manifest are not included in the scan.
+Scans are planned by reading the manifest files for the current snapshot. Deleted entries in data and delete manifests are not used in a scan.
+
+Manifests that contain no matching files, determined using either file counts or partition summaries, may be skipped.

-For each manifest, scan predicates, which filter data rows, are converted to partition predicates, which filter data files. These partition predicates are used to select the data files in the manifest. This conversion uses the partition spec used to write the manifest file.
+For each manifest, scan predicates, which filter data rows, are converted to partition predicates, which filter data and delete files. These partition predicates are used to select the data and delete files in the manifest. This conversion uses the partition spec used to write the manifest file.

-Scan predicates are converted to partition predicates using an inclusive projection: if a scan predicate matches a row, then the partition predicate must match that row’s partition. This is an _inclusive projection_ [1] because rows that do not match the scan predicate may be included in the scan by the partition predicate.
+Scan predicates are converted to partition predicates using an _inclusive projection_: if a scan predicate matches a row, then the partition predicate must match that row’s partition. This is called _inclusive_ [1] because rows that do not match the scan predicate may be included in the scan by the partition predicate.

 For example, an `events` table with a timestamp column named `ts` that is partitioned by `ts_day=day(ts)` is queried by users with ranges over the timestamp column: `ts > X`. The inclusive projection is `ts_day >= day(X)`, which is used to select files that may have matching rows. Note that, in most cases, timestamps just before `X` will be included in the scan because the file contains rows that match the predicate and rows that do not match the predicate.
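A small sketch of the projection in this example, using illustrative names rather than the Iceberg expression API:

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

public class InclusiveProjectionSketch {
  // day(ts): days from the unix epoch, the transform behind the ts_day partition field.
  static long day(Instant ts) {
    return LocalDate.ofInstant(ts, ZoneOffset.UTC).toEpochDay();
  }

  public static void main(String[] args) {
    Instant x = Instant.parse("2020-09-23T10:15:30Z");
    // The row filter ts > X projects to the partition filter ts_day >= day(X).
    // Rows with ts just before X can share a day (and a file) with matching rows,
    // which is why the projection is inclusive rather than exact.
    System.out.println("ts > " + x + "  =>  ts_day >= " + day(x));
  }
}
```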
-Notes:
+Scan predicates are also used to filter data and delete files using column bounds and counts that are stored by field id in manifests. The same filter logic can be used for both data and delete files because both stored metrics of the rows either inserted or deleted. If metrics show that a delete file has no rows that match a scan predicate, it may be ignored just as a data file would be ignored [2].

Review comment:
   both stored -> both store ?