[GitHub] [iceberg] shardulm94 commented on a change in pull request #1499: Update the Iceberg spec for row-level deletes

GitBox Fri, 25 Sep 2020 06:22:42 -0700


shardulm94 commented on a change in pull request #1499:
URL: https://github.com/apache/iceberg/pull/1499#discussion_r494750352




##########
File path: site/docs/spec.md
##########
@@ -416,25 +530,91 @@ Notes:
 
 ### Delete Formats
 
-This section details how to encode row-level deletes in Iceberg metadata. 
Row-level deletes are not supported in the current format version 1. This part 
of the spec is not yet complete and will be completed as format version 2.
+This section details how to encode row-level deletes in Iceberg delete files. 
Row-level deletes are not supported in v1.
+
+Row-level delete files are valid Iceberg data files: files must use valid 
Iceberg formats, schemas, and column projection. It is recommended that delete 
files are written using the table's default file format.
+
+Row-level delete files are tracked by manifests, like data files. A separate 
set of manifests is used for delete files, but the manifest schemas are 
identical.
 
-#### Position-based Delete Files
+Both position and equality deletes allow encoding deleted row values with a 
delete. This can be used to reconstruct a stream of changes to a table.
 
-Position-based delete files identify rows in one or more data files that have 
been deleted.
+
+#### Position Delete Files
+
+Position-based delete files identify deleted rows by file and position in one 
or more data files, and may optionally contain the deleted row.
+
+A data row is deleted if there is an entry in a position delete file for the 
row's file and position in the data file, starting at 0.
 
 Position-based delete files store `file_position_delete`, a struct with the 
following fields:
 
-| Field id, name          | Type                            | Description      
                                                                                
                        |
-|-------------------------|---------------------------------|--------------------------------------------------------------------------------------------------------------------------|
-| **`1  file_path`**     | `required string`               | The full URI of a 
data file with FS scheme. This must match the `file_path` of the target data 
file in a manifest entry.   |
-| **`2  position`**      | `required long`                 | The ordinal 
position of a deleted row in the target data file identified by `file_path`, 
starting at `0`.                    |
+| Field id, name              | Type                   | Description |
+|-----------------------------|------------------------|-------------|
+| **`2147483546  file_path`** | `string`               | Full URI or a data 
file with FS scheme. This must match the `file_path` of the target data file in 
a manifest entry |

Review comment:
       Typo: "Full URI or a data file" should read "Full URI `of` a data file".
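
       For illustration only (not part of the PR): a minimal Python sketch of the position-delete rule quoted in the hunk above, where a row is deleted if a position delete entry names its file and its 0-based position. The helper name `live_rows`, the in-memory list of `(file_path, pos)` tuples, and the example URIs are assumptions made for this sketch; a real reader would stream sorted delete files rather than build a set.

```python
from typing import Dict, Iterable, Iterator, List, Set, Tuple


def live_rows(
    file_path: str,
    rows: List[Dict[str, object]],
    position_deletes: Iterable[Tuple[str, int]],
) -> Iterator[Dict[str, object]]:
    """Yield the rows of one data file that are not position-deleted.

    `position_deletes` holds hypothetical (file_path, pos) entries, as they
    might be read from one or more position delete files.
    """
    # Keep only the delete entries whose file_path matches this data file.
    deleted: Set[int] = {pos for path, pos in position_deletes if path == file_path}
    # Positions are ordinal and start at 0, which matches enumerate().
    for pos, row in enumerate(rows):
        if pos not in deleted:
            yield row


deletes = [
    ("s3://bucket/a.parquet", 0),
    ("s3://bucket/a.parquet", 2),
    ("s3://bucket/b.parquet", 1),  # targets a different data file
]
rows = [{"id": 10}, {"id": 11}, {"id": 12}]
remaining = list(live_rows("s3://bucket/a.parquet", rows, deletes))
```

       Here only the row at position 1 survives, because the entry for `b.parquet` does not apply to `a.parquet`.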

##########
File path: site/docs/spec.md
##########
@@ -416,25 +530,91 @@ Notes:
 
 ### Delete Formats
 
-This section details how to encode row-level deletes in Iceberg metadata. 
Row-level deletes are not supported in the current format version 1. This part 
of the spec is not yet complete and will be completed as format version 2.
+This section details how to encode row-level deletes in Iceberg delete files. 
Row-level deletes are not supported in v1.
+
+Row-level delete files are valid Iceberg data files: files must use valid 
Iceberg formats, schemas, and column projection. It is recommended that delete 
files are written using the table's default file format.
+
+Row-level delete files are tracked by manifests, like data files. A separate 
set of manifests is used for delete files, but the manifest schemas are 
identical.
 
-#### Position-based Delete Files
+Both position and equality deletes allow encoding deleted row values with a 
delete. This can be used to reconstruct a stream of changes to a table.
 
-Position-based delete files identify rows in one or more data files that have 
been deleted.
+
+#### Position Delete Files
+
+Position-based delete files identify deleted rows by file and position in one 
or more data files, and may optionally contain the deleted row.
+
+A data row is deleted if there is an entry in a position delete file for the 
row's file and position in the data file, starting at 0.
 
 Position-based delete files store `file_position_delete`, a struct with the 
following fields:
 
-| Field id, name          | Type                            | Description      
                                                                                
                        |
-|-------------------------|---------------------------------|--------------------------------------------------------------------------------------------------------------------------|
-| **`1  file_path`**     | `required string`               | The full URI of a 
data file with FS scheme. This must match the `file_path` of the target data 
file in a manifest entry.   |
-| **`2  position`**      | `required long`                 | The ordinal 
position of a deleted row in the target data file identified by `file_path`, 
starting at `0`.                    |
+| Field id, name              | Type                   | Description |
+|-----------------------------|------------------------|-------------|
+| **`2147483546  file_path`** | `string`               | Full URI or a data 
file with FS scheme. This must match the `file_path` of the target data file in 
a manifest entry |
+| **`2147483545  pos`**       | `long`                 | Ordinal position of a 
deleted row in the target data file identified by `file_path`, starting at `0` |
+| **`2147483544  row`**       | `required struct<...>` | Deleted row values. 
Omit the column when not storing deleted rows. |
+
+When the deleted row is present, its schema may be any subset of the table 
schema. When the column is present, the deleted row values for every delete 
must be stored.
 
 The rows in the delete file must be sorted by `file_path` then `position` to 
optimize filtering rows while scanning. 
 
 *  Sorting by `file_path` allows filter pushdown by file in columnar storage 
formats.
 *  Sorting by `position` allows filtering rows while scanning, to avoid 
keeping deletes in memory.
 
-Though the delete files can be written using any supported data file format in 
Iceberg, it is recommended to write delete files with same file format as the 
table's file format.
+#### Equality Delete Files
+
+Equality delete files identify deleted rows in a collection of data files by 
one or more column values, and may optionally contain additional columns of the 
deleted row.
+
+Equality delete files store any subset of a table's columns and use the 
table's field ids. The _delete columns_ are the columns of the delete file used 
to match data rows. Delete columns are identified by id in the delete file 
[metadata column `equality_ids`](#manifests).
+
+A data row is deleted if its values are equal to all delete columns for any 
row in an equality delete file that applies to the row's data file (see [`Job 
Planning`](#job-planning)).

Review comment:
       Should this be linked to `#scan-planning`?
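
       For illustration only (not part of the PR): a minimal Python sketch of the equality-delete rule quoted above, where a data row is deleted if its values equal all delete columns of any row in an applicable equality delete file. The helper name `is_deleted` and the representation of rows as dicts keyed by field id are assumptions made for this sketch. It also assumes the delete rows passed in have already been scoped to the row's data file, and it does not address null handling, which the quoted text does not cover.

```python
from typing import Any, Dict, Iterable, List


def is_deleted(
    row: Dict[int, Any],
    delete_rows: Iterable[Dict[int, Any]],
    equality_ids: List[int],
) -> bool:
    """Return True if `row` matches any delete row on every delete column.

    Rows are hypothetical dicts keyed by table field id. `equality_ids`
    lists the field ids of the delete columns.
    """
    return any(
        # A delete row matches only if ALL delete columns are equal.
        all(row[field_id] == delete_row[field_id] for field_id in equality_ids)
        for delete_row in delete_rows
    )


# Two delete rows over delete columns with field ids 1 and 2.
deletes = [{1: 5, 2: "a"}, {1: 7, 2: "b"}]
matched = is_deleted({1: 5, 2: "a", 3: "x"}, deletes, [1, 2])
unmatched = is_deleted({1: 5, 2: "b", 3: "x"}, deletes, [1, 2])
```

       Columns outside `equality_ids` (field id 3 here) are ignored when matching, which is consistent with the delete file optionally carrying additional columns of the deleted row.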

##########
File path: site/docs/spec.md
##########
@@ -416,25 +530,91 @@ Notes:
 
 ### Delete Formats
 
-This section details how to encode row-level deletes in Iceberg metadata. 
Row-level deletes are not supported in the current format version 1. This part 
of the spec is not yet complete and will be completed as format version 2.
+This section details how to encode row-level deletes in Iceberg delete files. 
Row-level deletes are not supported in v1.
+
+Row-level delete files are valid Iceberg data files: files must use valid 
Iceberg formats, schemas, and column projection. It is recommended that delete 
files are written using the table's default file format.

Review comment:
       What is the reason for recommending that delete files use the same file format as the table data?




