rdblue commented on a change in pull request #2762:
URL: https://github.com/apache/iceberg/pull/2762#discussion_r664075720
##########
File path: site/docs/spec.md
##########
@@ -979,30 +979,88 @@ This serialization scheme is for storing single values as
individual binary valu
## Format version changes
+Some tables in this spec have columns that specify requirements for v1 and v2
tables. These requirements are intended for writers when adding metadata files
to a table with the given version.
+
+| Requirement | Write behavior |
+|-------------|----------------|
+| (blank) | The field should be omitted |
+| _optional_ | The field can be written |
+| _required_ | The field must be written |
+
+Readers should be more permissive because v1 metadata files are allowed in v2
tables so that tables can be upgraded to v2 without rewriting the metadata
tree. For manifest list and manifest files, this table shows the expected v2
read behavior:
+
+| v1 | v2 | v2 read behavior |
+|------------|------------|------------------|
+| | _optional_ | Read the field as _optional_ |
+| | _required_ | Read the field as _optional_; it may be missing in
v1 files |
+| _optional_ | | Ignore the field |
+| _optional_ | _optional_ | Read the field as _optional_ |
+| _optional_ | _required_ | Read the field as _optional_; it may be missing in
v1 files |
+| _required_ | | Ignore the field |
+| _required_ | _optional_ | Read the field as _optional_ |
+| _required_ | _required_ | Fill in a default or throw an exception if the
field is missing |
+
+Readers may be more strict for metadata JSON files because a v1 JSON file will
not appear in a v2 table. Required v2 fields that were not present in v1 or
optional in v1 may be handled as required fields.
Review comment:
What I'm trying to say here is that the metadata.json file is written
for a particular version, with the version embedded in the file. That means a
reader can more strictly enforce v2 requirements. If the json file was written
by a v2 writer but doesn't have the `last-sequence-number` then that should be
an error.
##########
File path: site/docs/spec.md
##########
@@ -979,30 +979,88 @@ This serialization scheme is for storing single values as
individual binary valu
## Format version changes
+Some tables in this spec have columns that specify requirements for v1 and v2
tables. These requirements are intended for writers when adding metadata files
to a table with the given version.
+
+| Requirement | Write behavior |
+|-------------|----------------|
+| (blank) | The field should be omitted |
+| _optional_ | The field can be written |
+| _required_ | The field must be written |
+
+Readers should be more permissive because v1 metadata files are allowed in v2
tables so that tables can be upgraded to v2 without rewriting the metadata
tree. For manifest list and manifest files, this table shows the expected v2
read behavior:
+
+| v1 | v2 | v2 read behavior |
+|------------|------------|------------------|
+| | _optional_ | Read the field as _optional_ |
+| | _required_ | Read the field as _optional_; it may be missing in
v1 files |
+| _optional_ | | Ignore the field |
+| _optional_ | _optional_ | Read the field as _optional_ |
+| _optional_ | _required_ | Read the field as _optional_; it may be missing in
v1 files |
+| _required_ | | Ignore the field |
+| _required_ | _optional_ | Read the field as _optional_ |
+| _required_ | _required_ | Fill in a default or throw an exception if the
field is missing |
+
+Readers may be more strict for metadata JSON files because a v1 JSON file will
not appear in a v2 table. Required v2 fields that were not present in v1 or
optional in v1 may be handled as required fields.
+
### Version 2
Writing v1 metadata:
-* Table metadata field `last-sequence-number` should not be written.
-* Snapshot field `sequence-number` should not be written.
+* Table metadata field `last-sequence-number` should not be written
+* Snapshot field `sequence-number` should not be written
+* Manifest entry field `sequence_number` should not be written
Review comment:
Yes, I think you're right. Those fields should also not be written to v1
tables.
##########
File path: site/docs/spec.md
##########
@@ -979,30 +979,88 @@ This serialization scheme is for storing single values as
individual binary valu
## Format version changes
+Some tables in this spec have columns that specify requirements for v1 and v2
tables. These requirements are intended for writers when adding metadata files
to a table with the given version.
+
+| Requirement | Write behavior |
+|-------------|----------------|
+| (blank) | The field should be omitted |
+| _optional_ | The field can be written |
+| _required_ | The field must be written |
+
+Readers should be more permissive because v1 metadata files are allowed in v2
tables so that tables can be upgraded to v2 without rewriting the metadata
tree. For manifest list and manifest files, this table shows the expected v2
read behavior:
+
+| v1 | v2 | v2 read behavior |
+|------------|------------|------------------|
+| | _optional_ | Read the field as _optional_ |
+| | _required_ | Read the field as _optional_; it may be missing in
v1 files |
+| _optional_ | | Ignore the field |
+| _optional_ | _optional_ | Read the field as _optional_ |
+| _optional_ | _required_ | Read the field as _optional_; it may be missing in
v1 files |
+| _required_ | | Ignore the field |
+| _required_ | _optional_ | Read the field as _optional_ |
+| _required_ | _required_ | Fill in a default or throw an exception if the
field is missing |
+
+Readers may be more strict for metadata JSON files because a v1 JSON file will
not appear in a v2 table. Required v2 fields that were not present in v1 or
optional in v1 may be handled as required fields.
+
### Version 2
Writing v1 metadata:
-* Table metadata field `last-sequence-number` should not be written.
-* Snapshot field `sequence-number` should not be written.
+* Table metadata field `last-sequence-number` should not be written
+* Snapshot field `sequence-number` should not be written
+* Manifest entry field `sequence_number` should not be written
-Reading v1 metadata:
+Reading v1 metadata for v2:
-* Table metadata field `last-sequence-number` must default to 0.
-* Snapshot field `sequence-number` must default to 0.
+* Table metadata field `last-sequence-number` must default to 0
Review comment:
That's effectively what we do. Any v1 metadata should default the
sequence number to 0, which is when the table was upgraded from v1 to v2.
##########
File path: site/docs/spec.md
##########
@@ -979,30 +979,88 @@ This serialization scheme is for storing single values as
individual binary valu
## Format version changes
+Some tables in this spec have columns that specify requirements for v1 and v2
tables. These requirements are intended for writers when adding metadata files
to a table with the given version.
+
+| Requirement | Write behavior |
+|-------------|----------------|
+| (blank) | The field should be omitted |
+| _optional_ | The field can be written |
+| _required_ | The field must be written |
+
+Readers should be more permissive because v1 metadata files are allowed in v2
tables so that tables can be upgraded to v2 without rewriting the metadata
tree. For manifest list and manifest files, this table shows the expected v2
read behavior:
+
+| v1 | v2 | v2 read behavior |
+|------------|------------|------------------|
+| | _optional_ | Read the field as _optional_ |
+| | _required_ | Read the field as _optional_; it may be missing in
v1 files |
+| _optional_ | | Ignore the field |
+| _optional_ | _optional_ | Read the field as _optional_ |
+| _optional_ | _required_ | Read the field as _optional_; it may be missing in
v1 files |
+| _required_ | | Ignore the field |
+| _required_ | _optional_ | Read the field as _optional_ |
+| _required_ | _required_ | Fill in a default or throw an exception if the
field is missing |
+
+Readers may be more strict for metadata JSON files because a v1 JSON file will
not appear in a v2 table. Required v2 fields that were not present in v1 or
optional in v1 may be handled as required fields.
+
### Version 2
Writing v1 metadata:
-* Table metadata field `last-sequence-number` should not be written.
-* Snapshot field `sequence-number` should not be written.
+* Table metadata field `last-sequence-number` should not be written
+* Snapshot field `sequence-number` should not be written
+* Manifest entry field `sequence_number` should not be written
-Reading v1 metadata:
+Reading v1 metadata for v2:
-* Table metadata field `last-sequence-number` must default to 0.
-* Snapshot field `sequence-number` must default to 0.
+* Table metadata field `last-sequence-number` must default to 0
+* Snapshot field `sequence-number` must default to 0
+* Manifest list field `content` must default to 0 (data)
+* Data file field `content` must default to 0 (data)
Writing v2 metadata:
-* Table metadata added required field `last-sequence-number`.
-* Table metadata now requires field `table-uuid`.
-* Table metadata now requires field `partition-specs`.
-* Table metadata now requires field `default-spec-id`.
-* Table metadata now requires field `last-partition-id`.
-* Table metadata field `partition-spec` is no longer required and may be
omitted.
-* Snapshot added required field `sequence-number`.
-* Snapshot now requires field `manifest-list`.
-* Snapshot field `manifests` is no longer allowed.
-* Table metadata now requires field `sort-orders`.
-* Table metadata now requires field `default-sort-order-id`.
+* Table metadata JSON:
+ * `last-sequence-number` was added and is required; default to 0 when
reading v1 metadata
Review comment:
Yes, but I don't want anyone to miss that.
##########
File path: site/docs/spec.md
##########
@@ -979,30 +979,88 @@ This serialization scheme is for storing single values as
individual binary valu
## Format version changes
+Some tables in this spec have columns that specify requirements for v1 and v2
tables. These requirements are intended for writers when adding metadata files
to a table with the given version.
+
+| Requirement | Write behavior |
+|-------------|----------------|
+| (blank) | The field should be omitted |
+| _optional_ | The field can be written |
+| _required_ | The field must be written |
+
+Readers should be more permissive because v1 metadata files are allowed in v2
tables so that tables can be upgraded to v2 without rewriting the metadata
tree. For manifest list and manifest files, this table shows the expected v2
read behavior:
+
+| v1 | v2 | v2 read behavior |
+|------------|------------|------------------|
+| | _optional_ | Read the field as _optional_ |
+| | _required_ | Read the field as _optional_; it may be missing in
v1 files |
+| _optional_ | | Ignore the field |
+| _optional_ | _optional_ | Read the field as _optional_ |
+| _optional_ | _required_ | Read the field as _optional_; it may be missing in
v1 files |
+| _required_ | | Ignore the field |
+| _required_ | _optional_ | Read the field as _optional_ |
+| _required_ | _required_ | Fill in a default or throw an exception if the
field is missing |
+
+Readers may be more strict for metadata JSON files because a v1 JSON file will
not appear in a v2 table. Required v2 fields that were not present in v1 or
optional in v1 may be handled as required fields.
+
### Version 2
Writing v1 metadata:
-* Table metadata field `last-sequence-number` should not be written.
-* Snapshot field `sequence-number` should not be written.
+* Table metadata field `last-sequence-number` should not be written
+* Snapshot field `sequence-number` should not be written
+* Manifest entry field `sequence_number` should not be written
-Reading v1 metadata:
+Reading v1 metadata for v2:
-* Table metadata field `last-sequence-number` must default to 0.
-* Snapshot field `sequence-number` must default to 0.
+* Table metadata field `last-sequence-number` must default to 0
+* Snapshot field `sequence-number` must default to 0
+* Manifest list field `content` must default to 0 (data)
+* Data file field `content` must default to 0 (data)
Writing v2 metadata:
-* Table metadata added required field `last-sequence-number`.
-* Table metadata now requires field `table-uuid`.
-* Table metadata now requires field `partition-specs`.
-* Table metadata now requires field `default-spec-id`.
-* Table metadata now requires field `last-partition-id`.
-* Table metadata field `partition-spec` is no longer required and may be
omitted.
-* Snapshot added required field `sequence-number`.
-* Snapshot now requires field `manifest-list`.
-* Snapshot field `manifests` is no longer allowed.
-* Table metadata now requires field `sort-orders`.
-* Table metadata now requires field `default-sort-order-id`.
+* Table metadata JSON:
+ * `last-sequence-number` was added and is required; default to 0 when
reading v1 metadata
+ * `table-uuid` is now required
+ * `current-schema-id` is now required
+ * `schemas` is now required
+ * `partition-specs` is now required
+ * `default-spec-id` is now required
+ * `last-partition-id` is now required
+ * `sort-orders` is now required
+ * `default-sort-order-id` is now required
Review comment:
Yes, I think to be safe we would want v2 writers to have at least the
unsorted order with id 0, but to read we would just assume that it exists.
##########
File path: site/docs/spec.md
##########
@@ -979,30 +979,88 @@ This serialization scheme is for storing single values as
individual binary valu
## Format version changes
+Some tables in this spec have columns that specify requirements for v1 and v2
tables. These requirements are intended for writers when adding metadata files
to a table with the given version.
+
+| Requirement | Write behavior |
+|-------------|----------------|
+| (blank) | The field should be omitted |
+| _optional_ | The field can be written |
+| _required_ | The field must be written |
+
+Readers should be more permissive because v1 metadata files are allowed in v2
tables so that tables can be upgraded to v2 without rewriting the metadata
tree. For manifest list and manifest files, this table shows the expected v2
read behavior:
+
+| v1 | v2 | v2 read behavior |
+|------------|------------|------------------|
+| | _optional_ | Read the field as _optional_ |
+| | _required_ | Read the field as _optional_; it may be missing in
v1 files |
+| _optional_ | | Ignore the field |
+| _optional_ | _optional_ | Read the field as _optional_ |
+| _optional_ | _required_ | Read the field as _optional_; it may be missing in
v1 files |
+| _required_ | | Ignore the field |
+| _required_ | _optional_ | Read the field as _optional_ |
+| _required_ | _required_ | Fill in a default or throw an exception if the
field is missing |
+
+Readers may be more strict for metadata JSON files because a v1 JSON file will
not appear in a v2 table. Required v2 fields that were not present in v1 or
optional in v1 may be handled as required fields.
+
### Version 2
Writing v1 metadata:
-* Table metadata field `last-sequence-number` should not be written.
-* Snapshot field `sequence-number` should not be written.
+* Table metadata field `last-sequence-number` should not be written
+* Snapshot field `sequence-number` should not be written
+* Manifest entry field `sequence_number` should not be written
Review comment:
I added more fields here.
##########
File path: site/docs/spec.md
##########
@@ -979,30 +979,88 @@ This serialization scheme is for storing single values as
individual binary valu
## Format version changes
+Some tables in this spec have columns that specify requirements for v1 and v2
tables. These requirements are intended for writers when adding metadata files
to a table with the given version.
+
+| Requirement | Write behavior |
+|-------------|----------------|
+| (blank) | The field should be omitted |
+| _optional_ | The field can be written |
+| _required_ | The field must be written |
+
+Readers should be more permissive because v1 metadata files are allowed in v2
tables so that tables can be upgraded to v2 without rewriting the metadata
tree. For manifest list and manifest files, this table shows the expected v2
read behavior:
+
+| v1 | v2 | v2 read behavior |
+|------------|------------|------------------|
+| | _optional_ | Read the field as _optional_ |
+| | _required_ | Read the field as _optional_; it may be missing in
v1 files |
+| _optional_ | | Ignore the field |
+| _optional_ | _optional_ | Read the field as _optional_ |
+| _optional_ | _required_ | Read the field as _optional_; it may be missing in
v1 files |
+| _required_ | | Ignore the field |
+| _required_ | _optional_ | Read the field as _optional_ |
+| _required_ | _required_ | Fill in a default or throw an exception if the
field is missing |
+
+Readers may be more strict for metadata JSON files because a v1 JSON file will
not appear in a v2 table. Required v2 fields that were not present in v1 or
optional in v1 may be handled as required fields.
Review comment:
I updated this with an example and clarified the wording:
> Readers may be more strict for metadata JSON files because the JSON files
are not reused and will always match the table version. Required v2 fields that
were not present in v1 or optional in v1 may be handled as required fields. For
example, a v2 table that is missing `last-sequence-number` can throw an
exception.
##########
File path: site/docs/spec.md
##########
@@ -979,30 +979,88 @@ This serialization scheme is for storing single values as
individual binary valu
## Format version changes
+Some tables in this spec have columns that specify requirements for v1 and v2
tables. These requirements are intended for writers when adding metadata files
to a table with the given version.
+
+| Requirement | Write behavior |
Review comment:
Moved this under definitions.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]