Fokko commented on code in PR #139:
URL: https://github.com/apache/iceberg-python/pull/139#discussion_r1412913056
##########
pyiceberg/table/__init__.py:
##########
@@ -350,6 +357,241 @@ class RemovePropertiesUpdate(TableUpdate):
removals: List[str]
+class TableMetadataUpdateContext:
+ updates: List[TableUpdate]
+ last_added_schema_id: Optional[int]
Review Comment:
Thanks for the thorough explanation.
> In UpdateSchema.commit(): We should check if new_schema is identical to
any existing schemas before adding an AddSchema and a SetCurrentSchema. If they
are identical, only a SetCurrentSchema(existing_schema_id) is necessary.
Yes, this makes a lot of sense to me.
> Regarding lastAddedSchemaId: It’s crucial to ensure that if it's not None,
it points to a schema added earlier in the current transaction. Since Python
transactions contain only one AddSchema, this should be straightforward. We
could also consider adding a uniqueness check in update_table_metadata to
confirm that the updates are from a single transaction. Also, instead of having
a lastAddedSchemaId, we may add a method called last_added_schema() to extract
the last added schema from the update context.
This also makes sense to me. However one thing to note. The
`lastAddedSchemaId` sounds to me like an implementation detail from Java. In
PyIceberg the current situation is simpler as you explained, so we could just
do `max(schema.id from tbl.schemas) + 1`.
You sparked my interest here, let me double check this on the REST side as
well 👍
##########
pyiceberg/table/__init__.py:
##########
@@ -350,6 +357,241 @@ class RemovePropertiesUpdate(TableUpdate):
removals: List[str]
+class TableMetadataUpdateContext:
+ updates: List[TableUpdate]
+ last_added_schema_id: Optional[int]
Review Comment:
Thanks for the thorough explanation.
> In UpdateSchema.commit(): We should check if new_schema is identical to
any existing schemas before adding an AddSchema and a SetCurrentSchema. If they
are identical, only a SetCurrentSchema(existing_schema_id) is necessary.
Yes, this makes a lot of sense to me.
> Regarding lastAddedSchemaId: It’s crucial to ensure that if it's not None,
it points to a schema added earlier in the current transaction. Since Python
transactions contain only one AddSchema, this should be straightforward. We
could also consider adding a uniqueness check in update_table_metadata to
confirm that the updates are from a single transaction. Also, instead of having
a lastAddedSchemaId, we may add a method called last_added_schema() to extract
the last added schema from the update context.
This also makes sense to me. However one thing to note. The
`lastAddedSchemaId` sounds to me like an implementation detail from Java. In
PyIceberg the current situation is simpler as you explained, so we could just
do `max(schema.id from tbl.schemas) + 1`.
You sparked my interest here, let me double check this on the REST side as
well 👍
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]