Re: [PR] Update table metadata [iceberg-python]

via GitHub Sat, 02 Dec 2023 18:46:25 -0800


Fokko commented on code in PR #139:
URL: https://github.com/apache/iceberg-python/pull/139#discussion_r1412913056



##########
pyiceberg/table/__init__.py:
##########
@@ -350,6 +357,241 @@ class RemovePropertiesUpdate(TableUpdate):
     removals: List[str]
 
 
+class TableMetadataUpdateContext:
+    updates: List[TableUpdate]
+    last_added_schema_id: Optional[int]

Review Comment:
   Thanks for the thorough explanation. 
   
   > In UpdateSchema.commit(): We should check if new_schema is identical to 
any existing schemas before adding an AddSchema and a SetCurrentSchema. If they 
are identical, only a SetCurrentSchema(existing_schema_id) is necessary.
   
   Yes, this makes a lot of sense to me. 
   
   > Regarding lastAddedSchemaId: It’s crucial to ensure that if it's not None, 
it points to a schema added earlier in the current transaction. Since Python 
transactions contain only one AddSchema, this should be straightforward. We 
could also consider adding a uniqueness check in update_table_metadata to 
confirm that the updates are from a single transaction. Also, instead of having 
a lastAddedSchemaId, we may add a method called last_added_schema() to extract 
the last added schema from the update context.
   
   This also makes sense to me. However one thing to note. The 
`lastAddedSchemaId` sounds to me like an implementation detail from Java. In 
PyIceberg the current situation is simpler as you explained, so we could just 
do `max(schema.id from tbl.schemas) + 1`.
   
   You sparked my interest here, let me double check this on the REST side as 
well 👍 
   



##########
pyiceberg/table/__init__.py:
##########
@@ -350,6 +357,241 @@ class RemovePropertiesUpdate(TableUpdate):
     removals: List[str]
 
 
+class TableMetadataUpdateContext:
+    updates: List[TableUpdate]
+    last_added_schema_id: Optional[int]

Review Comment:
   Thanks for the thorough explanation. 
   
   > In UpdateSchema.commit(): We should check if new_schema is identical to 
any existing schemas before adding an AddSchema and a SetCurrentSchema. If they 
are identical, only a SetCurrentSchema(existing_schema_id) is necessary.
   
   Yes, this makes a lot of sense to me. 
   
   > Regarding lastAddedSchemaId: It’s crucial to ensure that if it's not None, 
it points to a schema added earlier in the current transaction. Since Python 
transactions contain only one AddSchema, this should be straightforward. We 
could also consider adding a uniqueness check in update_table_metadata to 
confirm that the updates are from a single transaction. Also, instead of having 
a lastAddedSchemaId, we may add a method called last_added_schema() to extract 
the last added schema from the update context.
   
   This also makes sense to me. However one thing to note. The 
`lastAddedSchemaId` sounds to me like an implementation detail from Java. In 
PyIceberg the current situation is simpler as you explained, so we could just 
do `max(schema.id from tbl.schemas) + 1`.
   
   You sparked my interest here, let me double check this on the REST side as 
well 👍 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Update table metadata [iceberg-python]

Reply via email to