Re: [PR] Introduce `assign_fresh_ids` flag and allow skipping fresh assignment of IDs on Table creation [iceberg-python]

via GitHub Sat, 23 Nov 2024 12:05:53 -0800


kevinjqliu commented on code in PR #1304:
URL: https://github.com/apache/iceberg-python/pull/1304#discussion_r1855257218



##########
pyiceberg/table/metadata.py:
##########
@@ -517,12 +517,15 @@ def new_table_metadata(
     location: str,
     properties: Properties = EMPTY_DICT,
     table_uuid: Optional[uuid.UUID] = None,
+    assign_fresh_ids: bool = True,
 ) -> TableMetadata:
     from pyiceberg.table import TableProperties
 
-    fresh_schema = assign_fresh_schema_ids(schema)
-    fresh_partition_spec = assign_fresh_partition_spec_ids(partition_spec, 
schema, fresh_schema)
-    fresh_sort_order = assign_fresh_sort_order_ids(sort_order, schema, 
fresh_schema)
+    if assign_fresh_ids:
+        fresh_schema = assign_fresh_schema_ids(schema)
+        partition_spec = assign_fresh_partition_spec_ids(partition_spec, 
schema, fresh_schema)
+        sort_order = assign_fresh_sort_order_ids(sort_order, schema, 
fresh_schema)
+        schema = fresh_schema

Review Comment:
   this is where `assign_fresh_ids` is ultimately used



##########
pyiceberg/table/metadata.py:
##########
@@ -517,12 +517,15 @@ def new_table_metadata(
     location: str,
     properties: Properties = EMPTY_DICT,
     table_uuid: Optional[uuid.UUID] = None,
+    assign_fresh_ids: bool = True,
 ) -> TableMetadata:
     from pyiceberg.table import TableProperties
 
-    fresh_schema = assign_fresh_schema_ids(schema)
-    fresh_partition_spec = assign_fresh_partition_spec_ids(partition_spec, 
schema, fresh_schema)
-    fresh_sort_order = assign_fresh_sort_order_ids(sort_order, schema, 
fresh_schema)
+    if assign_fresh_ids:
+        fresh_schema = assign_fresh_schema_ids(schema)
+        partition_spec = assign_fresh_partition_spec_ids(partition_spec, 
schema, fresh_schema)
+        sort_order = assign_fresh_sort_order_ids(sort_order, schema, 
fresh_schema)
+        schema = fresh_schema

Review Comment:
   and this function is called by `_create_staged_table` and `create_table`



##########
pyiceberg/table/metadata.py:
##########
@@ -517,12 +517,15 @@ def new_table_metadata(
     location: str,
     properties: Properties = EMPTY_DICT,
     table_uuid: Optional[uuid.UUID] = None,
+    assign_fresh_ids: bool = True,
 ) -> TableMetadata:
     from pyiceberg.table import TableProperties
 
-    fresh_schema = assign_fresh_schema_ids(schema)
-    fresh_partition_spec = assign_fresh_partition_spec_ids(partition_spec, 
schema, fresh_schema)
-    fresh_sort_order = assign_fresh_sort_order_ids(sort_order, schema, 
fresh_schema)
+    if assign_fresh_ids:
+        fresh_schema = assign_fresh_schema_ids(schema)
+        partition_spec = assign_fresh_partition_spec_ids(partition_spec, 
schema, fresh_schema)
+        sort_order = assign_fresh_sort_order_ids(sort_order, schema, 
fresh_schema)
+        schema = fresh_schema

Review Comment:
   My proposal is to not include `assign_fresh_ids` as a flag in functions 
other than `new_table_metadata`. 
   So when `_create_staged_table` and `create_table` is given
   * a `pa.Schema`, convert to `Schema` and set `assign_fresh_ids` to True in 
`new_table_metadata`
   * a `Schema`. Assume the user created `Schema` with the correct IDs 
(possibly verify some correctness characteristics such as uniqueness). And use 
the schema as is
   If a user wants to reassign IDs for `Schema`, this can be done outside the 
`create_table` functions and we can  even provide a helper function to do so. 
   
   I feel like this way can help break apart the responsibilities of schema id 
assignment from the `create_table` methods. 



##########
pyiceberg/table/metadata.py:
##########
@@ -517,12 +517,15 @@ def new_table_metadata(
     location: str,
     properties: Properties = EMPTY_DICT,
     table_uuid: Optional[uuid.UUID] = None,
+    assign_fresh_ids: bool = True,
 ) -> TableMetadata:
     from pyiceberg.table import TableProperties
 
-    fresh_schema = assign_fresh_schema_ids(schema)
-    fresh_partition_spec = assign_fresh_partition_spec_ids(partition_spec, 
schema, fresh_schema)
-    fresh_sort_order = assign_fresh_sort_order_ids(sort_order, schema, 
fresh_schema)
+    if assign_fresh_ids:
+        fresh_schema = assign_fresh_schema_ids(schema)
+        partition_spec = assign_fresh_partition_spec_ids(partition_spec, 
schema, fresh_schema)
+        sort_order = assign_fresh_sort_order_ids(sort_order, schema, 
fresh_schema)
+        schema = fresh_schema

Review Comment:
   both functions take `schema: Union[Schema, "pa.Schema"],` as input. 
   * If `pa.Schema` is given, we want to convert and assign id (this is 
currently done by [setting the `assign_fresh_ids` flag to 
True](https://github.com/apache/iceberg-python/pull/1304/files#diff-276259d15e1e921df3b4783ec29c6390c828afa76bcf707f0f4923ff7f2d4194R862-R864))
   * If `Schema` is given, currently the default is to assign the schema ids,  
`assign_fresh_ids: bool = True`.
   
   



##########
pyiceberg/table/metadata.py:
##########
@@ -517,12 +517,15 @@ def new_table_metadata(
     location: str,
     properties: Properties = EMPTY_DICT,
     table_uuid: Optional[uuid.UUID] = None,
+    assign_fresh_ids: bool = True,
 ) -> TableMetadata:
     from pyiceberg.table import TableProperties
 
-    fresh_schema = assign_fresh_schema_ids(schema)
-    fresh_partition_spec = assign_fresh_partition_spec_ids(partition_spec, 
schema, fresh_schema)
-    fresh_sort_order = assign_fresh_sort_order_ids(sort_order, schema, 
fresh_schema)
+    if assign_fresh_ids:
+        fresh_schema = assign_fresh_schema_ids(schema)
+        partition_spec = assign_fresh_partition_spec_ids(partition_spec, 
schema, fresh_schema)
+        sort_order = assign_fresh_sort_order_ids(sort_order, schema, 
fresh_schema)
+        schema = fresh_schema

Review Comment:
   LMK if this makes sense or if im missing something! 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Introduce `assign_fresh_ids` flag and allow skipping fresh assignment of IDs on Table creation [iceberg-python]

Reply via email to