[GitHub] [iceberg] rdblue commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

GitBox Mon, 29 Mar 2021 16:24:04 -0700


rdblue commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r603676375




##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -529,7 +530,8 @@ public TableMetadata updatePartitionSpec(PartitionSpec 
newPartitionSpec) {
         .addAll(specs);
     if (!specsById.containsKey(newDefaultSpecId)) {
       // get a fresh spec to ensure the spec ID is set to the new default
-      builder.add(freshSpec(newDefaultSpecId, schema, newPartitionSpec));
+      builder.add(freshSpec(newDefaultSpecId, schema, newPartitionSpec, 
formatVersion,
+          specs, new AtomicInteger(lastAssignedPartitionId)));

Review comment:
       The problem from [my 
comment](https://github.com/apache/iceberg/pull/2089#discussion_r567154516) 
that needs to be fixed does not affect this method. The logic in 
`BaseUpdatePartitionSpec` already covers spec evolution. The problem I was 
referring to is that when a table replacement is created, all of the metadata 
is new and can possibly conflict with the existing table's metadata.
   
   For example, if I have a table partitioned by `1000: ts_day=day(ts)` and I 
replace it with one partitioned by `1000: id_bucket=bucket(16, id)` then the 
two partition IDs collide. The `buildReplacement` method needs to handle 
reassigning IDs for partition specs just like it does for schemas.
   
   The specs that are passed into `updatePartitionSpec` are already based on 
the current spec and so the IDs are assigned correctly and are consistent with 
all previous table specs. There is no "bug" to fix in this method. However, 
there is an improvement that we can make that is similar. Currently, if you add 
a partition field that was previously part of the table but is not any longer, 
it is assigned a completely new ID.
   
   For example, if I have a table partitioned by `1000: ts_day=day(ts)` and 
then update it by dropping the `ts_day` partition, it would have one spec with 
`ts_day` and one that is unpartitioned. Then if I go back and add `day(ts)` 
again, it would create `1001: ts_day=day(ts)`. That is perfectly fine and 
consistent, but it would be better to go back and reuse ID 1000 for the new 
field.
   
   We can detect that fields should be un-deleted rather than added by going 
back to through old partition specs for the table. That will recover the old 
ID, but there is an additional issue to solve: when un-deleting a partition 
field in v1, we need to make sure that the un-deleted field replaces the `void` 
field that was added when the field was deleted. For v2, it would also be nice 
to use the original partition field order but that's not required.
   
   I think that this improvement is a good thing to do, but should not be part 
of this change because this PR is needed to fix the blocker bug with 
`buildReplacement`. Also, I think that the improvement for spec updates should 
be built into `BaseUpdatePartitionSpec` rather than done in table metadata.
   
   Could you separate this into another PR, @jun-he?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] rdblue commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Reply via email to