Re: [I] `system.add_files` utility does not support updated Partition Spec [iceberg]

via GitHub Thu, 21 Mar 2024 19:36:05 -0700


amogh-jahagirdar commented on issue #10008:
URL: https://github.com/apache/iceberg/issues/10008#issuecomment-2014209251


   I looked into this a bit and I think I know the problem. Here's a sample 
test that can be added to `TestAddFilesProcedure` to reproduce it:
   
   ```
     @TestTemplate
     public void addFilesPartitionEvolved() {
       // Target Iceberg table, initially partitioned by p1 only.
       createIcebergTable(
               "p1 int, p2 int, data int not null", "PARTITIONED BY (p1)");
   
       // Evolve the target's partition spec to (p1, p2).
       sql("ALTER TABLE %s ADD PARTITION FIELD p2", tableName);

       // Source Parquet table with Hive-style partitioning on (p1, p2).
       String createParquet =
               "CREATE TABLE %s (p1 int, p2 int, data int) USING %s "
                       + "PARTITIONED BY (p1, p2) LOCATION '%s'";
       sql(createParquet, sourceTableName, "parquet", fileTableDir.getAbsolutePath());
       sql("INSERT INTO %s PARTITION (p1=1, p2=10) VALUES (100)", sourceTableName);

       // Import the source table's files, then read from the target table.
       List<Object[]> result =
               sql(
                       "CALL %s.system.add_files('%s', '%s')",
                       catalogName, tableName, sourceTableName);
       sql("SELECT * FROM %s", tableName);
     }
   ```
   
   When we import the partitions, we derive an Iceberg partition spec from the 
Hive-style partitioning here:
https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java#L430
   
   This new partition spec will have a spec ID of 0 (the same spec ID as when 
you created the Iceberg table).
   This is the spec that gets used when writing the manifests here 
https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java#L350
   
   But in the target Iceberg table, the spec with (p1, p2) actually has spec ID 1.
   
   I'll need to think more about what the right solution is, but on the surface 
it seems like the right thing to do here is to:
   
   1.) Derive the partition spec from the source table partitioning.
   2.) See if that same spec exists in the target table.
   3.) If so, build a copy of the derived partition spec but with the spec ID 
that the target table uses for it.
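   
   As a rough illustration of those three steps, here is a self-contained 
sketch that models a partition spec as just an ordered list of partition column 
names. `SpecIdResolver` and `findMatchingSpecId` are hypothetical names, not 
Iceberg APIs, and a real fix would have to compare actual partition specs 
(including transforms) rather than plain lists.
   
   ```java
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;
   import java.util.OptionalInt;

   // Hypothetical, simplified model: a "spec" is an ordered list of
   // identity-partition column names, keyed by its spec ID.
   public class SpecIdResolver {

     // Step 2: look for a target spec equal to the one derived from the source.
     // Step 3: if found, return the target's spec ID so it can be reused.
     public static OptionalInt findMatchingSpecId(
         Map<Integer, List<String>> targetSpecs, List<String> derivedSpec) {
       for (Map.Entry<Integer, List<String>> entry : targetSpecs.entrySet()) {
         if (entry.getValue().equals(derivedSpec)) {
           return OptionalInt.of(entry.getKey());
         }
       }
       return OptionalInt.empty();
     }

     public static void main(String[] args) {
       // Target table: spec 0 is (p1), spec 1 is (p1, p2) after the ALTER TABLE.
       Map<Integer, List<String>> targetSpecs = new LinkedHashMap<>();
       targetSpecs.put(0, List.of("p1"));
       targetSpecs.put(1, List.of("p1", "p2"));

       // Step 1: spec derived from the source table's Hive-style partitioning.
       List<String> derivedSpec = List.of("p1", "p2");

       OptionalInt id = findMatchingSpecId(targetSpecs, derivedSpec);
       System.out.println(
           id.isPresent() ? "use spec ID " + id.getAsInt() : "no matching spec");
     }
   }
   ```
   
   With the specs from the repro above this prints `use spec ID 1`. An empty 
result corresponds to the case where the target's partitioning is completely 
different, which this sketch does not try to resolve.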
   
   But that seems like too specific a fix for this. I'm also not sure what the 
behavior of the procedure is if the partition spec on the target is completely 
different.


