[I] branch schema affected by main table schema [iceberg]

via GitHub Fri, 16 Feb 2024 09:25:38 -0800


namrathamyske opened a new issue, #9737:
URL: https://github.com/apache/iceberg/issues/9737


   ### Apache Iceberg version
   
   main (development)
   
   ### Query engine
   
   None
   
   ### Please describe the bug 🐞
   
   Regarding PR https://github.com/apache/iceberg/pull/9131: the change reads as
   "Schema for a branch should return table schema".
   Shouldn't the schema of a branch stay the same as it was when the branch was
   created, rather than follow later schema changes on the table, as the above
   change makes it do? Isn't the point of branching to create a baseline from
   the state of the table's data and metadata at the time it was branched?
   Could you please help me understand the rationale behind this change?
   
   Please consider this example:
   ```
   -- create a table with a single column and insert a value
   spark-sql (default)> create table t (s string);
   spark-sql (default)> insert into t values ('foo');
   -- create a branch, the schema is the same as the original table
   spark-sql (default)> alter table t create branch b1;
   ```
   
   Describe and Query the table & branch:
   ```
   spark-sql (default)> describe default.t;
   s                       string
   spark-sql (default)> select * from default.t;
   s
   foo
   
   spark-sql (default)> describe default.t.branch_b1;
   s                       string
   spark-sql (default)> select * from default.t.branch_b1;
   s
   foo
   ```
   
   Alter the table with the statements below so that its definition diverges
   from the branch:
   
   ```
   spark-sql (default)> alter table t add column i int;
   spark-sql (default)> alter table t drop column s;
   
   spark-sql (default)> insert into t values (111);
   ```
   
   Behavior before the above PR: [Please NOTE that the changes in the main
   branch DID NOT IMPACT the data and metadata on the branch, which looks like
   the desirable behavior for any branching concept]
   
   ```
   spark-sql (default)> describe default.t;
   i                       int
   spark-sql (default)> select * from default.t;
   i
   111
   
   spark-sql (default)> describe default.t.branch_b1;
   s                       string
   spark-sql (default)> select * from default.t.branch_b1;
   s
   foo
   ```
   
   Behavior after the above PR: [Please NOTE that a schema change in the main
   branch IMPACTED the data and metadata available on the branch; this feels
   like undesirable behavior]
   
   ```
   spark-sql (default)> describe default.t;
   i                       int
   spark-sql (default)> select * from default.t;
   i
   111
   
   spark-sql (default)> describe default.t.branch_b1;
   i                       int
   spark-sql (default)> select * from default.t.branch_b1;
   i
   --no-data--
   ```
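
   To make the two behaviors concrete, here is a minimal toy model of the
   example above. It is only a sketch: the classes `Snapshot` and `Table` and
   the methods `schemaAtBranchHead` and `currentTableSchema` are hypothetical
   names invented for illustration and are not Iceberg's API. It assumes that
   each snapshot remembers the id of the schema that was current when it was
   written, and it contrasts resolving a branch's schema from its head snapshot
   (what the "before" output suggests) with resolving it from the table's
   current schema (what the "after" output suggests).

   ```java
   import java.util.Arrays;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;

   // Toy model only: these classes are hypothetical and are NOT Iceberg's API.
   public class BranchSchemaSketch {

     // A snapshot remembers the id of the schema that was current when it was written.
     static final class Snapshot {
       final long id;
       final int schemaId;

       Snapshot(long id, int schemaId) {
         this.id = id;
         this.schemaId = schemaId;
       }
     }

     static final class Table {
       final Map<Integer, List<String>> schemas = new HashMap<>(); // schema id -> column names
       final Map<String, Snapshot> refs = new HashMap<>();         // branch name -> head snapshot
       int currentSchemaId;

       // Strategy A: read the branch with the schema recorded on its head snapshot.
       List<String> schemaAtBranchHead(String branch) {
         return schemas.get(refs.get(branch).schemaId);
       }

       // Strategy B: read the branch with the table's current schema, ignoring the branch.
       List<String> currentTableSchema(String branch) {
         return schemas.get(currentSchemaId);
       }
     }

     // Replays the spark-sql session from this issue against the toy model.
     static Table example() {
       Table t = new Table();
       t.schemas.put(0, Arrays.asList("s"));      // create table t (s string)
       t.currentSchemaId = 0;
       Snapshot first = new Snapshot(1L, 0);      // insert into t values ('foo')
       t.refs.put("main", first);
       t.refs.put("b1", first);                   // alter table t create branch b1

       t.schemas.put(1, Arrays.asList("s", "i")); // alter table t add column i int
       t.schemas.put(2, Arrays.asList("i"));      // drop column s
       t.currentSchemaId = 2;
       t.refs.put("main", new Snapshot(2L, 2));   // insert into t values (111)
       return t;
     }

     public static void main(String[] args) {
       Table t = example();
       System.out.println("branch-head schema for b1:   " + t.schemaAtBranchHead("b1"));
       System.out.println("current-table schema for b1: " + t.currentTableSchema("b1"));
     }
   }
   ```

   In this model, strategy A keeps `b1` on column `s` after the main table
   changes, while strategy B moves `b1` to column `i` even though no commit was
   ever made on the branch.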
   
   Unit test to replicate the issue:
   
   ```
     @Test
     public void testSchemaChange() throws Exception {
       Assume.assumeFalse(
           "Avro does not support metadata delete", fileFormat.equals("avro"));
       createAndInitUnpartitionedTable();
   
       sql(
           "INSERT INTO TABLE %s VALUES (1, 'hr'), (2, 'hardware'), (null, 'hr')",
           tableName);
       createBranchIfNeeded();
   
       String sql = String.format("SELECT * FROM %s ORDER BY id", selectTarget());
       spark.sql(sql).show();
       /**
        * +----+--------+
        * |  id|     dep|
        * +----+--------+
        * |NULL|      hr|
        * |   1|      hr|
        * |   2|hardware|
        * +----+--------+
        */
       // Metadata-only schema change: drop column "dep" from the table
       Table table = Spark3Util.loadIcebergTable(spark, tableName);
       table.refresh();
       table.updateSchema().deleteColumn("dep").commit();
       sql = String.format("SELECT * FROM %s ORDER BY id", selectTarget());
       spark.sql(sql).show();
       /**
        * Data loss in branch, impacted as we consume schema from table schema
        * +----+
        * |  id|
        * +----+
        * |NULL|
        * |   1|
        * |   2|
        * +----+
        */
       sql = String.format("SELECT * FROM %s ORDER BY id", tableName);
       spark.sql(sql).show();
       /**
        * +----+
        * |  id|
        * +----+
        * |NULL|
        * |   1|
        * |   2|
        * +----+
        */
     }
   ```
   
   
   
   

