namrathamyske opened a new issue, #9737:
URL: https://github.com/apache/iceberg/issues/9737
### Apache Iceberg version
main (development)
### Query engine
None
### Please describe the bug 🐞
Regarding this PR: https://github.com/apache/iceberg/pull/9131 - the change
reads as: Schema for a branch should return table schema.
Shouldn't the schema of a branch be the same as when the branch was created,
rather than (as in the above change) following later schema changes made on
the table? Isn't the point of branching to create a baseline from the state of
the table's data and metadata at the time it was branched?
Can you please help me understand the rationale behind this change?
Please consider this example:
```
-- create a table with a single column and insert a value
spark-sql (default)> create table t (s string);
spark-sql (default)> insert into t values ('foo');
-- create a branch, the schema is the same as the original table
spark-sql (default)> alter table t create branch b1;
```
Describe and Query the table & branch:
```
spark-sql (default)> describe default.t;
s string
spark-sql (default)> select * from default.t;
s
foo
spark-sql (default)> describe default.t.branch_b1;
s string
spark-sql (default)> select * from default.t.branch_b1;
s
foo
```
Alter the table using the statements below, so that its definition diverges
from the branch:
```
spark-sql (default)> alter table t add column i int;
spark-sql (default)> alter table t drop column s;
spark-sql (default)> insert into t values (111);
```
Behavior before the above PR: [Please NOTE that the changes on the main
branch DID NOT IMPACT the data and metadata on the branch, which looks like
the desirable behavior for any branching concept]
```
spark-sql (default)> describe default.t;
i int
spark-sql (default)> select * from default.t;
i
111
spark-sql (default)> describe default.t.branch_b1;
s string
spark-sql (default)> select * from default.t.branch_b1;
s
foo
```
Behavior after the above PR: [Please NOTE that a schema change on the main
branch IMPACTED the data and metadata available on the branch, which feels
like undesirable behavior]
```
spark-sql (default)> describe default.t;
i int
spark-sql (default)> select * from default.t;
i
111
spark-sql (default)> describe default.t.branch_b1;
i int
spark-sql (default)> select * from default.t.branch_b1;
i
--no-data--
```
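To make the difference between the two behaviors concrete, here is a minimal sketch of the two ways a reader could pick the schema for a branch. This is a toy model only: `BranchSchemaModel` and all of its methods are hypothetical names invented for illustration and are not Iceberg classes or APIs. It assumes each branch remembers the id of the schema that was current when the branch was created.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical types, not Iceberg classes.
class BranchSchemaModel {
    // schema id -> column names of that schema version
    private final Map<Integer, List<String>> schemas = new LinkedHashMap<>();
    // branch name -> id of the schema that was current when the branch was created
    private final Map<String, Integer> branchSchemaIds = new LinkedHashMap<>();
    private int currentSchemaId = -1;

    // Registers a new schema version and makes it the table's current schema.
    int evolveSchema(List<String> columns) {
        currentSchemaId++;
        schemas.put(currentSchemaId, columns);
        return currentSchemaId;
    }

    // Creates a branch that records the schema id current at creation time.
    void createBranch(String name) {
        branchSchemaIds.put(name, currentSchemaId);
    }

    // Strategy matching the "before" transcript: use the schema recorded for the branch.
    List<String> schemaAtBranchSnapshot(String branch) {
        Integer schemaId = branchSchemaIds.get(branch);
        if (schemaId == null) {
            throw new IllegalArgumentException("Unknown branch: " + branch);
        }
        return schemas.get(schemaId);
    }

    // Strategy matching the "after" transcript: always use the table's current schema.
    List<String> currentTableSchema() {
        return schemas.get(currentSchemaId);
    }

    public static void main(String[] args) {
        BranchSchemaModel t = new BranchSchemaModel();
        t.evolveSchema(List.of("s"));   // create table t (s string)
        t.createBranch("b1");           // alter table t create branch b1
        t.evolveSchema(List.of("i"));   // add column i, drop column s

        System.out.println(t.schemaAtBranchSnapshot("b1")); // prints [s]
        System.out.println(t.currentTableSchema());         // prints [i]
    }
}
```

With the first strategy the branch keeps describing itself as `s string`, which matches the "before" output above; with the second, the branch is described as `i int`, which matches the "after" output.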
Unit test to replicate the issue:
```
@Test
public void testSchemaChange() throws Exception {
  Assume.assumeFalse(
      "Avro does not support metadata delete", fileFormat.equals("avro"));
  createAndInitUnpartitionedTable();
  sql(
      "INSERT INTO TABLE %s VALUES (1, 'hr'), (2, 'hardware'), (null, 'hr')",
      tableName);
  createBranchIfNeeded();

  // Read the branch before any schema change
  String sql = String.format("SELECT * FROM %s ORDER BY id", selectTarget());
  spark.sql(sql).show();
  /*
   * +----+--------+
   * |  id|     dep|
   * +----+--------+
   * |NULL|      hr|
   * |   1|      hr|
   * |   2|hardware|
   * +----+--------+
   */

  // Schema change on the table: drop the "dep" column
  Table table = Spark3Util.loadIcebergTable(spark, tableName);
  table.refresh();
  table.updateSchema().deleteColumn("dep").commit();

  // Read the branch again after the schema change
  sql = String.format("SELECT * FROM %s ORDER BY id", selectTarget());
  spark.sql(sql).show();
  /*
   * Data loss in branch: the "dep" column is gone because the branch is
   * read with the table schema
   * +----+
   * |  id|
   * +----+
   * |NULL|
   * |   1|
   * |   2|
   * +----+
   */

  // Read the table itself
  sql = String.format("SELECT * FROM %s ORDER BY id", tableName);
  spark.sql(sql).show();
  /*
   * +----+
   * |  id|
   * +----+
   * |NULL|
   * |   1|
   * |   2|
   * +----+
   */
}
```