Hi everyone,

As we are starting to have more use cases around the time travel related
features in Iceberg, a few questions popped up around the correct schema to
use for time travel queries.

*Latest table schema vs schema at snapshot time*
There was a discussion around this in
https://github.com/apache/iceberg/pull/1508, but I would like to just
revisit it a bit, because in many applications the schema at read time is
fixed, and the same schema might be desirable for querying an old snapshot.
Considering the following case:

t0: create table sample (id bigint, category string, data string)
t1: insert into sample (1, 'c1', data1'), (2, 'c2', 'data2')
t2: alter table sample add column ts timestamp
t3: insert into sample (3, 'c3', 'dat3', timestamp '2000-01-01')

The application that reads the table might already be updated to query
columns (id, data, ts), and just expect the value of ts to be null for data
before t2 when it performs a time travel. However, if time travel to t1
gives data in the old schema, it will not find the ts column and fail the
query. So things get a lot more complex at application side because it
needs to distinguish between the query as-of-time and use different logic
to handle data. This gets very convoluted very quickly as more schema
evolutions are added.

While I do see the great value of using schema at snapshot time in many
applications and I am all for this feature, I think for some applications
that means we are pushing a lot of complexity of handling multi-schema to
the users. Maybe some kind of option is needed for users to better control
the behavior to fit their specific use cases.

*Schema for time-based travel*
As of today, we have 2 time travel syntaxes (use Hive syntax as example):
- SELECT * FROM table FOR SYSTEM_TIME AS OF timestamp
- SELECT * FROM table FOR SYSTEM_VERSION AS OF version

For SYSTEM_VERSION based travel, it is pretty clear that we will time
travel to a specific snapshot based on the snapshot ID, and the schema can
be resolved by the solution of the last question. However, for SYSTEM_TIME
based travel, there is ambiguity around the exact schema to use, because
Iceberg does not create a new snapshot for table metadata operations like
schema evolution.

Consider the same use case example as above, If the user queries SELECT *
FROM table FOR SYSTEM_TIME AS OF t2, then it is ambiguous if we should use
the schema of the snapshot at t1, or the updated schema at t2. Currently
the behavior is to use a schema of snapshot at t1, which might not be the
intention of the user.

*Thoughts about the solution*
Considering the 2 cases above, it seems to me that the most flexible way to
resolve this is to provide users a way to override the schema to use for
time travel queries. The default can be kept as using the snapshot schema,
but another schema could be used if the user specifies the schema ID. This
provides a lot of flexibility for use cases like table migration and schema
evolution coordination across microservices.

Any thoughts around this?

Best,
Jack Ye

Reply via email to