Hi dev community,

I'd like to open a discussion around the current REPLACE TABLE transaction
semantics, particularly how it's implemented via the REST catalog, and
propose clarifying the spec to allow more flexible semantics to allow
broader alignment with database and catalog vendors.

Today, the replace table implementation
<https://github.com/apache/iceberg/blob/2cdff366982b30685f6410c290cbd16aed274caf/core/src/main/java/org/apache/iceberg/rest/RESTTableOperations.java#L129>
for
REST catalog uses the updateTable
<https://github.com/apache/iceberg/blob/2cdff366982b30685f6410c290cbd16aed274caf/open-api/rest-catalog-open-api.yaml#L997>
API which retains the old table metadata (including UUID, schema,
properties, and snapshot), while resetting the current snapshot to -1.
>From SQL point of view, this acts as a (TRUNCATE [1
<https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-truncate-table.html>
][2 <https://docs.snowflake.com/en/sql-reference/sql/truncate-table>][3
<https://trino.io/docs/current/sql/truncate.html>][4
<https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/truncate/>]
+ Optional Schema evolution) rather than a commonly understood CREATE OR
REPLACE table behavior used in other database systems which define it as
drop and recreation of the table (eg: DuckDb
<https://duckdb.org/docs/stable/sql/statements/create_table.html#:~:text=The%20CREATE%20OR%20REPLACE%20syntax,then%20creating%20the%20new%20one.>,
Flink
<https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/>,
Snowflake <https://docs.snowflake.com/en/sql-reference/sql/create-table>,
BigQuery
<https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_table_statement>,
ClickHouse
<https://clickhouse.com/docs/sql-reference/statements/create/table> etc).
Similarly, CREATE OR REPLACE for other entities like views typically
implies dropping and recreating the view [1
<https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-view.html>][
2 <https://docs.snowflake.com/en/sql-reference/sql/create-view>][3
<https://trino.io/docs/current/sql/create-view.html>]

There has been prior discussion
<https://lists.apache.org/thread/bz4ohd1jbzctzdjlkn8d8fol3hzcfr29> regarding
this issue where iceberg's REST catalog's behavior has been defined as
"CREATE AND UPDATE" but there wasn't a clear consensus on whether this
should be the only allowed behavior. There were also requests to allow
additional flexibility since the current behavior is non-obvious to many
database vendors and introduces more complexity than required for the REST
catalog. For example, this github issue
<https://github.com/apache/iceberg/issues/12738> is a good example of how
subtle bugs can still exist with this implementation. I was able to refine
this to create a 3 simple statement example that would break existing
implementations.

My ask from the community is to consider allowing more permissible
definition of replace table where databases vendors could choose to
implement it as drop + create. In future, we can consider creating a
separate endpoint to do drop+create atomically in the REST catalog. This
behavior has few advantages since it is very easy to explain to the end
user, easier to implement for all the cases and is more aligned with many
other database systems. For users that want to retain their snapshot
history, they can still consider using TRUNCATE to get the desired behavior.

I’d love to hear the community’s thoughts on whether we can update the
implementation notes in the spec to allow having drop+create as an
acceptable behavior.

Thanks,
Maninder

Reply via email to