Hi dev community, I'd like to open a discussion around the current REPLACE TABLE transaction semantics, particularly how it's implemented via the REST catalog, and propose clarifying the spec to allow more flexible semantics to allow broader alignment with database and catalog vendors.
Today, the replace table implementation <https://github.com/apache/iceberg/blob/2cdff366982b30685f6410c290cbd16aed274caf/core/src/main/java/org/apache/iceberg/rest/RESTTableOperations.java#L129> for REST catalog uses the updateTable <https://github.com/apache/iceberg/blob/2cdff366982b30685f6410c290cbd16aed274caf/open-api/rest-catalog-open-api.yaml#L997> API which retains the old table metadata (including UUID, schema, properties, and snapshot), while resetting the current snapshot to -1. >From SQL point of view, this acts as a (TRUNCATE [1 <https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-truncate-table.html> ][2 <https://docs.snowflake.com/en/sql-reference/sql/truncate-table>][3 <https://trino.io/docs/current/sql/truncate.html>][4 <https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/truncate/>] + Optional Schema evolution) rather than a commonly understood CREATE OR REPLACE table behavior used in other database systems which define it as drop and recreation of the table (eg: DuckDb <https://duckdb.org/docs/stable/sql/statements/create_table.html#:~:text=The%20CREATE%20OR%20REPLACE%20syntax,then%20creating%20the%20new%20one.>, Flink <https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/>, Snowflake <https://docs.snowflake.com/en/sql-reference/sql/create-table>, BigQuery <https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_table_statement>, ClickHouse <https://clickhouse.com/docs/sql-reference/statements/create/table> etc). Similarly, CREATE OR REPLACE for other entities like views typically implies dropping and recreating the view [1 <https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-view.html>][ 2 <https://docs.snowflake.com/en/sql-reference/sql/create-view>][3 <https://trino.io/docs/current/sql/create-view.html>] There has been prior discussion <https://lists.apache.org/thread/bz4ohd1jbzctzdjlkn8d8fol3hzcfr29> regarding this issue where iceberg's REST catalog's behavior has been defined as "CREATE AND UPDATE" but there wasn't a clear consensus on whether this should be the only allowed behavior. There were also requests to allow additional flexibility since the current behavior is non-obvious to many database vendors and introduces more complexity than required for the REST catalog. For example, this github issue <https://github.com/apache/iceberg/issues/12738> is a good example of how subtle bugs can still exist with this implementation. I was able to refine this to create a 3 simple statement example that would break existing implementations. My ask from the community is to consider allowing more permissible definition of replace table where databases vendors could choose to implement it as drop + create. In future, we can consider creating a separate endpoint to do drop+create atomically in the REST catalog. This behavior has few advantages since it is very easy to explain to the end user, easier to implement for all the cases and is more aligned with many other database systems. For users that want to retain their snapshot history, they can still consider using TRUNCATE to get the desired behavior. I’d love to hear the community’s thoughts on whether we can update the implementation notes in the spec to allow having drop+create as an acceptable behavior. Thanks, Maninder