Hey Maninder,

Thanks for raising this. For the issue that you're hitting <https://github.com/apache/iceberg/issues/12738>, I have a PR that fixes it <https://github.com/apache/iceberg/pull/11868>, and it would be great to get it in (maybe for 1.10.0? :).
I do see the confusion with the approach that Iceberg is taking, but my main consideration is that it keeps time travel possible. It could be that a client is still reading an older version of the table, so we need to keep the metadata, or that a process is reading a specific tag or branch of the table, which would be deleted under the approach you're suggesting. As suggested in the prior discussion <https://lists.apache.org/thread/s0c2wjdztzsh7nf8wf570kycoxxpnql3>, I think it would be good to add this to the implementation notes.

Kind regards, Fokko

On Thu, 26 Jun 2025 at 05:18, Maninderjit Singh <parmar.maninder...@gmail.com> wrote:

> Hi dev community,
>
> I'd like to open a discussion around the current REPLACE TABLE transaction semantics, particularly how it is implemented via the REST catalog, and propose clarifying the spec to allow more flexible semantics and broader alignment with database and catalog vendors.
>
> Today, the replace table implementation <https://github.com/apache/iceberg/blob/2cdff366982b30685f6410c290cbd16aed274caf/core/src/main/java/org/apache/iceberg/rest/RESTTableOperations.java#L129> for the REST catalog uses the updateTable <https://github.com/apache/iceberg/blob/2cdff366982b30685f6410c290cbd16aed274caf/open-api/rest-catalog-open-api.yaml#L997> API, which retains the old table metadata (including UUID, schema, properties, and snapshots) while resetting the current snapshot to -1. From a SQL point of view, this acts as TRUNCATE [1 <https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-truncate-table.html>][2 <https://docs.snowflake.com/en/sql-reference/sql/truncate-table>][3 <https://trino.io/docs/current/sql/truncate.html>][4 <https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/truncate/>] plus optional schema evolution, rather than the commonly understood CREATE OR REPLACE behavior in other database systems, which define it as dropping and recreating the table (e.g. DuckDB <https://duckdb.org/docs/stable/sql/statements/create_table.html#:~:text=The%20CREATE%20OR%20REPLACE%20syntax,then%20creating%20the%20new%20one.>, Flink <https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/>, Snowflake <https://docs.snowflake.com/en/sql-reference/sql/create-table>, BigQuery <https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_table_statement>, ClickHouse <https://clickhouse.com/docs/sql-reference/statements/create/table>, etc.). Similarly, CREATE OR REPLACE for other entities like views typically implies dropping and recreating the view [1 <https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-view.html>][2 <https://docs.snowflake.com/en/sql-reference/sql/create-view>][3 <https://trino.io/docs/current/sql/create-view.html>].
>
> There has been prior discussion <https://lists.apache.org/thread/bz4ohd1jbzctzdjlkn8d8fol3hzcfr29> regarding this issue, where the Iceberg REST catalog's behavior has been defined as "CREATE AND UPDATE", but there wasn't a clear consensus on whether this should be the only allowed behavior. There were also requests to allow additional flexibility, since the current behavior is non-obvious to many database vendors and introduces more complexity than required for the REST catalog. For example, this GitHub issue <https://github.com/apache/iceberg/issues/12738> is a good example of how subtle bugs can still exist with this implementation.
> I was able to refine this into a simple three-statement example that would break existing implementations.
>
> My ask of the community is to consider allowing a more permissive definition of replace table, where database vendors could choose to implement it as drop + create. In the future, we can consider adding a separate endpoint to perform drop + create atomically in the REST catalog. This behavior has a few advantages: it is easy to explain to the end user, easier to implement for all cases, and more aligned with many other database systems. Users who want to retain their snapshot history can still use TRUNCATE to get the desired behavior.
>
> I'd love to hear the community's thoughts on whether we can update the implementation notes in the spec to allow drop + create as an acceptable behavior.
>
> Thanks,
> Maninder
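[Editor's note: to make the difference under discussion concrete, here is a minimal, purely illustrative sketch in Spark SQL syntax. The table name, columns, and snapshot id are made up, and this is not the three-statement reproduction Maninder references above.]

    -- Assumes an Iceberg table in a REST catalog; names and ids are illustrative.
    CREATE TABLE db.t (id INT) USING iceberg;
    INSERT INTO db.t VALUES (1);                          -- creates snapshot S1

    CREATE OR REPLACE TABLE db.t (id INT, name STRING) USING iceberg;

    -- Current Iceberg behavior ("create or update"): the table keeps its UUID and
    -- metadata history while the current snapshot is reset, so S1 stays reachable
    -- and readers pinned to an older version, tag, or branch keep working:
    SELECT * FROM db.t VERSION AS OF 1234567890123456789; -- S1's snapshot id

    -- Drop + create semantics (as in DuckDB, Snowflake, BigQuery, ...): the
    -- replaced table is a new entity with a new UUID and no history, so the
    -- time-travel query above would fail and any branches or tags would be gone.

    -- Under drop + create semantics, users who want to keep history could instead
    -- use TRUNCATE plus schema evolution, roughly matching today's REPLACE:
    TRUNCATE TABLE db.t;
    ALTER TABLE db.t ADD COLUMN name STRING;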