Thanks for the blazingly fast fix Fokko!

The issue here is that the replace behavior is hard to explain to end
users, and hard to implement correctly across all use cases and across
database implementations that treat replace as drop + recreate.
Traditionally, users who want to retain time travel information rely on the
TRUNCATE statement, which deletes all the data but keeps previous versions
time travelable. This makes expectations crisp about the ability to read
previous versions, tags, or branches.

For the alternative replace implementation, the expectations are no
different from DROP TABLE: time travel is not possible and in-flight
transactions would have to abort. This makes the behavior easy to explain,
easy to implement, and consistent with DROP TABLE.
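To make the contrast between the two semantics concrete, here is a toy
Python sketch. The SnapshotTable class and helper names are invented purely
for illustration; this is not Iceberg or PyIceberg code, just a model of
the two behaviors under discussion:

```python
# Toy model of the two REPLACE semantics discussed in this thread.
# All names here are hypothetical; they model behavior, not the real API.
import uuid
from dataclasses import dataclass

@dataclass
class SnapshotTable:
    table_uuid: str
    snapshots: list   # snapshot ids, oldest first
    current: int = -1  # -1 means "no current snapshot"

def replace_as_truncate(t: SnapshotTable) -> SnapshotTable:
    """Current Iceberg-style replace: same UUID, history kept,
    current snapshot reset to -1. Time travel still works."""
    return SnapshotTable(t.table_uuid, list(t.snapshots), current=-1)

def replace_as_drop_create(t: SnapshotTable) -> SnapshotTable:
    """Drop + create: new UUID, history (and any tags/branches) gone.
    In-flight readers of the old table must abort."""
    return SnapshotTable(str(uuid.uuid4()), [], current=-1)

old = SnapshotTable("uuid-1", [101, 102], current=102)

kept = replace_as_truncate(old)
assert kept.table_uuid == old.table_uuid  # readers can still time travel
assert kept.snapshots == [101, 102]

dropped = replace_as_drop_create(old)
assert dropped.table_uuid != old.table_uuid  # old readers must abort
assert dropped.snapshots == []
```

In both cases the table looks "empty" to a new reader (current is -1), but
only the truncate-style replace lets existing readers resolve older
snapshots, tags, or branches.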

Let me know what you think. I would also like to hear from other database
vendors about their experience explaining REPLACE semantics to end users.

Thanks,
Maninder


On Thu, Jun 26, 2025 at 12:10 AM Fokko Driesprong <fo...@apache.org> wrote:

> Hey Maninder,
>
> Thanks for raising this. For the issue that you're hitting
> <https://github.com/apache/iceberg/issues/12738>, I have a PR that fixes
> that <https://github.com/apache/iceberg/pull/11868>, and it would be
> great to get it in (maybe for 1.10.0? :).
>
> I do see the confusion with the approach that Iceberg is taking, but my
> main concern is that it essentially keeps time travel possible. It could be
> that a client is still reading an older version of the table, so we need to
> keep the metadata. Or that a process is reading a specific tag or branch of
> the table, which would be deleted with the approach you're suggesting.
>
> As suggested in the prior discussion
> <https://lists.apache.org/thread/s0c2wjdztzsh7nf8wf570kycoxxpnql3>, I
> think it would be good to add this to the implementation notes.
>
> Kind regards,
> Fokko
>
> Op do 26 jun 2025 om 05:18 schreef Maninderjit Singh <
> parmar.maninder...@gmail.com>:
>
>> Hi dev community,
>>
>> I'd like to open a discussion around the current REPLACE TABLE
>> transaction semantics, particularly how it's implemented via the REST
>> catalog, and propose clarifying the spec to allow more flexible semantics
>> and broader alignment with database and catalog vendors.
>>
>> Today, the replace table implementation
>> <https://github.com/apache/iceberg/blob/2cdff366982b30685f6410c290cbd16aed274caf/core/src/main/java/org/apache/iceberg/rest/RESTTableOperations.java#L129>
>>  for
>> REST catalog uses the updateTable
>> <https://github.com/apache/iceberg/blob/2cdff366982b30685f6410c290cbd16aed274caf/open-api/rest-catalog-open-api.yaml#L997>
>> API which retains the old table metadata (including UUID, schema,
>> properties, and snapshot), while resetting the current snapshot to -1.
>> From a SQL point of view, this acts as a (TRUNCATE [1
>> <https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-truncate-table.html>
>> ][2 <https://docs.snowflake.com/en/sql-reference/sql/truncate-table>][3
>> <https://trino.io/docs/current/sql/truncate.html>][4
>> <https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/truncate/>]
>> + optional schema evolution) rather than the commonly understood CREATE OR
>> REPLACE TABLE behavior in other database systems, which define it as
>> dropping and recreating the table (e.g., DuckDB
>> <https://duckdb.org/docs/stable/sql/statements/create_table.html#:~:text=The%20CREATE%20OR%20REPLACE%20syntax,then%20creating%20the%20new%20one.>,
>> Flink
>> <https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/>,
>> Snowflake <https://docs.snowflake.com/en/sql-reference/sql/create-table>,
>> BigQuery
>> <https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_table_statement>,
>> ClickHouse
>> <https://clickhouse.com/docs/sql-reference/statements/create/table> etc).
>> Similarly, CREATE OR REPLACE for other entities like views typically
>> implies dropping and recreating the view [1
>> <https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-view.html>
>> ][2 <https://docs.snowflake.com/en/sql-reference/sql/create-view>][3
>> <https://trino.io/docs/current/sql/create-view.html>]
>>
>> There has been prior discussion
>> <https://lists.apache.org/thread/bz4ohd1jbzctzdjlkn8d8fol3hzcfr29> regarding
>> this issue, where Iceberg's REST catalog behavior has been defined as
>> "CREATE AND UPDATE", but there wasn't a clear consensus on whether this
>> should be the only allowed behavior. There were also requests for
>> additional flexibility, since the current behavior is non-obvious to many
>> database vendors and introduces more complexity than required for the REST
>> catalog. For example, this GitHub issue
>> <https://github.com/apache/iceberg/issues/12738> shows how subtle bugs
>> can still exist with this implementation. I was able to refine it down to
>> a simple three-statement example that breaks existing implementations.
>>
>> My ask from the community is to consider allowing a more permissive
>> definition of REPLACE TABLE, where database vendors could choose to
>> implement it as drop + create. In the future, we could consider adding a
>> separate endpoint to perform drop + create atomically in the REST
>> catalog. This behavior has a few advantages: it is easy to explain to the
>> end user, easier to implement for all cases, and more aligned with many
>> other database systems. Users who want to retain their snapshot history
>> can still use TRUNCATE to get the desired behavior.
>>
>> I’d love to hear the community’s thoughts on whether we can update the
>> implementation notes in the spec to allow drop + create as an acceptable
>> behavior.
>>
>> Thanks,
>> Maninder
