Re: [DISCUSS] Replace table transaction in REST Catalog

Ryan Blue Thu, 26 Jun 2025 14:49:36 -0700

I think that we should definitely fix the partition spec bug. This is one
reason why we lazily bind specs in newer implementations, and there are
other ways to get into this situation because the table state where an
older partition spec references a column ID that no longer exists is
perfectly valid.
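
To make the lazy-binding point concrete, here is a rough sketch in Python. It
is a toy model and not Iceberg's actual code: PartitionField, Schema,
bind_fields, and LazySpec are hypothetical names made up for this example. It
only shows why deferring the bind lets a table load when an old spec points at
a column ID that has since been dropped.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PartitionField:
    source_id: int   # ID of the column this partition field is derived from
    name: str
    transform: str


@dataclass(frozen=True)
class Schema:
    columns: Dict[int, str]  # column ID -> column name


def bind_fields(fields: List[PartitionField], schema: Schema) -> Dict[str, str]:
    """Resolve each partition field to its source column name.

    Raises ValueError if a field references a column ID the schema lacks.
    """
    bound = {}
    for f in fields:
        if f.source_id not in schema.columns:
            raise ValueError(f"no source column {f.source_id} for field {f.name}")
        bound[f.name] = schema.columns[f.source_id]
    return bound


class LazySpec:
    """Holds unbound fields; binding happens only when bind() is called."""

    def __init__(self, spec_id: int, fields: List[PartitionField]):
        self.spec_id = spec_id
        self.fields = fields

    def bind(self, schema: Schema) -> Dict[str, str]:
        return bind_fields(self.fields, schema)


# Column 1 was dropped from the table, but spec 0 still references it.
current_schema = Schema(columns={2: "event_ts"})
old_spec = LazySpec(0, [PartitionField(1, "category", "identity")])
current_spec = LazySpec(1, [PartitionField(2, "event_day", "day")])

# Loading every historical spec succeeds because nothing is bound yet.
specs = {s.spec_id: s for s in (old_spec, current_spec)}

# Only the spec that is actually needed gets bound to the current schema.
bound_current = specs[1].bind(current_schema)
```

Binding every spec eagerly at load time would raise on spec 0 in this model
even though the metadata itself is valid.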


There's also a lot of value in the current behavior of REPLACE TABLE. It is
intended to allow you to easily replace the SQL-visible aspects of the
table (schema and data) while maintaining table history for time travel and
not requiring the user to re-create other metadata, like access control
policies.
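
As an illustration of the difference being discussed, here is a minimal Python
sketch of the two semantics. It is an assumption-laden toy, not the REST
catalog or SDK implementation: TableMetadata, replace_keep_history, and
replace_drop_create are hypothetical names, and -1 is used here simply as a
"no current snapshot" marker.

```python
import dataclasses
import uuid
from dataclasses import dataclass
from typing import Tuple

NO_SNAPSHOT = -1  # toy marker for "table currently has no data"


@dataclass(frozen=True)
class TableMetadata:
    table_uuid: str
    schema: Tuple[str, ...]
    snapshots: Tuple[int, ...] = ()
    current_snapshot_id: int = NO_SNAPSHOT


def replace_keep_history(old: TableMetadata, new_schema: Tuple[str, ...]) -> TableMetadata:
    """Current behavior as described in this thread: same table identity,
    old snapshots retained for time travel, current snapshot reset."""
    return dataclasses.replace(
        old, schema=new_schema, current_snapshot_id=NO_SNAPSHOT
    )


def replace_drop_create(old: TableMetadata, new_schema: Tuple[str, ...]) -> TableMetadata:
    """Proposed alternative: a brand-new table with no link to the old one."""
    return TableMetadata(table_uuid=str(uuid.uuid4()), schema=new_schema)


old = TableMetadata(
    table_uuid=str(uuid.uuid4()),
    schema=("id", "category"),
    snapshots=(101, 102),
    current_snapshot_id=102,
)

kept = replace_keep_history(old, ("id", "name"))
dropped = replace_drop_create(old, ("id", "name"))
```

In this model, anything keyed on the table UUID (history, and by extension
things like access policies) survives the first variant and is lost in the
second.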

It seems to me that the problem is that we never fixed the partition bug,
which has been outstanding for a long time. I think the right path forward
is to get Fokko's fix in rather than changing behavior.

On Thu, Jun 26, 2025 at 10:07 AM Russell Spitzer <russell.spit...@gmail.com>
wrote:

> I left some comments on
> https://github.com/apache/iceberg/issues/12738#issuecomment-3009087235
> but I think this is probably a good time to revisit whether we want to keep
> the "truncate + maintain state" behavior or switch to "drop + create". I
> think this will probably require a breaking change in the SDK and REST Spec
> but the Iceberg behavior has caused bugs in the past as well and I'm not
> sure the benefits really outweigh the unexpected behavior for users coming
> from other systems.
>
> I would like to poll and see how many folks actually know about the
> Iceberg behavior we have now and are actively using it.
>
> On Thu, Jun 26, 2025 at 11:31 AM Maninderjit Singh <
> parmar.maninder...@gmail.com> wrote:
>
>> Thanks for the blazingly fast fix Fokko!
>>
>> The issue here is that it is hard to explain the replace behavior to end
>> users as well as implement it correctly for all the use cases and across
>> different database implementations that assume replace as drop+recreate.
>> Traditionally, to retain time travel information, users rely on the
>> TRUNCATE statement, which deletes all the data but keeps previous
>> versions available for time travel. This sets clear expectations
>> about the ability to read previous versions, tags, or branches.
>>
>> For the alternative replace implementation, the expectations are no
>> different from DROP TABLE: time travel is not possible and in-flight
>> transactions have to abort. This makes the behavior easy to explain,
>> easy to implement, and consistent with DROP TABLE.
>>
>> Let me know what you think. Also, I would like to hear from other
>> database vendors about their experience explaining REPLACE semantics to
>> end users.
>>
>> Thanks,
>> Maninder
>>
>>
>> On Thu, Jun 26, 2025 at 12:10 AM Fokko Driesprong <fo...@apache.org>
>> wrote:
>>
>>> Hey Maninder,
>>>
>>> Thanks for raising this. For the issue that you're hitting
>>> <https://github.com/apache/iceberg/issues/12738>, I have a PR that
>>> fixes that <https://github.com/apache/iceberg/pull/11868>, and it would
>>> be great to get in (maybe for 1.10.0? :).
>>>
>>> I do see the confusion with the approach that Iceberg is taking, but the
>>> main point for me is that it keeps time travel possible. A client could
>>> still be reading an older version of the table, so we need to keep the
>>> metadata. Or a process could be reading a specific tag or branch of
>>> the table, which would be deleted with the approach you're suggesting.
>>>
>>> As suggested in the prior discussion
>>> <https://lists.apache.org/thread/s0c2wjdztzsh7nf8wf570kycoxxpnql3>, I
>>> think it would be good to add this to the implementation notes.
>>>
>>> Kind regards,
>>> Fokko
>>>
>>> On Thu, Jun 26, 2025 at 05:18, Maninderjit Singh <
>>> parmar.maninder...@gmail.com> wrote:
>>>
>>>> Hi dev community,
>>>>
>>>> I'd like to open a discussion around the current REPLACE TABLE
>>>> transaction semantics, particularly how it's implemented via the REST
>>>> catalog, and propose clarifying the spec to allow more flexible semantics
>>>> and broader alignment with database and catalog vendors.
>>>>
>>>> Today, the replace table implementation
>>>> <https://github.com/apache/iceberg/blob/2cdff366982b30685f6410c290cbd16aed274caf/core/src/main/java/org/apache/iceberg/rest/RESTTableOperations.java#L129>
>>>> for the REST catalog uses the updateTable
>>>> <https://github.com/apache/iceberg/blob/2cdff366982b30685f6410c290cbd16aed274caf/open-api/rest-catalog-open-api.yaml#L997>
>>>> API, which retains the old table metadata (including UUID, schema,
>>>> properties, and snapshots) while resetting the current snapshot to -1.
>>>> From a SQL point of view, this acts as a (TRUNCATE [1
>>>> <https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-truncate-table.html>
>>>> ][2 <https://docs.snowflake.com/en/sql-reference/sql/truncate-table>][3
>>>> <https://trino.io/docs/current/sql/truncate.html>][4
>>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/truncate/>]
>>>> + optional schema evolution) rather than the commonly understood CREATE OR
>>>> REPLACE TABLE behavior in other database systems, which define it as a
>>>> drop and re-creation of the table (e.g., DuckDB
>>>> <https://duckdb.org/docs/stable/sql/statements/create_table.html#:~:text=The%20CREATE%20OR%20REPLACE%20syntax,then%20creating%20the%20new%20one.>,
>>>> Flink
>>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/>,
>>>> Snowflake
>>>> <https://docs.snowflake.com/en/sql-reference/sql/create-table>,
>>>> BigQuery
>>>> <https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_table_statement>,
>>>> ClickHouse
>>>> <https://clickhouse.com/docs/sql-reference/statements/create/table>
>>>> etc.).
>>>> Similarly, CREATE OR REPLACE for other entities like views typically
>>>> implies dropping and recreating the view [1
>>>> <https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-view.html>
>>>> ][2 <https://docs.snowflake.com/en/sql-reference/sql/create-view>][3
>>>> <https://trino.io/docs/current/sql/create-view.html>]
>>>>
>>>> There has been prior discussion
>>>> <https://lists.apache.org/thread/bz4ohd1jbzctzdjlkn8d8fol3hzcfr29>
>>>> regarding this issue, where the behavior of Iceberg's REST catalog was
>>>> defined as "CREATE AND UPDATE", but there wasn't a clear consensus on
>>>> whether this should be the only allowed behavior. There were also requests
>>>> to allow additional flexibility, since the current behavior is non-obvious
>>>> to many database vendors and introduces more complexity than the REST
>>>> catalog requires. For example, this GitHub issue
>>>> <https://github.com/apache/iceberg/issues/12738> shows how subtle bugs
>>>> can still exist with this implementation. I was able to reduce it to a
>>>> simple three-statement example that would break existing implementations.
>>>>
>>>> My ask of the community is to consider allowing a more permissive
>>>> definition of replace table, where database vendors could choose to
>>>> implement it as drop + create. In the future, we can consider creating a
>>>> separate endpoint to do drop+create atomically in the REST catalog. This
>>>> behavior has a few advantages: it is very easy to explain to the end
>>>> user, easier to implement for all cases, and more aligned with many
>>>> other database systems. Users who want to retain their snapshot
>>>> history can still use TRUNCATE to get the desired behavior.
>>>>
>>>> I’d love to hear the community’s thoughts on whether we can update the
>>>> implementation notes in the spec to allow having drop+create as an
>>>> acceptable behavior.
>>>>
>>>> Thanks,
>>>> Maninder
>>>>
