Re: Subject: [DISCUSS] Idempotency-Key design for Iceberg REST: converging on Model B

Dmitri Bourlatchkov Tue, 02 Jun 2026 06:59:14 -0700

Hi Huaxin,

Good point about handling DELETE idempotently!


However, I wonder whether it is a critical use case?

Do you expect DELETE to benefit a lot from the Idempotency Key?

I'd think it should be fairly straightforward for the client to reload the
table to be deleted in case of failures, discover that it is gone, and not
retry. WDYT?

There's still the question of whether the client is deleting the table it
actually intends to delete. Another client could delete the current table
and create a new table under the same name while the first client is
"deliberating". The IRC API does not provide for unique table
identification in DELETE operations, as far as I know. The operation is
invoked simply on the name, which can map to different physical tables at
different times. Adding Idempotency Keys does not help in this context, I
think.
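
A minimal Python sketch of the client-side reconciliation described above, with the name-reuse hazard made visible. Everything here is hypothetical: `InMemoryCatalog`, `drop_with_reconcile`, and the per-table UUID lookup stand in for a real catalog client and are not part of the IRC API.

```python
import uuid


class NotFound(Exception):
    pass


class InMemoryCatalog:
    """Hypothetical stand-in for a catalog; names map to table UUIDs."""

    def __init__(self):
        self.tables = {}
        self.fail_next_drop_response = False

    def create_table(self, name):
        self.tables[name] = str(uuid.uuid4())
        return self.tables[name]

    def load_table(self, name):
        if name not in self.tables:
            raise NotFound(name)
        return self.tables[name]

    def drop_table(self, name):
        if name not in self.tables:
            raise NotFound(name)
        del self.tables[name]
        if self.fail_next_drop_response:
            # Simulate a drop that was applied but whose response was lost.
            self.fail_next_drop_response = False
            raise TimeoutError("response lost after the drop was applied")


def drop_with_reconcile(catalog, name):
    """Drop by name; on an ambiguous failure, reload instead of retrying blindly."""
    expected_uuid = catalog.load_table(name)
    try:
        catalog.drop_table(name)
        return "dropped"
    except TimeoutError:
        try:
            current_uuid = catalog.load_table(name)
        except NotFound:
            return "dropped"  # the table is gone, so do not retry
        if current_uuid != expected_uuid:
            return "replaced"  # same name, different table: do not drop it
        return "retry"  # the original table is still there
```

The UUID comparison only tells the client what happened after the fact. The drop itself is still issued by name, so it cannot be pinned to one physical table.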

Thanks,
Dmitri.

On Mon, Jun 1, 2026 at 9:59 PM huaxin gao <[email protected]> wrote:

> Hi Dmitri,
>
> I like the idea — the atomic key write closes the in-flight gap, and it
> avoids the Iceberg metadata and spec issues. Agreed too that losing keys
> on already-deleted entities is harmless.
>
> But I think the harder case is delete operations themselves. For drop
> table/view/namespace, the operation removes the entity, so there is no
> surviving entity to hold the key. A retry of a successful drop should
> return an equivalent success, but with entity-property storage the key
> has nowhere to live — so the retry would just see "not found" and behave
> differently. Where would a drop's key live in this model?
>
> Thanks,
> Huaxin
>
> On Mon, Jun 1, 2026 at 6:13 PM Yufei Gu <[email protected]> wrote:
>
> > One concern I have with storing idempotency records as entity properties
> is
> > the potential performance impact. Over time, an entity could have a large
> > number of idempotency key/value pairs. That would increase the entity's
> > size, which may affect load, update, serialization, and caching costs for
> > normal catalog operations, even when idempotency is not involved. Use
> cases
> > such as table loading and entity in-memory caching could be affected.
> > Before moving in that direction, I think it would be useful to better
> > understand and measure the performance implications. If the entity size
> > growth turns out to be negligible in practice, the approach may still be
> > attractive because of its transactional simplicity.
> >
> > Yufei
> >
> >
> > On Mon, Jun 1, 2026 at 2:17 PM Dmitri Bourlatchkov <[email protected]>
> > wrote:
> >
> > > Hi Huaxin,
> > >
> > > How about storing idempotency keys in the Polaris Entity properties
> (not
> > > Iceberg metadata)?
> > >
> > > I understand that entities can be deleted thus discarding previously
> > > recorded keys, but based on the use cases discussed so far, it does not
> > > look like deleted entities should be a functional concern.
> > >
> > > Storing idempotency keys inside the entity will ensure that their
> updates
> > > are processed in the same logical change set as the entity changes from
> > the
> > > IRC request payload.
> > >
> > > This will ensure uniform operations across all Persistence
> > implementations
> > > and will not require any Idempotency-specific Persistence changes.
> > >
> > > WDYT?
> > >
> > > Thanks,
> > > Dmitri.
> > >
> > > On Sun, May 31, 2026 at 2:35 PM huaxin gao <[email protected]>
> > wrote:
> > >
> > > > Hi Dmitri, Robert,
> > > >
> > > > Thanks both.
> > > >
> > > > Dmitri — I agree with both of your points.
> > > >
> > > >   - Idempotency storage will stay separate from the metastore. It
> will
> > > >     be separate in code and in transactions. We make the idempotency
> > > >     decision before the handler runs, or after it commits — never
> > inside
> > > >     the metastore transaction.
> > > >   - I'll document the assumption you raised. Model B is only as
> strict
> > as
> > > >     the spec wants if the client builds the request so that at most
> one
> > > >     try can commit (for example, update requirements). The catalog's
> > > >     optimistic concurrency makes sure of this. Model B just records
> the
> > > >     result on top of it. I'll say this clearly in the Polaris docs.
> > > >
> > > > Robert — I see why the operation-id-in-metadata idea is appealing. If
> > we
> > > > write the id inside the commit, it is atomic with the change. That
> > would
> > > > close the in-flight gap for table and view operations. That is a real
> > > > plus.
> > > >
> > > > But I don't think we should put the idempotency key in table
> metadata.
> > > > Here is why:
> > > >
> > > > 1. It only works for table and view operations. It can't help
> namespace
> > > >    operations, grants, or other writes. A separate store handles all
> of
> > > >    them with one mechanism.
> > > >
> > > > 2. It mixes two concerns. Idempotency is a REST/catalog concern.
> Table
> > > >    metadata should describe the table — schema, snapshots,
> > partitioning,
> > > >    sort order. A per-request id is not table state. I'd rather not
> mix
> > > >    the two.
> > > >
> > > > 3. It bloats the metadata. To support retries we'd have to keep
> > > >    operation-ids with some retention/TTL. metadata.json is rewritten
> on
> > > >    every commit and read on every table load. For tables with many
> > > >    writes, this adds real cost. And every client and engine that
> reads
> > > >    the table pays it, not just the idempotency path.
> > > >
> > > > 4. It doesn't match the spec. The Iceberg REST spec defines
> idempotency
> > > >    at the protocol layer — an Idempotency-Key header with a
> server-side
> > > >    contract. It does not store idempotency in table metadata. Putting
> > an
> > > >    operation-id there would be a new mechanism that isn't in the spec
> > > >    today. So it's a change to how the spec
> > > >    works, and a cross-project change too.
> > > >
> > > > So I'd prefer to keep the record in a separate idempotency store. We
> > > > accept the in-flight gap, but it is bounded. The catalog's optimistic
> > > > concurrency stops a duplicate commit from landing. And once a record
> > > > exists, retries replay cleanly.
> > > >
> > > > Thanks,
> > > > Huaxin
> > > >
> > > > On Sat, May 30, 2026 at 3:15 AM Robert Stupp <[email protected]> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Thanks for the clarifications. Russell's explanation is especially
> > > > useful.
> > > > > I agree, ambiguous request outcomes, for example, timeouts or
> network
> > > > > connections being reset, are hard to reason about.
> > > > >
> > > > > Clients often cannot reliably reconcile from the current state
> alone
> > > for
> > > > > table/view state mutating operations.
> > > > >
> > > > > I wonder whether the idempotency key should be recorded in the
> > > table/view
> > > > > metadata as an "operation-id", with an explicit retention
> guarantee,
> > > > maybe
> > > > > tied to a server-provided minimum TTL.
> > > > > The approach could reduce or change the role of a separate
> > > > > idempotency-record table and handling of it.
> > > > >
> > > > > Request handling could roughly look like this:
> > > > >   if the current history/metadata already contains that
> > "operation-id",
> > > > >     return equivalent-enough response without re-running the
> > operation.
> > > > >
> > > > >   try the committing operation:
> > > > >   if the commit succeeds:
> > > > >     record the "operation-id" in the table/view metadata, and
> > > > >     return the successful response.
> > > > >   if the commit runs into a conflict:
> > > > >     re-check whether the current metadata/history contains that
> > > > > "operation-id"
> > > > >     if so:
> > > > >       return equivalent-enough response.
> > > > >     otherwise:
> > > > >       return the conflict response.
> > > > >
> > > > > This is not perfect either and needs spec work, retention rules,
> and
> > > may
> > > > > only work for table and view operations.
> > > > >
> > > > > I mostly want to separate the questions:
> > > > > 1. What guarantees do clients actually need after an ambiguous
> > outcome?
> > > > > 2. Where should the durable evidence for the guarantee live?
> > > > >
> > > > > Robert
> > > > >
> > > > > On Sat, May 30, 2026 at 4:30 AM Dmitri Bourlatchkov <
> > [email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hi Russell,
> > > > > >
> > > > > > Thanks for the information! It clarifies the use case a lot (at
> > least
> > > > for
> > > > > > me :)
> > > > > >
> > > > > > In short, I'd say the main benefit is allowing clients to avoid
> > > > conflicts
> > > > > > (409) on re-submitting changes that got committed by the server
> > > without
> > > > > the
> > > > > > client receiving confirmation of the success.
> > > > > >
> > > > > > I believe the Iceberg REST Catalog spec [1] is formally stricter
> > than
> > > > > Model
> > > > > > B when it states "the server ensures no additional effects for
> > > requests
> > > > > > that carry the same Idempotency-Key". Since Model B permits
> request
> > > > > > re-execution, the possibility of additional side effects cannot
> be
> > > > ruled
> > > > > > out completely based on the proposed server-side algorithm alone.
> > The
> > > > > > server must assume that the client forms the (change) request in
> > > such a
> > > > > way
> > > > > > that only one execution attempt can succeed (e.g. by using
> "update
> > > > > > > requirements"). This is also mentioned in comments on the doc
> [2].
> > > > > >
> > > > > > This is probably worth mentioning in the Polaris docs related to
> > > > > > our Idempotency-Key implementation.
> > > > > >
> > > > > > Assuming this kind of cooperation on the client side, I believe
> > > Model B
> > > > > can
> > > > > > be considered compliant with the spec [1].
> > > > > >
> > > > > > In anticipation of fresh implementation PRs for this feature, I'd
> > > like
> > > > to
> > > > > > re-emphasize (IIRC I mentioned this before) that, I think, we
> > should
> > > > > avoid
> > > > > > coupling Idempotency persistence with MetaStore persistence (both
> > > > > code-wise
> > > > > > and transaction-wise). Model B processes Idempotency-related data
> > > > outside
> > > > > > the original change request's execution scope. Idempotency
> > decisions
> > > > are
> > > > > > made either before the request starts executing or after it is
> > > > committed
> > > > > to
> > > > > > the MetaStore.
> > > > > >
> > > > > > [1]
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/polaris/blob/4e4eaf840bf71d431b13034b0dd6f338261d8e8b/spec/iceberg-rest-catalog-open-api.yaml#L2098
> > > > > >
> > > > > > [2]
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1hqTejVyYXDpL5MJcVc7NyhCslKaGH82QoqMEcUYPvkE/edit?tab=t.0
> > > > > >
> > > > > > Cheers,
> > > > > > Dmitri.
> > > > > >
> > > > > > On Fri, May 29, 2026 at 8:26 PM Russell Spitzer <
> > > > > [email protected]
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > > The problem with a client attempting to determine whether its
> > > > > > > > operations succeeded via load table, and the reason all this
> > > > > > > > work has proceeded, is that the client has no guaranteed way
> > > > > > > > to determine whether a commit actually occurred. There are too
> > > > > > > > many legitimate mechanisms that erase history from an Iceberg
> > > > > > > > table for the client to be certain an operation occurred.
> > > > > > >
> > > > > > > For example, you could check if your snapshot exists in
> snapshot
> > > > > history
> > > > > > > but this could have been erased by expire snapshots.
> > > > > > >
> > > > > > > > Or you could check if the schema was modified according to your
> > > > > > > > update, but this too could have been undone by another
> > > > > > > > operation: Client A adds a column but gets a timeout, Client B
> > > > > > > > removes the column, Client A retries and adds the column again.
> > > > > > >
> > > > > > > > Because of this, the Iceberg client usually just bails out to
> > > > > > > > the user with an exception if it doesn’t get an actual
> > > > > > > > confirmation from the server that the commit succeeded. This
> > > > > > > > leaves the “can I retry or not” question as an exercise for
> > > > > > > > the end user.
> > > > > > >
> > > > > > > > In practice, Iceberg users work around this sort of thing by
> > > > > > > > adding all sorts of custom metadata, hoping to persist history
> > > > > > > > in the table itself in some way that can’t be touched by expire
> > > > > > > > snapshots, but this is usually very fragile and also relies on
> > > > > > > > all clients behaving well. I’ve seen folks use custom table
> > > > > > > > properties, for example “batch-5: committed”, and then have
> > > > > > > > their own retry logic check whether the property is set. Then,
> > > > > > > > of course, they also have to add a bunch of custom logic to
> > > > > > > > make sure they clean up this state as well.
> > > > > > >
> > > > > > > > This is why Iceberg added the Idempotency path in the first
> > > > > > > > place: it gives us a guaranteed way for clients to retry in
> > > > > > > > case of a network issue or catalog issue, with a guarantee that
> > > > > > > > they will not do duplicate work by retrying.
> > > > > > > With this in place the client can now cleanly retry (within the
> > > > > > idempotency
> > > > > > > window) the same operation over and over without throwing an
> > > > exception
> > > > > to
> > > > > > > the end user. Only in a situation where the catalog cannot
> > respond
> > > > > over a
> > > > > > > very long time will the user actually have to do some sort of
> > > > > > > reconciliation. You can look at the history of the Iceberg
> > client’s
> > > > > retry
> > > > > > > behavior with ambiguous server side or network errors to see
> how
> > > this
> > > > > has
> > > > > > > been a problem in the past.
> > > > > > >
> > > > > > > On Fri, May 29, 2026 at 1:24 PM huaxin gao <
> > [email protected]
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Robert,
> > > > > > > >
> > > > > > > > Thanks for your reply!
> > > > > > > >
> > > > > > > > You're right that Model B does not prevent duplicate
> execution.
> > > The
> > > > > > > > record is written only after success. So if a client times
> out
> > > > while
> > > > > > the
> > > > > > > > first request is still running, a retry can run the handler
> > > again.
> > > > > > There
> > > > > > > > is no record yet to stop it. So Model B is "remember and
> > replay a
> > > > > > > > successful result," not "run exactly once."
> > > > > > > >
> > > > > > > > On the trade-off: Model A gives a stronger guarantee, but it
> > > needs
> > > > > > > > reserve/heartbeat/purge state, which adds complexity and
> > > overhead.
> > > > > > Model
> > > > > > > > B is simpler and cheaper. The window it leaves open is small,
> > > and a
> > > > > > > > client only retries after a timeout, so racing first requests
> > > > should
> > > > > be
> > > > > > > > rare in practice. Every design is a trade-off, and my view is
> > > that
> > > > > > Model
> > > > > > > > B is the right one here.
> > > > > > > >
> > > > > > > > > It also helps to be clear about where duplicate-work protection
> > > > > > > > > really comes from. It comes from the catalog itself, not from
> > > > > > > > > idempotency. The catalog uses optimistic concurrency. If two
> > > > > > > > > first attempts race, at most one commit wins and the other gets
> > > > > > > > > a 409. Idempotency sits on top of that. It does not replace it.
> > > > > > > >
> > > > > > > > So what does Model B add over "the client just calls
> loadTable
> > > and
> > > > > > > > reconciles"? Two things that I think are real:
> > > > > > > >
> > > > > > > >   1. The 422 check. loadTable can tell a client that a table
> > > > exists.
> > > > > It
> > > > > > > >      cannot tell the client that the table THEY created with
> > THIS
> > > > key
> > > > > > is
> > > > > > > >      the one that succeeded. The record binds the key to
> > > > (principal,
> > > > > > > >      operation, resource). If the same key is reused for a
> > > > different
> > > > > > > >      request, the server returns 422. The client cannot
> detect
> > > this
> > > > > on
> > > > > > > >      its own.
> > > > > > > >
> > > > > > > >   2. One server-side behavior for all mutating ops.
> > create-table
> > > > > > happens
> > > > > > > >      to reconcile cleanly with loadTable. But the point of
> the
> > > > > > > >      Idempotency-Key header is that the client should not
> have
> > to
> > > > > write
> > > > > > > >      reconciliation logic for every operation. For a known
> key,
> > > the
> > > > > > > >      server turns what would be a 409 into an equivalent 2xx
> > > > replay.
> > > > > > The
> > > > > > > >      client gets a clean success instead of an error it has
> to
> > > > > special-
> > > > > > > >      case.
> > > > > > > >
> > > > > > > > There is a third, weaker benefit: once a record exists,
> retries
> > > > stop
> > > > > > > > seeing flip-flopping results. But that only helps after a
> > record
> > > > > > exists,
> > > > > > > > which is exactly the window you pointed out is unprotected.
> > > > > > > >
> > > > > > > > So I'll correct my earlier wording. This is not convergence
> on
> > > > > exactly-
> > > > > > > > once idempotency. It is a narrower guarantee: replay a
> recorded
> > > > > result,
> > > > > > > > plus detect key misuse. It sits on top of the catalog's
> > existing
> > > > > > > > concurrency control. The real question for the list is
> simple:
> > is
> > > > > that
> > > > > > > > narrower guarantee worth shipping on its own? Or do we need
> > Model
> > > > A's
> > > > > > > > in-flight protection to have a strong idempotency guarantee?
> > > > > > > >
> > > > > > > > My view is that the narrow version is worth it for now: it's
> > the
> > > > > > > > behavior the spec asks for, the 422 check can't be done
> > > > client-side,
> > > > > > and
> > > > > > > > it's a small change we can strengthen toward Model A later
> if a
> > > > real
> > > > > > use
> > > > > > > > case needs it. Happy to hear what others think.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Huaxin
> > > > > > > >
> > > > > > > > On Fri, May 29, 2026 at 7:36 AM Robert Stupp <[email protected]
> >
> > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Huaxin,
> > > > > > > > >
> > > > > > > > > Thanks for writing this up and moving the design discussion
> > > back
> > > > to
> > > > > > > dev@
> > > > > > > > .
> > > > > > > > >
> > > > > > > > > Since you’re asking before locking in the implementation, I
> > > think
> > > > > we
> > > > > > > > should
> > > > > > > > > clarify one point.
> > > > > > > > >
> > > > > > > > > Model B is certainly simpler than the lease-based approach,
> > but
> > > > I’m
> > > > > > not
> > > > > > > > > sure I fully understand what problem it still solves.
> > > > > > > > >
> > > > > > > > > As I read it, if a client times out while the original
> > request
> > > is
> > > > > > still
> > > > > > > > > running, a retry with the same key may not see an
> idempotency
> > > > > record
> > > > > > > yet
> > > > > > > > > and could run the handler again.
> > > > > > > > > So this feels less like preventing duplicate execution and
> > more
> > > > > like
> > > > > > > > > remembering a successful result after the fact.
> > > > > > > > >
> > > > > > > > > For the create-table case, couldn’t a client achieve
> roughly
> > > the
> > > > > same
> > > > > > > > > recovery by calling loadTable after an ambiguous timeout
> and
> > > > > > > reconciling
> > > > > > > > > from there?
> > > > > > > > > Since Model B also rebuilds the response from current
> catalog
> > > > > state,
> > > > > > > I’m
> > > > > > > > > trying to understand what it gives us beyond that.
> > > > > > > > >
> > > > > > > > > I’m not against simplifying the design, but I think we
> should
> > > be
> > > > > > clear
> > > > > > > > > about the narrower guarantee before calling this
> convergence.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Robert
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, May 29, 2026 at 12:29 AM huaxin gao <
> > > > > [email protected]>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > I've simplified the proposed design for Idempotency-Key
> > > support
> > > > > in
> > > > > > > > > Polaris
> > > > > > > > > > (Iceberg REST spec — retries with the same key must not
> > > produce
> > > > > > > > > additional
> > > > > > > > > > side effects), and I'd like a wider review before
> updating
> > > the
> > > > > > > > > > implementation PR (#4269 <
> > > > > > > https://github.com/apache/polaris/pull/4269
> > > > > > > > >).
> > > > > > > > > >
> > > > > > > > > > What changed
> > > > > > > > > >
> > > > > > > > > >   - Before (Model A, lease-based): reserve an idempotency
> > row
> > > > > > before
> > > > > > > > > doing
> > > > > > > > > > work → IN_PROGRESS / heartbeat → finalize after.
> > > > > > > > > >   - After (Model B, optimistic commit): run the handler
> > > first →
> > > > > > > record
> > > > > > > > > only
> > > > > > > > > > after a successful (2xx) outcome. The record stores
> > binding +
> > > > > > status,
> > > > > > > > not
> > > > > > > > > > the HTTP response body. Retries with the same key
> re-derive
> > > an
> > > > > > > > equivalent
> > > > > > > > > > response from current catalog state
> > > > > > > > > >     instead of replaying a stored payload.
> > > > > > > > > >
> > > > > > > > > > The design doc still compares Model A and Model B
> > > side-by-side
> > > > so
> > > > > > the
> > > > > > > > > > trade-offs are explicit. So far the discussion has been
> > > leaning
> > > > > > > toward
> > > > > > > > > > Model B — mutating REST operations only, 2xx-only
> > > persistence,
> > > > no
> > > > > > > > > > response-body storage, and the known
> > > > > > > > > > trade-offs (e.g. concurrent first-request races; see the
> > > NOTES
> > > > > > > section
> > > > > > > > in
> > > > > > > > > > the doc).
> > > > > > > > > >
> > > > > > > > > > Does this direction look right before we lock in the
> > > > > > implementation?
> > > > > > > > > >
> > > > > > > > > > Comments on the doc
> > > > > > > > > > <
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1hqTejVyYXDpL5MJcVc7NyhCslKaGH82QoqMEcUYPvkE/edit?tab=t.0
> > > > > > > > > > >
> > > > > > > > > > or replies on this thread both work.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Huaxin
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
