Re: Subject: [DISCUSS] Idempotency-Key design for Iceberg REST: converging on Model B

huaxin gao Mon, 01 Jun 2026 18:58:07 -0700

Hi Dmitri,

I like the idea — the atomic key write closes the in-flight gap, and it
avoids the Iceberg metadata and spec issues. Agreed too that losing keys
on already-deleted entities is harmless.


But I think the harder case is delete operations themselves. For drop
table/view/namespace, the operation removes the entity, so there is no
surviving entity to hold the key. A retry of a successful drop should
return an equivalent success, but with entity-property storage the key
has nowhere to live — so the retry would just see "not found" and behave
differently. Where would a drop's key live in this model?
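
To make the question concrete, here is a rough Python sketch of the drop case. All names in it (`Catalog`, `create_table`, `drop_table`, the `idempotency-keys` property) are hypothetical and not Polaris code; the only assumption is that keys live in the entity's own properties, as proposed.

```python
# Hypothetical sketch: idempotency keys stored as entity properties.
# None of these names come from Polaris; they only illustrate the drop case.

class NotFound(Exception):
    """Raised when the target entity does not exist."""


class Catalog:
    def __init__(self):
        # entity name -> properties dict (keys kept under "idempotency-keys")
        self.entities = {}

    def create_table(self, name, key):
        if name in self.entities:
            if key in self.entities[name]["idempotency-keys"]:
                return "created"  # replay: the key survives on the entity
            raise RuntimeError("conflict")
        self.entities[name] = {"idempotency-keys": {key}}
        return "created"

    def drop_table(self, name, key):
        if name not in self.entities:
            # The entity and every key stored on it are gone, so a retry of a
            # successful drop looks the same as a drop of a missing table.
            raise NotFound(name)
        del self.entities[name]
        return "dropped"


catalog = Catalog()
catalog.create_table("t", key="k1")
create_retry = catalog.create_table("t", key="k1")  # replays cleanly

first = catalog.drop_table("t", key="k2")
try:
    retry = catalog.drop_table("t", key="k2")
except NotFound:
    retry = "not found"
```

In this sketch the create retry replays because the entity still carries `k1`, while the drop retry has nothing left to consult.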

Thanks,
Huaxin

On Mon, Jun 1, 2026 at 6:13 PM Yufei Gu <[email protected]> wrote:

> One concern I have with storing idempotency records as entity properties is
> the potential performance impact. Over time, an entity could have a large
> number of idempotency key/value pairs. That would increase the entity's
> size, which may affect load, update, serialization, and caching costs for
> normal catalog operations, even when idempotency is not involved. Use cases
> such as table loading and entity in-memory caching could be affected.
> Before moving in that direction, I think it would be useful to better
> understand and measure the performance implications. If the entity size
> growth turns out to be negligible in practice, the approach may still be
> attractive because of its transactional simplicity.
>
> Yufei
>
>
> On Mon, Jun 1, 2026 at 2:17 PM Dmitri Bourlatchkov <[email protected]>
> wrote:
>
> > Hi Huaxin,
> >
> > How about storing idempotency keys in the Polaris Entity properties (not
> > Iceberg metadata)?
> >
> > I understand that entities can be deleted thus discarding previously
> > recorded keys, but based on the use cases discussed so far, it does not
> > look like deleted entities should be a functional concern.
> >
> > Storing idempotency keys inside the entity will ensure that their updates
> > are processed in the same logical change set as the entity changes from
> the
> > IRC request payload.
> >
> > This will ensure uniform operations across all Persistence
> implementations
> > and will not require any Idempotency-specific Persistence changes.
> >
> > WDYT?
> >
> > Thanks,
> > Dmitri.
> >
> > On Sun, May 31, 2026 at 2:35 PM huaxin gao <[email protected]>
> wrote:
> >
> > > Hi Dmitri, Robert,
> > >
> > > Thanks both.
> > >
> > > Dmitri — I agree with both of your points.
> > >
> > >   - Idempotency storage will stay separate from the metastore. It will
> > >     be separate in code and in transactions. We make the idempotency
> > >     decision before the handler runs, or after it commits — never
> inside
> > >     the metastore transaction.
> > >   - I'll document the assumption you raised. Model B is only as strict
> as
> > >     the spec wants if the client builds the request so that at most one
> > >     try can commit (for example, update requirements). The catalog's
> > >     optimistic concurrency makes sure of this. Model B just records the
> > >     result on top of it. I'll say this clearly in the Polaris docs.
> > >
> > > Robert — I see why the operation-id-in-metadata idea is appealing. If
> we
> > > write the id inside the commit, it is atomic with the change. That
> would
> > > close the in-flight gap for table and view operations. That is a real
> > > plus.
> > >
> > > But I don't think we should put the idempotency key in table metadata.
> > > Here is why:
> > >
> > > 1. It only works for table and view operations. It can't help namespace
> > >    operations, grants, or other writes. A separate store handles all of
> > >    them with one mechanism.
> > >
> > > 2. It mixes two concerns. Idempotency is a REST/catalog concern. Table
> > >    metadata should describe the table — schema, snapshots,
> partitioning,
> > >    sort order. A per-request id is not table state. I'd rather not mix
> > >    the two.
> > >
> > > 3. It bloats the metadata. To support retries we'd have to keep
> > >    operation-ids with some retention/TTL. metadata.json is rewritten on
> > >    every commit and read on every table load. For tables with many
> > >    writes, this adds real cost. And every client and engine that reads
> > >    the table pays it, not just the idempotency path.
> > >
> > > 4. It doesn't match the spec. The Iceberg REST spec defines idempotency
> > >    at the protocol layer — an Idempotency-Key header with a server-side
> > > >    contract. It does not store idempotency in table metadata. Putting an
> > > >    operation-id there would be a new mechanism that isn't in the spec
> > > >    today. So it's a change to how the spec works, and a cross-project
> > > >    change too.
> > >
> > > So I'd prefer to keep the record in a separate idempotency store. We
> > > accept the in-flight gap, but it is bounded. The catalog's optimistic
> > > concurrency stops a duplicate commit from landing. And once a record
> > > exists, retries replay cleanly.
> > >
> > > Thanks,
> > > Huaxin
> > >
> > > On Sat, May 30, 2026 at 3:15 AM Robert Stupp <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Thanks for the clarifications. Russell's explanation is especially
> > > useful.
> > > > I agree, ambiguous request outcomes, for example, timeouts or network
> > > > connections being reset, are hard to reason about.
> > > >
> > > > Clients often cannot reliably reconcile from the current state alone
> > for
> > > > table/view state mutating operations.
> > > >
> > > > I wonder whether the idempotency key should be recorded in the
> > table/view
> > > > metadata as an "operation-id", with an explicit retention guarantee,
> > > maybe
> > > > tied to a server-provided minimum TTL.
> > > > The approach could reduce or change the role of a separate
> > > > idempotency-record table and handling of it.
> > > >
> > > > Request handling could roughly look like this:
> > > > >   if the current history/metadata already contains that "operation-id",
> > > > >     return equivalent-enough response without re-running the operation.
> > > > >
> > > > >   try the committing operation:
> > > > >   if the commit succeeds:
> > > > >     record the "operation-id" in the table/view metadata, and
> > > > >     return the successful response.
> > > > >   if the commit runs into a conflict:
> > > > >     re-check whether the current metadata/history contains that
> > > > >     "operation-id"
> > > >     if so:
> > > >       return equivalent-enough response.
> > > >     otherwise:
> > > >       return the conflict response.
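
The request handling outlined above can be sketched in Python roughly as follows. This is only an illustration under assumptions: `Table`, `Conflict`, `commit`, and `handle` are made-up names, not Iceberg or Polaris APIs, and the retention/TTL of operation ids is left out entirely.

```python
# Hypothetical sketch of the operation-id flow; not an Iceberg or Polaris API.

class Conflict(Exception):
    """Raised when the expected metadata version is stale."""


class Table:
    def __init__(self):
        self.version = 0
        self.operation_ids = set()  # would need retention/TTL in practice
        self.columns = []

    def commit(self, expected_version, column, operation_id):
        # The id is written in the same atomic step as the change itself.
        if expected_version != self.version:
            raise Conflict()
        self.columns.append(column)
        self.operation_ids.add(operation_id)
        self.version += 1


def handle(table, expected_version, column, operation_id):
    if operation_id in table.operation_ids:
        return 200  # already applied: equivalent-enough response
    try:
        table.commit(expected_version, column, operation_id)
        return 200
    except Conflict:
        # Re-check: did an earlier attempt with this id win in the meantime?
        return 200 if operation_id in table.operation_ids else 409


table = Table()
first = handle(table, 0, "c1", "op-1")
retry = handle(table, 0, "c1", "op-1")  # recognized by id, not re-run
other = handle(table, 0, "c2", "op-2")  # stale version, unknown id
```

The retry succeeds without re-running the commit, while an unrelated request against the stale version still gets the conflict. The re-check branch only matters when attempts overlap, which this single-threaded sketch does not exercise.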
> > > >
> > > > This is not perfect either and needs spec work, retention rules, and
> > may
> > > > only work for table and view operations.
> > > >
> > > > I mostly want to separate the questions:
> > > > 1. What guarantees do clients actually need after an ambiguous
> outcome?
> > > > 2. Where should the durable evidence for the guarantee live?
> > > >
> > > > Robert
> > > >
> > > > On Sat, May 30, 2026 at 4:30 AM Dmitri Bourlatchkov <
> [email protected]>
> > > > wrote:
> > > >
> > > > > Hi Russell,
> > > > >
> > > > > Thanks for the information! It clarifies the use case a lot (at
> least
> > > for
> > > > > me :)
> > > > >
> > > > > In short, I'd say the main benefit is allowing clients to avoid
> > > conflicts
> > > > > (409) on re-submitting changes that got committed by the server
> > without
> > > > the
> > > > > client receiving confirmation of the success.
> > > > >
> > > > > > I believe the Iceberg REST Catalog spec [1] is formally stricter than
> > > > > > Model B when it states "the server ensures no additional effects for
> > > > > > requests that carry the same Idempotency-Key". Since Model B permits
> > > > > > request re-execution, the possibility of additional side effects
> > > > > > cannot be ruled out completely based on the proposed server-side
> > > > > > algorithm alone. The server must assume that the client forms the
> > > > > > (change) request in such a way that only one execution attempt can
> > > > > > succeed (e.g. by using "update requirements"). This is also mentioned
> > > > > > in comments on the doc [2].
> > > > >
> > > > > This is probably worth mentioning in the Polaris docs related to
> > > > > our Idempotency-Key implementation.
> > > > >
> > > > > Assuming this kind of cooperation on the client side, I believe
> > Model B
> > > > can
> > > > > be considered compliant with the spec [1].
> > > > >
> > > > > In anticipation of fresh implementation PRs for this feature, I'd
> > like
> > > to
> > > > > re-emphasize (IIRC I mentioned this before) that, I think, we
> should
> > > > avoid
> > > > > coupling Idempotency persistence with MetaStore persistence (both
> > > > code-wise
> > > > > and transaction-wise). Model B processes Idempotency-related data
> > > outside
> > > > > the original change request's execution scope. Idempotency
> decisions
> > > are
> > > > > made either before the request starts executing or after it is
> > > committed
> > > > to
> > > > > the MetaStore.
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/polaris/blob/4e4eaf840bf71d431b13034b0dd6f338261d8e8b/spec/iceberg-rest-catalog-open-api.yaml#L2098
> > > > >
> > > > > [2]
> > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1hqTejVyYXDpL5MJcVc7NyhCslKaGH82QoqMEcUYPvkE/edit?tab=t.0
> > > > >
> > > > > Cheers,
> > > > > Dmitri.
> > > > >
> > > > > On Fri, May 29, 2026 at 8:26 PM Russell Spitzer <
> > > > [email protected]
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > > The problem with a client attempting to determine whether its
> > > > > > > operations succeeded via load table, and the reason all this work
> > > > > > > has proceeded, is that there is no guaranteed way for a client to
> > > > > > > actually determine if a commit occurred. There are too many
> > > > > > > legitimate mechanisms to erase history from an Iceberg table to
> > > > > > > guarantee an operation occurred.
> > > > > >
> > > > > > > For example, you could check if your snapshot exists in snapshot
> > > > > > > history, but this could have been erased by expire snapshots.
> > > > > > >
> > > > > > > Or you could check if the schema was modified according to your
> > > > > > > update, but this too could have been undone by another operation:
> > > > > > > Client A adds a column but gets a timeout, Client B removes the
> > > > > > > column, Client A retries and adds the column again.
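
The add/remove/retry sequence above can be replayed as a tiny Python simulation. It is purely illustrative: the in-memory `columns` set and the helper functions are assumptions standing in for a real table schema and real commits.

```python
# Hypothetical simulation of the add/remove/retry sequence; no real catalog.

columns = {"id"}
log = []  # ground truth of what the server actually committed


def add_column(client, name):
    columns.add(name)
    log.append((client, "add", name))


def remove_column(client, name):
    columns.discard(name)
    log.append((client, "remove", name))


add_column("A", "price")     # commits, but A's response times out
remove_column("B", "price")  # B legitimately undoes the change

# A inspects the current state to decide whether its commit happened.
looks_uncommitted = "price" not in columns
if looks_uncommitted:
    add_column("A", "price")  # the retry re-applies work and overrides B

adds_by_a = sum(1 for client, op, _ in log if client == "A" and op == "add")
```

Client A's state check reports "not committed" even though the commit landed, so the same change is applied twice and B's removal is silently undone.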
> > > > > >
> > > > > > > Because of this the Iceberg client usually just bails out to the
> > > > > > > user with an exception if it doesn’t get an actual confirmation
> > > > > > > that the commit succeeded from the server. This leaves the “can I
> > > > > > > retry or not” question as an exercise for the end user.
> > > > > >
> > > > > > > In practice, actual Iceberg users work around this sort of thing by
> > > > > > > adding all sorts of custom metadata to hopefully persist history in
> > > > > > > the table itself in some way that can’t be touched by expire
> > > > > > > snapshots, but this is usually very fragile and also relies on all
> > > > > > > clients behaving well. I’ve seen folks use custom table properties,
> > > > > > > for example “batch-5: committed”, then manually have their own retry
> > > > > > > logic check whether this property is set. Then, of course, they also
> > > > > > > have to add a bunch of custom logic to make sure they clean up this
> > > > > > > state as well.
> > > > > >
> > > > > > > This is why Iceberg added the Idempotency path in the first place:
> > > > > > > it gives us a guaranteed way for clients to retry in case of a
> > > > > > > network issue or catalog issue, with a guarantee they will not do
> > > > > > > duplicate work by retrying. With this in place the client can now
> > > > > > > cleanly retry (within the idempotency window) the same operation
> > > > > > > over and over without throwing an exception to the end user. Only
> > > > > > > in a situation where the catalog cannot respond over a very long
> > > > > > > time will the user actually have to do some sort of
> > > > > > > reconciliation. You can look at the history of the Iceberg client’s
> > > > > > > retry behavior with ambiguous server side or network errors to see
> > > > > > > how this has been a problem in the past.
> > > > > >
> > > > > > On Fri, May 29, 2026 at 1:24 PM huaxin gao <
> [email protected]
> > >
> > > > > wrote:
> > > > > >
> > > > > > > Hi Robert,
> > > > > > >
> > > > > > > Thanks for your reply!
> > > > > > >
> > > > > > > You're right that Model B does not prevent duplicate execution.
> > The
> > > > > > > record is written only after success. So if a client times out
> > > while
> > > > > the
> > > > > > > first request is still running, a retry can run the handler
> > again.
> > > > > There
> > > > > > > is no record yet to stop it. So Model B is "remember and
> replay a
> > > > > > > successful result," not "run exactly once."
> > > > > > >
> > > > > > > On the trade-off: Model A gives a stronger guarantee, but it
> > needs
> > > > > > > reserve/heartbeat/purge state, which adds complexity and
> > overhead.
> > > > > Model
> > > > > > > B is simpler and cheaper. The window it leaves open is small,
> > and a
> > > > > > > client only retries after a timeout, so racing first requests
> > > should
> > > > be
> > > > > > > rare in practice. Every design is a trade-off, and my view is
> > that
> > > > > Model
> > > > > > > B is the right one here.
> > > > > > >
> > > > > > > > It also helps to be clear about where duplicate-work protection
> > > > > > > > really comes from. It comes from the catalog itself, not from
> > > > > > > > idempotency. The catalog uses optimistic concurrency. If two first
> > > > > > > > attempts race, at most one commit wins and the other gets a 409.
> > > > > > > > Idempotency sits on top of that. It does not replace it.
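
As a rough illustration of that optimistic-concurrency point, the Python sketch below races two first attempts against a version check. The `Catalog` class and its `commit` method are hypothetical; a real catalog would do the compare-and-swap in its metastore rather than under an in-process lock.

```python
# Hypothetical sketch: two racing first attempts against a version check.
import threading


class Catalog:
    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0

    def commit(self, expected_version):
        # Compare-and-swap on the metadata version: only a caller holding
        # the current version may advance it.
        with self._lock:
            if self.version != expected_version:
                return 409
            self.version += 1
            return 200


catalog = Catalog()
results = []
results_lock = threading.Lock()


def attempt():
    status = catalog.commit(expected_version=0)
    with results_lock:
        results.append(status)


threads = [threading.Thread(target=attempt) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Whichever thread runs first, exactly one attempt sees the expected version and commits; the other gets the 409. No idempotency record is involved.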
> > > > > > >
> > > > > > > So what does Model B add over "the client just calls loadTable
> > and
> > > > > > > reconciles"? Two things that I think are real:
> > > > > > >
> > > > > > > >   1. The 422 check. loadTable can tell a client that a table
> > > > > > > >      exists. It cannot tell the client that the table THEY created
> > > > > > > >      with THIS key is the one that succeeded. The record binds the
> > > > > > > >      key to (principal, operation, resource). If the same key is
> > > > > > > >      reused for a different request, the server returns 422. The
> > > > > > > >      client cannot detect this on its own.
> > > > > > > >
> > > > > > > >   2. One server-side behavior for all mutating ops. create-table
> > > > > > > >      happens to reconcile cleanly with loadTable. But the point of
> > > > > > > >      the Idempotency-Key header is that the client should not have
> > > > > > > >      to write reconciliation logic for every operation. For a known
> > > > > > > >      key, the server turns what would be a 409 into an equivalent
> > > > > > > >      2xx replay. The client gets a clean success instead of an
> > > > > > > >      error it has to special-case.
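
A minimal Python sketch of the binding check behind both points might look like this. The `Binding` fields mirror the (principal, operation, resource) tuple named above; the `records` dict, the `classify` function, and its return labels are assumptions made for illustration, not the Polaris implementation.

```python
# Hypothetical sketch of the key-to-binding check; names are illustrative.
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    principal: str
    operation: str
    resource: str


records = {}  # idempotency key -> Binding, written only after a 2xx


def classify(key, binding):
    recorded = records.get(key)
    if recorded is None:
        return "execute"     # unknown key: run the handler normally
    if recorded == binding:
        return "replay"      # same request: answer with an equivalent 2xx
    return "reject-422"      # same key, different request: key misuse


records["k1"] = Binding("alice", "create-table", "ns.t1")
same = classify("k1", Binding("alice", "create-table", "ns.t1"))
reused = classify("k1", Binding("alice", "create-table", "ns.t2"))
fresh = classify("k2", Binding("alice", "create-table", "ns.t2"))
```

Only the server holds `records`, which is why the mismatch case cannot be detected from the client side.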
> > > > > > >
> > > > > > > There is a third, weaker benefit: once a record exists, retries
> > > stop
> > > > > > > seeing flip-flopping results. But that only helps after a
> record
> > > > > exists,
> > > > > > > which is exactly the window you pointed out is unprotected.
> > > > > > >
> > > > > > > > So I'll correct my earlier wording. This is not convergence on
> > > > > > > > exactly-once idempotency. It is a narrower guarantee: replay a
> > > > > > > > recorded result, plus detect key misuse. It sits on top of the
> > > > > > > > catalog's existing concurrency control. The real question for the
> > > > > > > > list is simple: is that narrower guarantee worth shipping on its
> > > > > > > > own? Or do we need Model A's in-flight protection to have a strong
> > > > > > > > idempotency guarantee?
> > > > > > >
> > > > > > > My view is that the narrow version is worth it for now: it's
> the
> > > > > > > behavior the spec asks for, the 422 check can't be done
> > > client-side,
> > > > > and
> > > > > > > it's a small change we can strengthen toward Model A later if a
> > > real
> > > > > use
> > > > > > > case needs it. Happy to hear what others think.
> > > > > > >
> > > > > > > Best,
> > > > > > > Huaxin
> > > > > > >
> > > > > > > On Fri, May 29, 2026 at 7:36 AM Robert Stupp <[email protected]>
> > > wrote:
> > > > > > >
> > > > > > > > Hi Huaxin,
> > > > > > > >
> > > > > > > > Thanks for writing this up and moving the design discussion
> > back
> > > to
> > > > > > dev@
> > > > > > > .
> > > > > > > >
> > > > > > > > Since you’re asking before locking in the implementation, I
> > think
> > > > we
> > > > > > > should
> > > > > > > > clarify one point.
> > > > > > > >
> > > > > > > > Model B is certainly simpler than the lease-based approach,
> but
> > > I’m
> > > > > not
> > > > > > > > sure I fully understand what problem it still solves.
> > > > > > > >
> > > > > > > > As I read it, if a client times out while the original
> request
> > is
> > > > > still
> > > > > > > > running, a retry with the same key may not see an idempotency
> > > > record
> > > > > > yet
> > > > > > > > and could run the handler again.
> > > > > > > > So this feels less like preventing duplicate execution and
> more
> > > > like
> > > > > > > > remembering a successful result after the fact.
> > > > > > > >
> > > > > > > > For the create-table case, couldn’t a client achieve roughly
> > the
> > > > same
> > > > > > > > recovery by calling loadTable after an ambiguous timeout and
> > > > > > reconciling
> > > > > > > > from there?
> > > > > > > > Since Model B also rebuilds the response from current catalog
> > > > state,
> > > > > > I’m
> > > > > > > > trying to understand what it gives us beyond that.
> > > > > > > >
> > > > > > > > I’m not against simplifying the design, but I think we should
> > be
> > > > > clear
> > > > > > > > about the narrower guarantee before calling this convergence.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Robert
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, May 29, 2026 at 12:29 AM huaxin gao <
> > > > [email protected]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > I've simplified the proposed design for Idempotency-Key
> > support
> > > > in
> > > > > > > > Polaris
> > > > > > > > > (Iceberg REST spec — retries with the same key must not
> > produce
> > > > > > > > additional
> > > > > > > > > side effects), and I'd like a wider review before updating
> > the
> > > > > > > > > > implementation PR (#4269 <https://github.com/apache/polaris/pull/4269>).
> > > > > > > > >
> > > > > > > > > What changed
> > > > > > > > >
> > > > > > > > > >   - Before (Model A, lease-based): reserve an idempotency row
> > > > > > > > > >     before doing work → IN_PROGRESS / heartbeat → finalize after.
> > > > > > > > > >   - After (Model B, optimistic commit): run the handler first →
> > > > > > > > > >     record only after a successful (2xx) outcome. The record
> > > > > > > > > >     stores binding + status, not the HTTP response body. Retries
> > > > > > > > > >     with the same key re-derive an equivalent response from
> > > > > > > > > >     current catalog state instead of replaying a stored payload.
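
The Model B flow described above could be sketched in Python along these lines. This is a simplified illustration, not the PR's code: `ModelB`, `create_table`, and the in-memory dicts are hypothetical, the binding is reduced to the table name, and the key-misuse check and dropped-table case are left out.

```python
# Hypothetical sketch of Model B: handler first, record only after a 2xx.

class ModelB:
    def __init__(self):
        self.tables = {}       # catalog state: table name -> schema
        self.records = {}      # idempotency key -> table name (binding only)
        self.handler_runs = 0

    def create_table(self, key, name, schema):
        if key in self.records:
            # Retry with a known key: rebuild the response from current
            # catalog state; no response body was stored.
            bound = self.records[key]
            return 200, {"name": bound, "schema": self.tables.get(bound)}
        self.handler_runs += 1
        if name in self.tables:
            return 409, None       # failures are not recorded (2xx-only)
        self.tables[name] = schema
        self.records[key] = name   # written after the commit succeeded
        return 200, {"name": name, "schema": schema}


svc = ModelB()
first = svc.create_table("k1", "t", ["id"])
svc.tables["t"] = ["id", "price"]  # another client evolves the schema
retry = svc.create_table("k1", "t", ["id"])
```

The retry does not run the handler again, and its response reflects the current schema rather than the one returned to the first attempt, which is the "equivalent response" trade-off of not storing payloads.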
> > > > > > > > >
> > > > > > > > > The design doc still compares Model A and Model B
> > side-by-side
> > > so
> > > > > the
> > > > > > > > > trade-offs are explicit. So far the discussion has been
> > leaning
> > > > > > toward
> > > > > > > > > Model B — mutating REST operations only, 2xx-only
> > persistence,
> > > no
> > > > > > > > > response-body storage, and the known
> > > > > > > > > trade-offs (e.g. concurrent first-request races; see the
> > NOTES
> > > > > > section
> > > > > > > in
> > > > > > > > > the doc).
> > > > > > > > >
> > > > > > > > > Does this direction look right before we lock in the
> > > > > implementation?
> > > > > > > > >
> > > > > > > > > > Comments on the doc
> > > > > > > > > > <https://docs.google.com/document/d/1hqTejVyYXDpL5MJcVc7NyhCslKaGH82QoqMEcUYPvkE/edit?tab=t.0>
> > > > > > > > > > or replies on this thread both work.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Huaxin
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
