Re: Subject: [DISCUSS] Idempotency-Key design for Iceberg REST: converging on Model B

huaxin gao Sun, 31 May 2026 11:35:44 -0700

Hi Dmitri, Robert,

Thanks both.


Dmitri — I agree with both of your points.

  - Idempotency storage will stay separate from the metastore. It will
    be separate in code and in transactions. We make the idempotency
    decision before the handler runs, or after it commits — never inside
    the metastore transaction.
  - I'll document the assumption you raised. Model B is only as strict as
    the spec wants if the client builds the request so that at most one
    try can commit (for example, update requirements). The catalog's
    optimistic concurrency makes sure of this. Model B just records the
    result on top of it. I'll say this clearly in the Polaris docs.
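
A minimal sketch of that separation, under stated assumptions: the names
`IdempotencyStore`, `Binding`, `handle`, and `rederive` are hypothetical and
are not Polaris APIs. The lookup happens before the handler runs, the record
is written only after the handler has returned 2xx, and the store never takes
part in the metastore transaction.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class Binding:
    # What a key is bound to; field names are illustrative, not Polaris types.
    principal: str
    operation: str
    resource: str


class IdempotencyStore:
    """Hypothetical store, kept apart from the metastore (own storage, own writes)."""

    def __init__(self) -> None:
        self._records: Dict[str, Binding] = {}

    def lookup(self, key: str) -> Optional[Binding]:
        return self._records.get(key)

    def record(self, key: str, binding: Binding) -> None:
        # First writer wins; an existing record is never overwritten.
        self._records.setdefault(key, binding)


def handle(
    store: IdempotencyStore,
    key: str,
    binding: Binding,
    handler: Callable[[], Tuple[int, str]],
    rederive: Callable[[], str],
) -> Tuple[int, str]:
    # Decision point 1: before the handler runs.
    existing = store.lookup(key)
    if existing is not None:
        if existing != binding:
            return 422, "key reused for a different request"
        return 200, rederive()  # replay: rebuild the response from current state
    # The handler owns its metastore transaction; the store is not touched here.
    status, body = handler()
    # Decision point 2: after the commit, and only for a 2xx outcome.
    if 200 <= status < 300:
        store.record(key, binding)
    return status, body
```

In this sketch a non-2xx outcome leaves no record, so a later retry with the
same key simply runs the handler again.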

Robert — I see why the operation-id-in-metadata idea is appealing. If we
write the id inside the commit, it is atomic with the change. That would
close the in-flight gap for table and view operations. That is a real
plus.

But I don't think we should put the idempotency key in table metadata.
Here is why:

1. It only works for table and view operations. It can't help namespace
   operations, grants, or other writes. A separate store handles all of
   them with one mechanism.

2. It mixes two concerns. Idempotency is a REST/catalog concern. Table
   metadata should describe the table — schema, snapshots, partitioning,
   sort order. A per-request id is not table state. I'd rather not mix
   the two.

3. It bloats the metadata. To support retries we'd have to keep
   operation-ids with some retention/TTL. metadata.json is rewritten on
   every commit and read on every table load. For tables with many
   writes, this adds real cost. And every client and engine that reads
   the table pays it, not just the idempotency path.

4. It doesn't match the spec. The Iceberg REST spec defines idempotency
   at the protocol layer — an Idempotency-Key header with a server-side
   contract. It does not store idempotency in table metadata. Putting an
   operation-id there would be a new mechanism that isn't in the spec
   today. So it's a change to how the spec works, and a cross-project
   change too.

So I'd prefer to keep the record in a separate idempotency store. We
accept the in-flight gap, but it is bounded. The catalog's optimistic
concurrency stops a duplicate commit from landing. And once a record
exists, retries replay cleanly.
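
To illustrate why the gap is bounded, here is a toy sketch under stated
assumptions: `Table`, `commit`, and the `expected_version` check are
hypothetical stand-ins for a catalog's update requirements, not the Iceberg
or Polaris API, and the race is simulated sequentially rather than with real
concurrency. Two attempts are built against the same base version, so at
most one of them can commit.

```python
class Conflict(Exception):
    """Stands in for an HTTP 409 from the catalog."""


class Table:
    def __init__(self) -> None:
        self.version = 0
        self.columns = ["id"]

    def commit(self, expected_version: int, new_column: str) -> int:
        # Update requirement: apply only if the table is still at the base version.
        if self.version != expected_version:
            raise Conflict(f"expected v{expected_version}, found v{self.version}")
        self.columns.append(new_column)
        self.version += 1
        return self.version


table = Table()
base = table.version  # both attempts were built against this version
records = {}          # idempotency records, written only after success

# First attempt commits, but the client times out before the record is written.
table.commit(base, "ts")

# The retry lands in the in-flight gap: no record yet, so the handler runs again.
try:
    table.commit(base, "ts")
    gap_outcome = "committed twice"
except Conflict:
    gap_outcome = "409"  # the requirement rejects the duplicate

# The first attempt's record is now written; later retries replay instead.
records["key-1"] = {"operation": "add-column", "resource": "db.t"}
later_outcome = "replay" if "key-1" in records else "run handler"
```

Here the table ends with a single `ts` column: the retry in the gap is
rejected by the version check rather than applied twice, and any retry after
the record exists is replayed.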

Thanks,
Huaxin

On Sat, May 30, 2026 at 3:15 AM Robert Stupp <[email protected]> wrote:

> Hi all,
>
> Thanks for the clarifications. Russell's explanation is especially useful.
> I agree, ambiguous request outcomes, for example, timeouts or network
> connections being reset, are hard to reason about.
>
> Clients often cannot reliably reconcile from the current state alone for
> table/view state mutating operations.
>
> I wonder whether the idempotency key should be recorded in the table/view
> metadata as an "operation-id", with an explicit retention guarantee, maybe
> tied to a server-provided minimum TTL.
> The approach could reduce or change the role of a separate
> idempotency-record table and handling of it.
>
> Request handling could roughly look like this:
>   if the current history/metadata already contains that "operation-id",
>     return equivalent-enough response without re-running the operation.
>
>   try the committing operation:
>   if the commit succeeds:
>     record the "operation-id" in the table/view metadata, and
>     return the successful response.
>   if the commit runs into a conflict:
>     re-check whether the current metadata/history contains that
>     "operation-id"
>     if so:
>       return equivalent-enough response.
>     otherwise:
>       return the conflict response.
>
> This is not perfect either and needs spec work, retention rules, and may
> only work for table and view operations.
>
> I mostly want to separate the questions:
> 1. What guarantees do clients actually need after an ambiguous outcome?
> 2. Where should the durable evidence for the guarantee live?
>
> Robert
>
> On Sat, May 30, 2026 at 4:30 AM Dmitri Bourlatchkov <[email protected]>
> wrote:
>
> > Hi Russell,
> >
> > Thanks for the information! It clarifies the use case a lot (at least for
> > me :)
> >
> > In short, I'd say the main benefit is allowing clients to avoid conflicts
> > (409) on re-submitting changes that got committed by the server without
> > the client receiving confirmation of the success.
> >
> > I believe the Iceberg REST Catalog spec [1] is formally stricter than
> > Model B when it states "the server ensures no additional effects for
> > requests that carry the same Idempotency-Key". Since Model B permits
> > request re-execution, the possibility of additional side effects cannot
> > be ruled out completely based on the proposed server-side algorithm
> > alone. The server must assume that the client forms the (change) request
> > in such a way that only one execution attempt can succeed (e.g. by using
> > "update requirements"). This is also mentioned in comments on the doc
> > [2].
> >
> > This is probably worth mentioning in the Polaris docs related to
> > our Idempotency-Key implementation.
> >
> > Assuming this kind of cooperation on the client side, I believe Model B
> > can be considered compliant with the spec [1].
> >
> > In anticipation of fresh implementation PRs for this feature, I'd like to
> > re-emphasize (IIRC I mentioned this before) that, I think, we should
> > avoid coupling Idempotency persistence with MetaStore persistence (both
> > code-wise and transaction-wise). Model B processes Idempotency-related
> > data outside the original change request's execution scope. Idempotency
> > decisions are made either before the request starts executing or after
> > it is committed to the MetaStore.
> >
> > [1]
> > https://github.com/apache/polaris/blob/4e4eaf840bf71d431b13034b0dd6f338261d8e8b/spec/iceberg-rest-catalog-open-api.yaml#L2098
> >
> > [2]
> > https://docs.google.com/document/d/1hqTejVyYXDpL5MJcVc7NyhCslKaGH82QoqMEcUYPvkE/edit?tab=t.0
> >
> > Cheers,
> > Dmitri.
> >
> > On Fri, May 29, 2026 at 8:26 PM Russell Spitzer <[email protected]>
> > wrote:
> >
> > > The problem with a client attempting to determine if its operations
> > > succeeded via load table, and the reason all this work has proceeded,
> > > is that there is no guaranteed path for a client to actually determine
> > > if a commit occurred. There are too many legitimate mechanisms to erase
> > > history from an Iceberg table to guarantee an operation occurred.
> > >
> > > For example, you could check if your snapshot exists in snapshot
> > > history, but this could have been erased by expire snapshots.
> > >
> > > Or you could check if the schema was modified according to your update,
> > > but this too could have been undone by another operation. Client A adds
> > > a column but gets a timeout, Client B removes the column, Client A
> > > retries and adds the column again.
> > >
> > > Because of this the Iceberg client usually just bails out to the user
> > > with an exception if it doesn’t get an actual confirmation that the
> > > commit succeeded from the server. This leaves the “can I retry or not”
> > > as an exercise to the end user.
> > >
> > > In practice, actual Iceberg users work around this sort of thing by
> > > adding all sorts of custom metadata to hopefully persist history in the
> > > table itself in some way that can’t be touched by expire snapshots, but
> > > this is usually very fragile and also relies on all clients behaving
> > > well. I’ve seen folks use custom table properties, for example
> > > “batch-5: committed”, then manually have their own retry logic check
> > > whether this property is set. Then, of course, they also have to add a
> > > bunch of custom logic to make sure they clean up this state as well.
> > >
> > > This is why Iceberg added the Idempotency path in the first place: it
> > > gives us a guaranteed way for clients to retry in case of a network
> > > issue or catalog issue, with a guarantee they will not do duplicate
> > > work by retrying. With this in place the client can now cleanly retry
> > > (within the idempotency window) the same operation over and over
> > > without throwing an exception to the end user. Only in a situation
> > > where the catalog cannot respond over a very long time will the user
> > > actually have to do some sort of reconciliation. You can look at the
> > > history of the Iceberg client’s retry behavior with ambiguous server
> > > side or network errors to see how this has been a problem in the past.
> > >
> > > On Fri, May 29, 2026 at 1:24 PM huaxin gao <[email protected]>
> > > wrote:
> > >
> > > > Hi Robert,
> > > >
> > > > Thanks for your reply!
> > > >
> > > > You're right that Model B does not prevent duplicate execution. The
> > > > record is written only after success. So if a client times out while
> > > > the first request is still running, a retry can run the handler
> > > > again. There is no record yet to stop it. So Model B is "remember and
> > > > replay a successful result," not "run exactly once."
> > > >
> > > > On the trade-off: Model A gives a stronger guarantee, but it needs
> > > > reserve/heartbeat/purge state, which adds complexity and overhead.
> > > > Model B is simpler and cheaper. The window it leaves open is small,
> > > > and a client only retries after a timeout, so racing first requests
> > > > should be rare in practice. Every design is a trade-off, and my view
> > > > is that Model B is the right one here.
> > > >
> > > > It also helps to be clear about where duplicate-work protection
> > > > really comes from. It comes from the catalog itself, not from
> > > > idempotency. The catalog uses optimistic concurrency. If two first
> > > > attempts race, at most one commit wins and the other gets a 409.
> > > > Idempotency sits on top of that. It does not replace it.
> > > >
> > > > So what does Model B add over "the client just calls loadTable and
> > > > reconciles"? Two things that I think are real:
> > > >
> > > >   1. The 422 check. loadTable can tell a client that a table exists.
> > > >      It cannot tell the client that the table THEY created with THIS
> > > >      key is the one that succeeded. The record binds the key to
> > > >      (principal, operation, resource). If the same key is reused for
> > > >      a different request, the server returns 422. The client cannot
> > > >      detect this on its own.
> > > >
> > > >   2. One server-side behavior for all mutating ops. create-table
> > > >      happens to reconcile cleanly with loadTable. But the point of the
> > > >      Idempotency-Key header is that the client should not have to
> > > >      write reconciliation logic for every operation. For a known key,
> > > >      the server turns what would be a 409 into an equivalent 2xx
> > > >      replay. The client gets a clean success instead of an error it
> > > >      has to special-case.
> > > >
> > > > There is a third, weaker benefit: once a record exists, retries stop
> > > > seeing flip-flopping results. But that only helps after a record
> > > > exists, which is exactly the window you pointed out is unprotected.
> > > >
> > > > So I'll correct my earlier wording. This is not convergence on
> > > > exactly-once idempotency. It is a narrower guarantee: replay a
> > > > recorded result, plus detect key misuse. It sits on top of the
> > > > catalog's existing concurrency control. The real question for the
> > > > list is simple: is that narrower guarantee worth shipping on its own?
> > > > Or do we need Model A's in-flight protection to have a strong
> > > > idempotency guarantee?
> > > >
> > > > My view is that the narrow version is worth it for now: it's the
> > > > behavior the spec asks for, the 422 check can't be done client-side,
> > > > and it's a small change we can strengthen toward Model A later if a
> > > > real use case needs it. Happy to hear what others think.
> > > >
> > > > Best,
> > > > Huaxin
> > > >
> > > > On Fri, May 29, 2026 at 7:36 AM Robert Stupp <[email protected]> wrote:
> > > >
> > > > > Hi Huaxin,
> > > > >
> > > > > Thanks for writing this up and moving the design discussion back to
> > > > > dev@.
> > > > >
> > > > > Since you’re asking before locking in the implementation, I think we
> > > > > should clarify one point.
> > > > >
> > > > > Model B is certainly simpler than the lease-based approach, but I’m
> > > > > not sure I fully understand what problem it still solves.
> > > > >
> > > > > As I read it, if a client times out while the original request is
> > > > > still running, a retry with the same key may not see an idempotency
> > > > > record yet and could run the handler again.
> > > > > So this feels less like preventing duplicate execution and more like
> > > > > remembering a successful result after the fact.
> > > > >
> > > > > For the create-table case, couldn’t a client achieve roughly the
> > > > > same recovery by calling loadTable after an ambiguous timeout and
> > > > > reconciling from there?
> > > > > Since Model B also rebuilds the response from current catalog state,
> > > > > I’m trying to understand what it gives us beyond that.
> > > > >
> > > > > I’m not against simplifying the design, but I think we should be
> > > > > clear about the narrower guarantee before calling this convergence.
> > > > >
> > > > > Best,
> > > > > Robert
> > > > >
> > > > >
> > > > > On Fri, May 29, 2026 at 12:29 AM huaxin gao <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I've simplified the proposed design for Idempotency-Key support in
> > > > > > Polaris (Iceberg REST spec: retries with the same key must not
> > > > > > produce additional side effects), and I'd like a wider review
> > > > > > before updating the implementation PR (#4269
> > > > > > <https://github.com/apache/polaris/pull/4269>).
> > > > > >
> > > > > > What changed
> > > > > >
> > > > > >   - Before (Model A, lease-based): reserve an idempotency row
> > > > > >     before doing work → IN_PROGRESS / heartbeat → finalize after.
> > > > > >   - After (Model B, optimistic commit): run the handler first →
> > > > > >     record only after a successful (2xx) outcome. The record stores
> > > > > >     binding + status, not the HTTP response body. Retries with the
> > > > > >     same key re-derive an equivalent response from current catalog
> > > > > >     state instead of replaying a stored payload.
> > > > > >
> > > > > > The design doc still compares Model A and Model B side-by-side so
> > > > > > the trade-offs are explicit. So far the discussion has been leaning
> > > > > > toward Model B: mutating REST operations only, 2xx-only
> > > > > > persistence, no response-body storage, and the known trade-offs
> > > > > > (e.g. concurrent first-request races; see the NOTES section in the
> > > > > > doc).
> > > > > >
> > > > > > Does this direction look right before we lock in the
> > > > > > implementation?
> > > > > >
> > > > > > Comments on the doc
> > > > > > <https://docs.google.com/document/d/1hqTejVyYXDpL5MJcVc7NyhCslKaGH82QoqMEcUYPvkE/edit?tab=t.0>
> > > > > > or replies on this thread both work.
> > > > > >
> > > > > > Thanks,
> > > > > > Huaxin
> > > > > >
> > > > >
> > > >
> > >
> >
>
