Hi Dmitri,

I like the idea: the atomic key write closes the in-flight gap, and it avoids the Iceberg metadata and spec issues. Agreed too that losing keys on already-deleted entities is harmless.
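
To check that I read the proposal correctly, here is a minimal sketch of the entity-property approach. It is an illustration only, not Polaris code: `Catalog`, `Entity`, `ConflictError`, and the `idempotency-key.` property prefix are hypothetical names, and the in-memory dict stands in for a Persistence implementation. The assumption is that the key is written in the same versioned update as the entity change.

```python
from dataclasses import dataclass, field

# Hypothetical property prefix; not an existing Polaris convention.
KEY_PREFIX = "idempotency-key."


class ConflictError(Exception):
    """Raised when the caller's expected entity version is stale (a 409)."""


@dataclass
class Entity:
    name: str
    version: int = 0
    properties: dict = field(default_factory=dict)


class Catalog:
    """In-memory stand-in for a catalog with optimistic concurrency."""

    def __init__(self):
        self._entities = {}

    def create(self, name):
        self._entities[name] = Entity(name)
        return self._entities[name]

    def load(self, name):
        return self._entities[name]

    def update(self, name, expected_version, changes, idempotency_key):
        entity = self._entities[name]
        key_property = KEY_PREFIX + idempotency_key
        # Replay: an earlier commit already recorded this key on the entity.
        if key_property in entity.properties:
            return "replayed"
        # Optimistic concurrency: a stale version means a conflict.
        if entity.version != expected_version:
            raise ConflictError(name)
        # The entity change and the key land in one new entity version,
        # so there is no window where one exists without the other.
        new_properties = {**entity.properties, **changes, key_property: "committed"}
        self._entities[name] = Entity(name, entity.version + 1, new_properties)
        return "committed"
```

With this shape, a retry of an update that already committed finds its key on the entity and gets a replay instead of a 409, while an unrelated request with a stale version still conflicts.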

But I think the harder case is delete operations themselves. For drop table/view/namespace, the operation removes the entity, so there is no surviving entity to hold the key. A retry of a successful drop should return an equivalent success, but with entity-property storage the key has nowhere to live, so the retry would just see "not found" and behave differently. Where would a drop's key live in this model?

Thanks,
Huaxin

On Mon, Jun 1, 2026 at 6:13 PM Yufei Gu wrote:

> One concern I have with storing idempotency records as entity properties is the potential performance impact. Over time, an entity could have a large number of idempotency key/value pairs. That would increase the entity's size, which may affect load, update, serialization, and caching costs for normal catalog operations, even when idempotency is not involved. Use cases such as table loading and entity in-memory caching could be affected. Before moving in that direction, I think it would be useful to better understand and measure the performance implications. If the entity size growth turns out to be negligible in practice, the approach may still be attractive because of its transactional simplicity.
>
> Yufei
>
> On Mon, Jun 1, 2026 at 2:17 PM Dmitri Bourlatchkov wrote:
>
> > Hi Huaxin,
> >
> > How about storing idempotency keys in the Polaris Entity properties (not Iceberg metadata)?
> >
> > I understand that entities can be deleted, thus discarding previously recorded keys, but based on the use cases discussed so far, it does not look like deleted entities should be a functional concern.
> >
> > Storing idempotency keys inside the entity will ensure that their updates are processed in the same logical change set as the entity changes from the IRC request payload.
> >
> > This will ensure uniform operations across all Persistence implementations and will not require any Idempotency-specific Persistence changes.
> >
> > WDYT?
> >
> > Thanks,
> > Dmitri.
> >
> > On Sun, May 31, 2026 at 2:35 PM huaxin gao wrote:
> >
> > > Hi Dmitri, Robert,
> > >
> > > Thanks both.
> > >
> > > Dmitri: I agree with both of your points.
> > >
> > > - Idempotency storage will stay separate from the metastore. It will be separate in code and in transactions. We make the idempotency decision before the handler runs, or after it commits, never inside the metastore transaction.
> > > - I'll document the assumption you raised. Model B is only as strict as the spec wants if the client builds the request so that at most one try can commit (for example, update requirements). The catalog's optimistic concurrency makes sure of this. Model B just records the result on top of it. I'll say this clearly in the Polaris docs.
> > >
> > > Robert: I see why the operation-id-in-metadata idea is appealing. If we write the id inside the commit, it is atomic with the change. That would close the in-flight gap for table and view operations. That is a real plus.
> > >
> > > But I don't think we should put the idempotency key in table metadata. Here is why:
> > >
> > > 1. It only works for table and view operations. It can't help namespace operations, grants, or other writes. A separate store handles all of them with one mechanism.
> > >
> > > 2. It mixes two concerns. Idempotency is a REST/catalog concern. Table metadata should describe the table: schema, snapshots, partitioning, sort order. A per-request id is not table state. I'd rather not mix the two.
> > >
> > > 3. It bloats the metadata. To support retries we'd have to keep operation-ids with some retention/TTL.
> > >    metadata.json is rewritten on every commit and read on every table load. For tables with many writes, this adds real cost. And every client and engine that reads the table pays it, not just the idempotency path.
> > >
> > > 4. It doesn't match the spec. The Iceberg REST spec defines idempotency at the protocol layer: an Idempotency-Key header with a server-side contract. It does not store idempotency in table metadata. Putting an operation-id there would be a new mechanism that isn't in the spec today. So it's a change to how the spec works, and a cross-project change too.
> > >
> > > So I'd prefer to keep the record in a separate idempotency store. We accept the in-flight gap, but it is bounded. The catalog's optimistic concurrency stops a duplicate commit from landing. And once a record exists, retries replay cleanly.
> > >
> > > Thanks,
> > > Huaxin
> > >
> > > On Sat, May 30, 2026 at 3:15 AM Robert Stupp wrote:
> > >
> > > > Hi all,
> > > >
> > > > Thanks for the clarifications. Russell's explanation is especially useful. I agree, ambiguous request outcomes, for example, timeouts or network connections being reset, are hard to reason about.
> > > >
> > > > Clients often cannot reliably reconcile from the current state alone for table/view state mutating operations.
> > > >
> > > > I wonder whether the idempotency key should be recorded in the table/view metadata as an "operation-id", with an explicit retention guarantee, maybe tied to a server-provided minimum TTL. The approach could reduce or change the role of a separate idempotency-record table and handling of it.
> > > >
> > > > Request handling could roughly look like this:
> > > >   if the current history/metadata already contains that "operation-id",
> > > >     return equivalent-enough response without re-running the operation.
> > > >
> > > >   try the committing operation:
> > > >     if the commit succeeds:
> > > >       record the "operation-id" in the table/view metadata, and
> > > >       return the successful response.
> > > >     if the commit runs into a conflict:
> > > >       re-check whether the current metadata/history contains that "operation-id"
> > > >       if so:
> > > >         return equivalent-enough response.
> > > >       otherwise:
> > > >         return the conflict response.
> > > >
> > > > This is not perfect either and needs spec work, retention rules, and may only work for table and view operations.
> > > >
> > > > I mostly want to separate the questions:
> > > > 1. What guarantees do clients actually need after an ambiguous outcome?
> > > > 2. Where should the durable evidence for the guarantee live?
> > > >
> > > > Robert
> > > >
> > > > On Sat, May 30, 2026 at 4:30 AM Dmitri Bourlatchkov wrote:
> > > >
> > > > > Hi Russell,
> > > > >
> > > > > Thanks for the information! It clarifies the use case a lot (at least for me :)
> > > > >
> > > > > In short, I'd say the main benefit is allowing clients to avoid conflicts (409) on re-submitting changes that got committed by the server without the client receiving confirmation of the success.
> > > > >
> > > > > I believe the Iceberg REST Catalog spec [1] is formally stricter than Model B when it states "the server ensures no additional effects for requests that carry the same Idempotency-Key". Since Model B permits request re-execution, the possibility of additional side effects cannot be ruled out completely based on the proposed server-side algorithm alone. The server must assume that the client forms the (change) request in such a way that only one execution attempt can succeed (e.g. by using "update requirements").
> > > > > This is also mentioned in comments on the doc [2].
> > > > >
> > > > > This is probably worth mentioning in the Polaris docs related to our Idempotency-Key implementation.
> > > > >
> > > > > Assuming this kind of cooperation on the client side, I believe Model B can be considered compliant with the spec [1].
> > > > >
> > > > > In anticipation of fresh implementation PRs for this feature, I'd like to re-emphasize (IIRC I mentioned this before) that, I think, we should avoid coupling Idempotency persistence with MetaStore persistence (both code-wise and transaction-wise). Model B processes Idempotency-related data outside the original change request's execution scope. Idempotency decisions are made either before the request starts executing or after it is committed to the MetaStore.
> > > > >
> > > > > [1] https://github.com/apache/polaris/blob/4e4eaf840bf71d431b13034b0dd6f338261d8e8b/spec/iceberg-rest-catalog-open-api.yaml#L2098
> > > > >
> > > > > [2] https://docs.google.com/document/d/1hqTejVyYXDpL5MJcVc7NyhCslKaGH82QoqMEcUYPvkE/edit?tab=t.0
> > > > >
> > > > > Cheers,
> > > > > Dmitri.
> > > > >
> > > > > On Fri, May 29, 2026 at 8:26 PM Russell Spitzer wrote:
> > > > >
> > > > > > The problem with a client attempting to determine if its operations succeeded via load table, and the reason all this work has proceeded, is that there is no guaranteed path for a client to actually determine if a commit occurred. There are too many legitimate mechanisms to erase history from an Iceberg table to guarantee an operation occurred.
> > > > > >
> > > > > > For example, you could check if your snapshot exists in snapshot history but this could have been erased by expire snapshots.
> > > > > >
> > > > > > Or you could check if the schema was modified according to your update, but this too could have been undone by another operation. Client A adds a column but gets a timeout, Client B removes the column, Client A retries and adds the column again.
> > > > > >
> > > > > > Because of this the Iceberg client usually just bails out to the user with an exception if it doesn’t get an actual confirmation that the commit succeeded from the server. This leaves the “can I retry or not” as an exercise to the end user.
> > > > > >
> > > > > > In practice, actual Iceberg users work around this sort of thing by adding all sorts of custom metadata to hopefully persist history in the table itself in some way that can’t be touched by expire snapshots, but this is usually very fragile and also relies on all clients behaving well. I’ve seen folks use custom table properties, for example “batch-5: committed”, then manually have their own retry logic check whether this property is set. Then, of course, they also have to add a bunch of custom logic to make sure they clean up this state as well.
> > > > > >
> > > > > > This is why Iceberg added the Idempotency path in the first place: it gives us a guaranteed way for clients to retry in case of a network issue or catalog issue with a guarantee they will not do duplicate work by retrying.
> > > > > > With this in place the client can now cleanly retry (within the idempotency window) the same operation over and over without throwing an exception to the end user. Only in a situation where the catalog cannot respond over a very long time will the user actually have to do some sort of reconciliation. You can look at the history of the Iceberg client’s retry behavior with ambiguous server side or network errors to see how this has been a problem in the past.
> > > > > >
> > > > > > On Fri, May 29, 2026 at 1:24 PM huaxin gao wrote:
> > > > > >
> > > > > > > Hi Robert,
> > > > > > >
> > > > > > > Thanks for your reply!
> > > > > > >
> > > > > > > You're right that Model B does not prevent duplicate execution. The record is written only after success. So if a client times out while the first request is still running, a retry can run the handler again. There is no record yet to stop it. So Model B is "remember and replay a successful result," not "run exactly once."
> > > > > > >
> > > > > > > On the trade-off: Model A gives a stronger guarantee, but it needs reserve/heartbeat/purge state, which adds complexity and overhead. Model B is simpler and cheaper. The window it leaves open is small, and a client only retries after a timeout, so racing first requests should be rare in practice. Every design is a trade-off, and my view is that Model B is the right one here.
> > > > > > >
> > > > > > > It also helps to be clear about where duplicate-work protection really comes from. It comes from the catalog itself, not from idempotency.
> > > > > > > The catalog uses optimistic concurrency. If two first attempts race, at most one commit wins and the other gets a 409. Idempotency sits on top of that. It does not replace it.
> > > > > > >
> > > > > > > So what does Model B add over "the client just calls loadTable and reconciles"? Two things that I think are real:
> > > > > > >
> > > > > > > 1. The 422 check. loadTable can tell a client that a table exists. It cannot tell the client that the table THEY created with THIS key is the one that succeeded. The record binds the key to (principal, operation, resource). If the same key is reused for a different request, the server returns 422. The client cannot detect this on its own.
> > > > > > >
> > > > > > > 2. One server-side behavior for all mutating ops. create-table happens to reconcile cleanly with loadTable. But the point of the Idempotency-Key header is that the client should not have to write reconciliation logic for every operation. For a known key, the server turns what would be a 409 into an equivalent 2xx replay. The client gets a clean success instead of an error it has to special-case.
> > > > > > >
> > > > > > > There is a third, weaker benefit: once a record exists, retries stop seeing flip-flopping results. But that only helps after a record exists, which is exactly the window you pointed out is unprotected.
> > > > > > >
> > > > > > > So I'll correct my earlier wording. This is not convergence on exactly-once idempotency. It is a narrower guarantee: replay a recorded result, plus detect key misuse.
> > > > > > > It sits on top of the catalog's existing concurrency control. The real question for the list is simple: is that narrower guarantee worth shipping on its own? Or do we need Model A's in-flight protection to have a strong idempotency guarantee?
> > > > > > >
> > > > > > > My view is that the narrow version is worth it for now: it's the behavior the spec asks for, the 422 check can't be done client-side, and it's a small change we can strengthen toward Model A later if a real use case needs it. Happy to hear what others think.
> > > > > > >
> > > > > > > Best,
> > > > > > > Huaxin
> > > > > > >
> > > > > > > On Fri, May 29, 2026 at 7:36 AM Robert Stupp wrote:
> > > > > > >
> > > > > > > > Hi Huaxin,
> > > > > > > >
> > > > > > > > Thanks for writing this up and moving the design discussion back to dev@.
> > > > > > > >
> > > > > > > > Since you’re asking before locking in the implementation, I think we should clarify one point.
> > > > > > > >
> > > > > > > > Model B is certainly simpler than the lease-based approach, but I’m not sure I fully understand what problem it still solves.
> > > > > > > >
> > > > > > > > As I read it, if a client times out while the original request is still running, a retry with the same key may not see an idempotency record yet and could run the handler again. So this feels less like preventing duplicate execution and more like remembering a successful result after the fact.
> > > > > > > >
> > > > > > > > For the create-table case, couldn’t a client achieve roughly the same recovery by calling loadTable after an ambiguous timeout and reconciling from there? Since Model B also rebuilds the response from current catalog state, I’m trying to understand what it gives us beyond that.
> > > > > > > >
> > > > > > > > I’m not against simplifying the design, but I think we should be clear about the narrower guarantee before calling this convergence.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Robert
> > > > > > > >
> > > > > > > > On Fri, May 29, 2026 at 12:29 AM huaxin gao wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > I've simplified the proposed design for Idempotency-Key support in Polaris (Iceberg REST spec: retries with the same key must not produce additional side effects), and I'd like a wider review before updating the implementation PR (#4269 <https://github.com/apache/polaris/pull/4269>).
> > > > > > > > >
> > > > > > > > > What changed
> > > > > > > > >
> > > > > > > > > - Before (Model A, lease-based): reserve an idempotency row before doing work → IN_PROGRESS / heartbeat → finalize after.
> > > > > > > > > - After (Model B, optimistic commit): run the handler first → record only after a successful (2xx) outcome. The record stores binding + status, not the HTTP response body.
> > > > > > > > >   Retries with the same key re-derive an equivalent response from current catalog state instead of replaying a stored payload.
> > > > > > > > >
> > > > > > > > > The design doc still compares Model A and Model B side-by-side so the trade-offs are explicit. So far the discussion has been leaning toward Model B: mutating REST operations only, 2xx-only persistence, no response-body storage, and the known trade-offs (e.g. concurrent first-request races; see the NOTES section in the doc).
> > > > > > > > >
> > > > > > > > > Does this direction look right before we lock in the implementation?
> > > > > > > > >
> > > > > > > > > Comments on the doc <https://docs.google.com/document/d/1hqTejVyYXDpL5MJcVc7NyhCslKaGH82QoqMEcUYPvkE/edit?tab=t.0> or replies on this thread both work.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Huaxin
