Discussed offline with various folks who wanted more detail on which kinds of standard infra components return 503 in situations that should be interpreted as "CommitStateUnknown", so I dug into some common open-source components to track down common sources of such 503s.
Two in particular seem to provide the best indication of the non-retriability of 503s from mutation requests:

1. Istio itself added a flag for its default retry-policy behavior on 503: https://github.com/istio/istio/blob/8132a36e67380f5a171937f203f31ff27ebae86d/releasenotes/notes/50506.yaml#L9

> **Fixed** default retry policy to exclude retries on 503 which is potentially unsafe for idempotent requests. It can be reverted by setting "EXCLUDE_UNSAFE_503_FROM_DEFAULT_RETRY" to false. Please note that this flag is only available for backward compatibility reasons and will be removed in future releases

2. Based on some logs that showed 503s coming from "upstream_reset_before_response_started", this is at least one of the places Envoy does it: https://github.com/envoyproxy/envoy/blob/87d4c6c372f535ee3e2077526638a696b67bde84/envoy/stream_info/stream_info.h#L216 defines EarlyUpstreamReset:

> // The upstream connection was reset before a response was started. This
> // will generally be accompanied by details about why the reset occurred.
> const std::string EarlyUpstreamReset =
>     "upstream_reset_before_response_started";

This is used here: https://github.com/envoyproxy/envoy/blob/87d4c6c372f535ee3e2077526638a696b67bde84/source/common/router/router.cc#L996

> // not buffering any data for retry, shadow, and internal redirect, and there will be
> // no more upstream request, abort the request and clean up.
> cleanup();
> callbacks_->sendLocalReply(
>     Http::Code::ServiceUnavailable,
>     "upstream is closed prematurely during decoding data from downstream",
>     modify_headers_, absl::nullopt,
>     StreamInfo::ResponseCodeDetails::get().EarlyUpstreamReset);
> return Http::FilterDataStatus::StopIterationNoBuffer;
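The client-side implication: a 503 generated by an intermediary like this gives no guarantee that the backend never processed the mutation, so on a commit path the safest classification for any 5xx is "state unknown". As a minimal sketch of that argument (hypothetical names only; this is not the actual Iceberg REST client code):

    // Illustrative only: how a commit path could classify final HTTP statuses.
    enum CommitOutcome { SUCCESS, RETRYABLE, FAILED_CLEANLY, STATE_UNKNOWN }

    final class CommitResponseClassifier {
      static CommitOutcome classify(int httpStatus) {
        if (httpStatus >= 200 && httpStatus < 300) {
          return CommitOutcome.SUCCESS;
        }
        if (httpStatus == 429) {
          // Explicit throttling: the server did not process the request, so a
          // low-level retry of the same commit is safe.
          return CommitOutcome.RETRYABLE;
        }
        if (httpStatus >= 400 && httpStatus < 500) {
          // A clean rejection (e.g. 409 conflict on the first attempt): the
          // server did not apply the change, so cleanup is safe.
          return CommitOutcome.FAILED_CLEANLY;
        }
        // Any 5xx (500, 502, 503, 504, ...) may have been emitted by a proxy
        // after the backend started -- or even finished -- applying the commit,
        // so it is neither safely retryable nor provably a clean failure.
        return CommitOutcome.STATE_UNKNOWN;
      }
    }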
On Wed, Jun 18, 2025 at 9:43 PM Dennis Huo <huoi...@gmail.com> wrote:

> Sounds like the confusion is indeed stemming from anchoring on the behaviors of 502 and 504 as established prior art while not applying the same initial decision factors to 503.
>
>> If we don't trust 5xx error codes, then wouldn't it make the most sense to throw CommitStateUnknown for any 5xx response?
>
> I think this is the crux of the issue, and I agree with this statement entirely. It appears most prudent to treat any 5xx response as something that requires higher-level reconciliation of state before attempting any retries, so in general:
>
> 1. Any low-level client retries should not happen in response to 5xx errors
> 2. These should propagate out as CommitStateUnknown consistently
>
> I think there's some variation in how folks are interpreting the meaning of being unable to "trust 5xx error codes" though -- certainly, if 503 is more universally understood to mean a *guarantee* that the backend *did not* and *will not* process the contents of the request, then we're talking about "not trusting that the server returned the correct 5xx error code".
>
> On the other hand, if the 503 RFC is understood to *not make guarantees* about whether the backend did or did not start processing the contents, then we're talking about "don't trust your assumptions about server state after receiving a 503".
>
> Based on just a cursory search across different service domains, it looks like 503 does not generally provide guarantees about no processing. For example:
>
> 1. At least one managed service observes the Istio layer returning 503s during a maintenance event when the underlying k8s application did not yet produce a response at all
> 2. https://www.hostwinds.com/blog/503-error-common-causes-and-how-to-fix-it lists "3. Backend Fetch Failed", "4. Connection Timeout", and "7. Out of Server Resources" as 503 causes, all of which are semantically closer to a 502 than to a 429 (where 429 is more similar to the listed "Server Overloaded" or "Server Under Maintenance" items)
> 3. https://support.google.com/business/thread/10553916/started-getting-this-error-today-googlejsonresponseexception-503-service-unavailable?hl=en shows some APIs explicitly stating 503 as a "partially processed" state:
>
>    "code" : 503,
>    "errors" : [ {
>      "domain" : "global",
>      "message" : "The service encountered a transient error. Part of the request may have succeeded. Refer to details for which fields failed.",
>      "reason" : "backendError"
>    } ],
>
> My two cents would be to go ahead with consistent handling of 500, 502, 503, and 504 at least.
>
> 501 and 505 might be universally understood to be fast-rejections, but I don't feel strongly about those.
>
> On Wed, Jun 18, 2025 at 6:03 PM Prashant Singh <prashant010...@gmail.com> wrote:
>
>> EDIT: corrected some typos:
>>
>> Hey Ryan,
>> My reasoning was the following: we retry on 502 / 504 as well here <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/ExponentialHttpRequestRetryStrategy.java#L86>, we have a spec definition here <https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L1070> stating that the state can be unknown, and so says the RFC <https://datatracker.ietf.org/doc/html/rfc9110#name-502-bad-gateway>: something went wrong in the middle of processing the request. So if we retry on 502 and 504 we can conflict with ourselves, as we don't know whether a 429 / 409 we receive is because we retried on a 502 and then got the 429 / 409, or because something else happened. So I thought it's better to throw CommitStateUnknown if we know the commit has been retried.
>>
>> The issue that Iceberg users hit was with a 504: slack thread - https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1747992294134219
>>
>> We have similar handling for Glue, as we did here https://github.com/apache/iceberg/pull/7198, since there are internal HTTP retries inside the Glue SDK.
>>
>> And hence we did it this way. I do agree that 429 / 503 is unnecessarily being wrapped and we can miss the opportunity to clean up, but it doesn't hurt, as the alternative is to capture all the status codes we got and inspect every permutation and combination of them, which I wanted to avoid. Plus, I admit there was a bug in our service, which we are chasing and fixing as well, where it throws 503 instead of 502... but even with that, the problem will still exist due to retries if they are not handled properly.
>>
>> That was my thought process while doing this, and I'm happy to incorporate your feedback.
>>
>> Maybe another alternative is that we take 502 / 504 out of the retry code and just retry on 429 / 503, and clean up only when 409 or 503 or 501 is the final status code? The rest are all unknown.
>>
>> Best,
>> Prashant Singh
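As an aside on the "throw CommitStateUnknown if we know the commit has been retried" reasoning above, a minimal hypothetical sketch of that decision point could look like the following (RetryAwareCommitHandler and its methods are illustrative names, not part of the Iceberg client):

    // Hypothetical sketch, not the merged change: remember whether the HTTP
    // layer retried after a 5xx, because a later 409 may simply be the commit
    // conflicting with its own earlier, possibly successful, attempt.
    final class RetryAwareCommitHandler {
      private boolean retriedAfterServerError = false;

      // Called by the retry strategy before re-sending the request.
      void recordRetry(int previousStatus) {
        if (previousStatus >= 500) {
          retriedAfterServerError = true;
        }
      }

      // Decides how the commit path should interpret the final status.
      RuntimeException toException(int finalStatus, String message) {
        if (finalStatus == 409 && !retriedAfterServerError) {
          // Conflict on the first attempt: a genuine failure (Iceberg's
          // CommitFailedException), so cleaning up the written files is safe.
          return new RuntimeException("Commit failed: " + message);
        }
        // A 409 that follows a 5xx retry, or any final 5xx, leaves the outcome
        // unknown (Iceberg's CommitStateUnknownException): skip cleanup.
        return new RuntimeException("Commit state unknown: " + message);
      }
    }

In the real client these map to distinct exception types; the sketch only shows why tracking "was this commit retried?" removes the ambiguity between a genuine conflict and a self-conflict.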
>> On Wed, Jun 18, 2025 at 4:21 PM Prashant Singh <prashant010...@gmail.com> wrote:
>>
>>> Hey Ryan,
>>> My reasoning was the following: we retry on 502 / 504 as well here <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/ExponentialHttpRequestRetryStrategy.java#L86>, we have a spec definition here <https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L1070> stating that the state can be unknown, and so says the RFC <https://datatracker.ietf.org/doc/html/rfc9110#name-502-bad-gateway>: something went wrong in the middle of processing the request. So if we retry on 502 and 504 we can conflict with ourselves, as we don't know whether a 429 we receive is because we retried on a 502 and then got the 429, or because something else happened. So I thought it's better to throw CommitStateUnknown if we know the commit has been retried.
>>>
>>> The issue that Iceberg users hit was with a 504: slack thread - https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1747992294134219
>>>
>>> We have similar handling for Glue, as we did here https://github.com/apache/iceberg/pull/7198, since there are internal HTTP retries inside the Glue SDK.
>>>
>>> And hence we did it this way. I do agree that 429 / 503 is unnecessarily being wrapped and we can miss the opportunity to clean up, but it doesn't hurt, as the alternative is to capture all the status codes we got and inspect every permutation and combination of them, which I wanted to avoid. Plus, I admit there was a bug in our service, which we are chasing and fixing as well, where it throws 503 instead of 502... but even with that, the problem will still exist due to retries if they are not handled properly.
>>>
>>> That was my thought process while doing this, and I'm happy to incorporate your feedback.
>>>
>>> Maybe another alternative is that we take 502 / 504 out of the retry code and just retry on 429 / 503, and clean up only when 409 or 503 or 501 is the final status code? The rest are all unknown.
>>>
>>> Best,
>>> Prashant Singh
>>>
>>> On Wed, Jun 18, 2025 at 12:22 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>
>>>> Are we confident that the REST client fix is the right path?
>>>>
>>>> If I understand correctly, the issue is that there is a load balancer that is converting a 500 response to a 503 that the client will automatically retry, even though the 500 response is not retry-able. Then the retry results in a 409 because the commit conflicts with itself.
>>>>
>>>> I'm fairly concerned about a fix like this that operates on the assumption that we cannot trust the 5xx error code that was sent to the client. If we don't trust 5xx error codes, then wouldn't it make the most sense to throw CommitStateUnknown for any 5xx response?
>>>>
>>>> On Tue, Jun 17, 2025 at 1:00 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>
>>>>> Sorry I didn't get to reply here. I think the fix Ajantha is contributing is extremely important but probably out of scope for a patch release. Because it will take a bit of manual intervention to fix after jumping to the next version, I think we should save this for 1.10.0, which also should come out pretty soon.
>>>>> On Tue, Jun 17, 2025 at 1:16 PM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>
>>>>>> Thank you for confirming over slack, Ajantha. I also double checked with Russell offline; this seems to be a behaviour change which can't go in a patch release, maybe 1.10 then. So I think we should be good for now. That being said, I will start working on getting an RC out with just this commit <https://github.com/singhpk234/iceberg/commit/14ba08e672df5061f8ac0978ea81f10535da2246> on top of 1.9.1, and will start a thread for that accordingly!
>>>>>>
>>>>>> Best,
>>>>>> Prashant Singh
>>>>>>
>>>>>> On Tue, Jun 17, 2025 at 8:44 AM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>>
>>>>>>> Hey Ajantha,
>>>>>>>
>>>>>>> I was going to wait for 2-3 days before cutting an RC anyway :), to see if anyone has an objection or some more *critical* changes to get in. Thank you for the heads up!
>>>>>>>
>>>>>>> Best,
>>>>>>> Prashant Singh
>>>>>>>
>>>>>>> On Mon, Jun 16, 2025 at 11:02 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I have a PR that needs to be included in 1.9.2 as well!
>>>>>>>>
>>>>>>>> I was about to start a release discussion for 1.9.2 tomorrow.
>>>>>>>>
>>>>>>>> It stems from an oversight during the partition stats refactoring (in 1.9.0): we missed that field ids are tracked in the spec during the refactoring. Because of this, schema field ids are not as per the partition stats spec. The solution is to implicitly recompute stats when old stats with invalid field ids are detected. Including the fix in 1.9.2 can reduce the impact of the issue, since new stats will have valid schema ids and corrupted stats are also handled. The feature is not widely integrated yet and engine integrations are in progress for 1.10.0, but we still want to reduce the impact of the issue. I have chatted with Peter and Russell about this.
>>>>>>>>
>>>>>>>> PR will be shared tomorrow.
>>>>>>>>
>>>>>>>> Just giving a heads up to wait a day or two for the release cut.
>>>>>>>>
>>>>>>>> - Ajantha
>>>>>>>>
>>>>>>>> On Mon, Jun 16, 2025 at 11:25 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> +1 for a 1.9.2 release
>>>>>>>>>
>>>>>>>>> On Mon, Jun 16, 2025 at 10:53 AM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Kevin,
>>>>>>>>>> This goes back well before 1.8; the issue my PR refers to was reported on Iceberg 1.7, and it has been there since the beginning of the IRC (REST catalog) client. We were having similar debates on whether we should patch all the releases, but I think that requires a wider discussion; 1.9.2 sounds like the immediate next step!
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Prashant Singh
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 16, 2025 at 10:42 AM Kevin Liu <kevinjq...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Prashant,
>>>>>>>>>>>
>>>>>>>>>>> This sounds like a good reason to do a patch release. I'm +1
>>>>>>>>>>>
>>>>>>>>>>> Do you know if this is also affecting other minor versions? Do we also need to patch 1.8.x?
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Kevin Liu
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jun 16, 2025 at 10:30 AM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey all,
>>>>>>>>>>>>
>>>>>>>>>>>> A couple of users have recently reported a *critical table corruption issue* when using the Iceberg REST catalog client. This issue has existed from the beginning and can result in *committed snapshot data being incorrectly deleted*, effectively rendering the table unusable (i.e., corrupted). The root cause is *self-conflict during internal HTTP client retries*.
>>>>>>>>>>>>
>>>>>>>>>>>> More context can be found here:
>>>>>>>>>>>> [1] https://github.com/apache/iceberg/pull/12818#issue-3000325526
>>>>>>>>>>>> [2] https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1747992294134219
>>>>>>>>>>>>
>>>>>>>>>>>> We’ve seen this issue hit by users in the wider community and also internally on our end. I’ve submitted a fix here (now merged): https://github.com/apache/iceberg/pull/12818
>>>>>>>>>>>>
>>>>>>>>>>>> Given the *severity* and the *broad version impact*, I’d like to propose cutting a *1.9.2 patch release* to include this fix.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm also happy to *volunteer as the release manager* for this release.
>>>>>>>>>>>>
>>>>>>>>>>>> Looking forward to hearing the community's thoughts.
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Prashant Singh
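To make the failure mode described in the original report concrete, here is a minimal, hypothetical sketch (CommitCleanupGuard is an illustrative name, not actual Iceberg code) of the invariant the fix is protecting: only delete the files written for a commit when the server has positively indicated the commit was not applied.

    // Illustrative only. The commit may already have been applied before an
    // intermediary returned a 5xx, so a 409 produced by our own internal retry
    // does not prove that the files we wrote are unreferenced.
    final class CommitCleanupGuard {
      static boolean safeToDeleteWrittenFiles(int finalStatus, boolean retriedAfter5xx) {
        if (finalStatus >= 500) {
          return false; // outcome unknown: the snapshot may have been committed
        }
        if (finalStatus == 409 && retriedAfter5xx) {
          return false; // possible self-conflict with our own successful attempt
        }
        // A clean 4xx rejection on the first attempt means nothing was applied.
        return finalStatus >= 400 && finalStatus < 500;
      }
    }

Deleting the written files in either of the first two cases is exactly what turns an ambiguous response into the committed-data loss described above.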