Thanks for the background, Dennis! Thanks, Ryan. It seems like we are OK with treating all 5xx errors as CommitStateUnknown and hence not retrying. Considering what the spec says about 502 <https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L1070> and 504 <https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L1084> vs. 503 <https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L4648>, I think this requires spec changes, since we would now require 503 to throw CommitStateUnknown.
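For illustration only, here is a minimal sketch of that rule: any 5xx on a commit surfaces as commit-state-unknown instead of being retried. All names in it are hypothetical; it is not the actual Iceberg REST client code (ErrorHandlers / HTTPClient), it only shows the decision being discussed.

// Sketch only: "any 5xx on a commit means the commit state is unknown, so do not retry".
// CommitHttpHandler, CommitUnknownException, and HttpResult are hypothetical names, not
// the actual Iceberg REST client classes.
public class CommitHttpHandler {

  /** Hypothetical stand-in for org.apache.iceberg.exceptions.CommitStateUnknownException. */
  public static class CommitUnknownException extends RuntimeException {
    public CommitUnknownException(String message) {
      super(message);
    }
  }

  /** Hypothetical holder for an HTTP response to a commit request. */
  public record HttpResult(int status, String body) {}

  public HttpResult handleCommitResponse(HttpResult response) {
    int status = response.status();
    if (status / 100 == 2) {
      return response; // commit confirmed by the service
    }
    if (status / 100 == 5) {
      // 500/502/503/504 give no guarantee about whether the backend applied the change,
      // so surface the uncertainty instead of retrying (a retry can conflict with itself).
      throw new CommitUnknownException("Commit state unknown, server returned HTTP " + status);
    }
    // 4xx responses are treated as definitive (e.g. 409 conflict) and handled separately.
    throw new IllegalStateException("Commit rejected with HTTP " + status);
  }
}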
Since 1.9.2 is a patch release, it should not, as I understand it, introduce behaviour changes, and treating 503 the same as 502 / 504 would be a behaviour change, IMHO. Given that PR-13352 <https://github.com/apache/iceberg/pull/13352> got in, I started a vote thread for it.

Best,
Prashant Singh

On Thu, Jun 26, 2025 at 3:12 PM Ryan Blue <rdb...@gmail.com> wrote:

> Thanks for all the background, Dennis! I think I agree that we should treat all 5xx errors as CommitStateUnknown because we don't necessarily know what was passed to the service. We would need to update the spec, right?
>
> On Thu, Jun 26, 2025 at 3:03 PM Dennis Huo <huoi...@gmail.com> wrote:
>
>> Discussed offline with various folks who were interested in more details about what kinds of standard infra components use 503 for things that should be interpreted as "CommitStateUnknown", so I dug into some common open-source components to track down common sources of such 503s.
>>
>> Two in particular seem to provide the best indication as to the non-retriability of 503s from mutation requests:
>>
>> 1. Istio itself added a flag for its own retry policy default behaviors on 503:
>> https://github.com/istio/istio/blob/8132a36e67380f5a171937f203f31ff27ebae86d/releasenotes/notes/50506.yaml#L9
>>
>>> **Fixed** default retry policy to exclude retries on 503 which is potentially unsafe for idempotent requets.
>>> It can be reverted by setting "EXCLUDE_UNSAFE_503_FROM_DEFAULT_RETRY" to false. Please note that this flag is only available for backward compatibility reasons and will be removed in future releases
>>
>> 2. Based on some logs that showed 503 coming from "upstream_reset_before_response_started", this is at least one of the places Envoy does it:
>>
>> https://github.com/envoyproxy/envoy/blob/87d4c6c372f535ee3e2077526638a696b67bde84/envoy/stream_info/stream_info.h#L216 defines EarlyUpstreamReset:
>>
>>> // The upstream connection was reset before a response was started. This
>>> // will generally be accompanied by details about why the reset occurred.
>>> const std::string EarlyUpstreamReset = "upstream_reset_before_response_started";
>>
>> This is used here:
>> https://github.com/envoyproxy/envoy/blob/87d4c6c372f535ee3e2077526638a696b67bde84/source/common/router/router.cc#L996
>>
>>> // not buffering any data for retry, shadow, and internal redirect, and there will be
>>> // no more upstream request, abort the request and clean up.
>>> cleanup();
>>> callbacks_->sendLocalReply(
>>>     Http::Code::ServiceUnavailable,
>>>     "upstream is closed prematurely during decoding data from downstream", modify_headers_,
>>>     absl::nullopt, StreamInfo::ResponseCodeDetails::get().EarlyUpstreamReset);
>>> return Http::FilterDataStatus::StopIterationNoBuffer;
>>
>> On Wed, Jun 18, 2025 at 9:43 PM Dennis Huo <huoi...@gmail.com> wrote:
>>
>>> Sounds like the confusion is indeed stemming from anchoring on behaviors of 502 and 504 as established prior art while not applying the same initial decision factors to 503.
>>>
>>>> If we don't trust 5xx error codes, then wouldn't it make the most sense to throw CommitStateUnknown for any 5xx response?
>>>
>>> I think this is the crux of the issue, and I agree with this statement entirely. It appears most prudent to treat any 5xx response as something that requires higher-level reconciliation of state before attempting any retries, so in general:
>>>
>>> 1. Any low-level client retries should not happen in response to 5xx errors
>>> 2. These should propagate out as CommitStateUnknown consistently
>>>
>>> I think there's some variation in how folks are interpreting the meaning of being unable to "trust 5xx error codes" though -- certainly, if 503 is more universally understood to mean a *guarantee* that the backend *did not* and *will not* process the contents of the request, then we're talking about "not trusting that the server returned the correct 5xx error code".
>>>
>>> On the other hand, if the 503 RFC is understood to *not make guarantees* about whether the backend did or did not start processing contents, then we're talking about "don't trust your assumptions about server state after receiving a 503".
>>>
>>> Based on just a cursory search from different service domains, it looks like 503 does not generally provide guarantees about no processing. For example:
>>>
>>> 1. At least one managed service observes the Istio layer returning 503s during a maintenance event when the underlying k8s application did not yet produce a response at all
>>> 2. https://www.hostwinds.com/blog/503-error-common-causes-and-how-to-fix-it lists "3. Backend Fetch Failed", "4. Connection Timeout", and "7. Out of Server Resources" as 503 causes, all of which are semantically closer to a 502 than a 429 (where 429 is more similar to the listed "Server Overloaded" or "Server Under Maintenance" items)
>>> 3. https://support.google.com/business/thread/10553916/started-getting-this-error-today-googlejsonresponseexception-503-service-unavailable?hl=en shows some APIs explicitly stating 503 as a "partially processed" state:
>>>
>>> "code" : 503,
>>> "errors" : [ {
>>>   "domain" : "global",
>>>   "message" : "The service encountered a transient error. Part of the request may have succeeeded. Refer to details for which fields failed.",
>>>   "reason" : "backendError"
>>> } ],
>>>
>>> My two cents would be to go ahead with consistent handling of 500, 502, 503, and 504 at least.
>>>
>>> 501 and 505 might be universally understood to be fast-rejections, but I don't feel strongly about those.
>>>
>>> On Wed, Jun 18, 2025 at 6:03 PM Prashant Singh <prashant010...@gmail.com> wrote:
>>>
>>>> EDIT: corrected some typos:
>>>>
>>>> Hey Ryan,
>>>> My reasoning was the following: we already retry on 502 / 504 here <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/ExponentialHttpRequestRetryStrategy.java#L86>, we have a spec definition here <https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L1070> stating that the commit state can be unknown, and so says the RFC <https://datatracker.ietf.org/doc/html/rfc9110#name-502-bad-gateway>, since something went wrong in the middle of processing the request. So when we retry on 502 and 504, we can conflict with ourselves: when we receive a 429 / 409, we don't know whether it is because we retried after a 502 and then got the 429 / 409, or because something else happened. So I thought it's better to throw CommitStateUnknown if we know the commit has been retried.
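To illustrate the scenario described above, here is a rough sketch under stated assumptions: if a 409 arrives only after the client has already retried a 5xx, it may be the commit conflicting with its own earlier attempt, so it is surfaced as commit-state-unknown rather than a clean failure. All names below (CommitRetrySketch, Transport, sendOnce) are hypothetical; this is not the actual ExponentialHttpRequestRetryStrategy or ErrorHandlers code.

// Sketch of the decision rule only; the merged fix may differ in detail.
import java.util.ArrayList;
import java.util.List;

public class CommitRetrySketch {

  /** Hypothetical transport: performs one HTTP attempt and returns the status code. */
  public interface Transport {
    int sendOnce(String commitRequest);
  }

  public static String commit(Transport transport, String request, int maxRetries) {
    List<Integer> statuses = new ArrayList<>();
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      // Did any earlier attempt fail with a server error before this one?
      boolean retriedAfterServerError = statuses.stream().anyMatch(s -> s >= 500);
      int status = transport.sendOnce(request);
      statuses.add(status);

      if (status == 200) {
        return "committed";
      }
      if (status == 409) {
        // A 409 right after a 5xx retry may be a self-conflict: the earlier attempt could
        // have been applied even though the gateway reported 502/503/504.
        return retriedAfterServerError ? "commit state unknown" : "commit failed";
      }
      if (status != 429 && status < 500) {
        return "commit failed"; // other definitive client errors, safe to clean up
      }
      // 429 and 5xx fall through to another attempt in this illustration.
    }
    return "commit state unknown"; // retries exhausted without a definitive answer
  }
}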
>>>>
>>>> The Iceberg user issue that accompanied this involved a 504: slack thread - https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1747992294134219
>>>>
>>>> We have similar handling for Glue, as we did here https://github.com/apache/iceberg/pull/7198, since there are internal HTTP retries inside the Glue SDK.
>>>>
>>>> And hence we did it this way. I do agree that 429 / 503 is unnecessarily being wrapped and that we can miss the opportunity to clean up, but it doesn't hurt much; the alternative is to capture all the status codes we got and inspect their permutations and combinations, which I wanted to avoid. Plus, I admit there was a bug in our service, which we are chasing and fixing as well, where it throws 503 instead of 502. But even with that fixed, the issue will still exist due to retries if they are not handled properly.
>>>>
>>>> That was my thought process while doing this, and I'm happy to incorporate your feedback:
>>>>
>>>> Maybe another alternative is that we take out 502 / 504 from the retry code, retry only on 429 / 503, and clean up only when 409 or 503 or 501 is the final status code? The rest are all unknown.
>>>>
>>>> Best,
>>>> Prashant Singh
>>>>
>>>> On Wed, Jun 18, 2025 at 4:21 PM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>
>>>>> Hey Ryan,
>>>>> My reasoning was the following: we retry on 502 / 504 as well here <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/ExponentialHttpRequestRetryStrategy.java#L86>, we have a spec definition stating here <https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L1070> that it can be unknown, so says the RFC <https://datatracker.ietf.org/doc/html/rfc9110#name-502-bad-gateway> as something went wrong in the middle of processing the request. So we retry with 502 and 504 we can conflict with ourselves, as we don't know when we receive the status of 429 is it because we retried on 502 and then got 429 or something else happened, so I thought it's better to throw the commit state unknown if we know the commit has been retried.
>>>>>
>>>>> The Iceberg users who accompanied this was with 504 : slack thread - https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1747992294134219
>>>>>
>>>>> We have similar handling for glue as we did here https://github.com/apache/iceberg/pull/7198 as there are internal http retries over the glue SDK.
>>>>>
>>>>> and hence we did it this way. I do agree 429 / 503 is unnecessary being wrapped and we can miss the opportunity to clean, but it doesn't hurt as alternative is we can capture all the codes we got and inspect the permutation and combination of the status code, just wanted to avoid that, plus i admit there was a bug in our service which we are chasing and fixing as well where its throwing 503 instead of 502.... but the thing is even with that the problem issue will still exist due to retries if they are not handled properly.
>>>>>
>>>>> That was the thought process of mine while doing that, and happy to incorporate your feedback :
>>>>>
>>>>> Maybe another alternative is that we take out 502 / 504 from the retry code and just retry on 429 / 503 and clean up only when 409 or 503 or 501 are the final status code ?
>>>>> rest all are unknown
>>>>>
>>>>> Best,
>>>>> Prashant Singh
>>>>>
>>>>> On Wed, Jun 18, 2025 at 12:22 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>>
>>>>>> Are we confident that the REST client fix is the right path?
>>>>>>
>>>>>> If I understand correctly, the issue is that there is a load balancer that is converting a 500 response to a 503 that the client will automatically retry, even though the 500 response is not retry-able. Then the retry results in a 409 because the commit conflicts with itself.
>>>>>>
>>>>>> I'm fairly concerned about a fix like this that is operating on the assumption that we cannot trust the 5xx error code that was sent to the client. If we don't trust 5xx error codes, then wouldn't it make the most sense to throw CommitStateUnknown for any 5xx response?
>>>>>>
>>>>>> On Tue, Jun 17, 2025 at 1:00 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> Sorry I didn't get to reply here. I think the fix Ajantha is contributing is extremely important, but probably out of scope for a patch release. Because it will take a bit of manual intervention to fix after jumping to the next version, I think we should save this for 1.10.0, which should also come out pretty soon.
>>>>>>>
>>>>>>> On Tue, Jun 17, 2025 at 1:16 PM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thank you for confirming over Slack, Ajantha. I also double-checked with Russell offline; this seems to be a behaviour change which can't go in a patch release, so maybe 1.10 then. So I think we should be good for now. That being said, I will start working on getting an RC out with just this commit <https://github.com/singhpk234/iceberg/commit/14ba08e672df5061f8ac0978ea81f10535da2246> on top of 1.9.1, and will start a thread for that accordingly!
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Prashant Singh
>>>>>>>>
>>>>>>>> On Tue, Jun 17, 2025 at 8:44 AM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hey Ajantha,
>>>>>>>>>
>>>>>>>>> I was going to wait for 2-3 days before cutting an RC anyway :), to see if anyone has an objection or some more *critical* changes to get in. Thank you for the heads-up!
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Prashant Singh
>>>>>>>>>
>>>>>>>>> On Mon, Jun 16, 2025 at 11:02 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I have a PR that needs to be included for 1.9.2 as well!
>>>>>>>>>>
>>>>>>>>>> I was about to start a release discussion for 1.9.2 tomorrow.
>>>>>>>>>>
>>>>>>>>>> It stems from an oversight during the partition stats refactoring (in 1.9.0): we missed that field ids are tracked in the spec, so the schema field ids are not as per the partition stats spec. The solution is to implicitly recompute stats when old stats with invalid field ids are detected. Including the fix in 1.9.2 can reduce the issue's impact, as new stats will have valid schema ids and corrupted stats will also be handled. The feature is not widely integrated yet and engine integrations are in progress for 1.10.0, but we still want to reduce the impact of the issue. I have chatted with Peter and Russell about this.
>>>>>>>>>>
>>>>>>>>>> PR will be shared tomorrow.
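As a rough illustration of that mitigation only: stats whose schema field ids do not match the expected spec ids are recomputed instead of being trusted. All names and field id values below are hypothetical assumptions; this is not the actual Iceberg partition stats code.

// Sketch under assumptions: StatsFile, EXPECTED_FIELD_IDS, and recompute() are invented
// for illustration and do not reflect the real partition stats field id layout.
import java.util.List;

public class PartitionStatsRepairSketch {

  /** Hypothetical view of a stats file: just the field ids of its schema, in order. */
  public record StatsFile(List<Integer> schemaFieldIds) {}

  /** Field ids this sketch assumes the spec mandates (illustrative values only). */
  private static final List<Integer> EXPECTED_FIELD_IDS = List.of(1, 2, 3, 4, 5);

  public static StatsFile readOrRecompute(StatsFile existing) {
    if (existing != null && EXPECTED_FIELD_IDS.equals(existing.schemaFieldIds())) {
      return existing; // stats were written with spec-compliant field ids
    }
    // Old stats with invalid field ids: fall back to recomputing rather than serving
    // potentially corrupted stats.
    return recompute();
  }

  private static StatsFile recompute() {
    // Placeholder: a real implementation would rescan the table and rewrite the stats file.
    return new StatsFile(EXPECTED_FIELD_IDS);
  }
}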
>>>>>>>>>>
>>>>>>>>>> Just giving a heads-up to wait a day or two for the release cut.
>>>>>>>>>>
>>>>>>>>>> - Ajantha
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 16, 2025 at 11:25 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 for a 1.9.2 release
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jun 16, 2025 at 10:53 AM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Kevin,
>>>>>>>>>>>> This goes back well before 1.8; if you look at the issue that my PR refers to, it was reported on Iceberg 1.7, and it has been there since the beginning of the IRC client.
>>>>>>>>>>>> We were having similar debates on whether we should patch all the releases, but I think that requires a wider discussion; 1.9.2 sounds like the immediate next step!
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Prashant Singh
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jun 16, 2025 at 10:42 AM Kevin Liu <kevinjq...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Prashant,
>>>>>>>>>>>>>
>>>>>>>>>>>>> This sounds like a good reason to do a patch release. I'm +1.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do you know if this is also affecting other minor versions? Do we also need to patch 1.8.x?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Kevin Liu
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jun 16, 2025 at 10:30 AM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A couple of users have recently reported a *critical table corruption issue* when using the Iceberg REST catalog client. This issue has existed from the beginning and can result in *committed snapshot data being incorrectly deleted*, effectively rendering the table unusable (i.e., corrupted). The root cause is *self-conflict during internal HTTP client retries*.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> More context can be found here:
>>>>>>>>>>>>>> [1] https://github.com/apache/iceberg/pull/12818#issue-3000325526
>>>>>>>>>>>>>> [2] https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1747992294134219
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We’ve seen this issue hit by users in the wider community and also internally on our end. I’ve submitted a fix here (now merged): https://github.com/apache/iceberg/pull/12818
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Given the *severity* and the *broad version impact*, I’d like to propose cutting a *1.9.2 patch release* to include this fix.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm also happy to *volunteer as the release manager* for this release.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Looking forward to hearing the community's thoughts.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>> Prashant Singh
>>>>>>>>>>>>>