Thanks for all the background, Dennis! I think I agree that we should treat all 5xx errors as CommitStateUnknown because we don't necessarily know what was passed to the service. We would need to update the spec, right?
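To make that concrete, here's a minimal sketch of what "treat every 5xx as CommitStateUnknown" could look like on the client side. The names here (CommitResponseClassifier, CommitOutcome, classify) are made up for illustration and are not the current Iceberg REST client code:

// Hypothetical sketch only -- not the actual Iceberg client. It illustrates the
// "any 5xx maps to commit-state-unknown" classification discussed in this thread.
public class CommitResponseClassifier {

  public enum CommitOutcome {
    SUCCESS,   // 2xx: the commit is confirmed
    RETRYABLE, // pure throttling (429): the request was not processed, retry is safe
    FAILED,    // a clean, known rejection (e.g. a 409 conflict on the first attempt)
    UNKNOWN    // the commit may or may not have been applied; reconcile before retrying
  }

  public static CommitOutcome classify(int httpStatus) {
    if (httpStatus >= 200 && httpStatus < 300) {
      return CommitOutcome.SUCCESS;
    } else if (httpStatus == 429) {
      return CommitOutcome.RETRYABLE;
    } else if (httpStatus >= 500) {
      // A 5xx may come from a load balancer, sidecar, or gateway rather than the
      // catalog itself, so it says nothing reliable about whether the backend
      // applied the change. Surface UNKNOWN and never clean up files here.
      return CommitOutcome.UNKNOWN;
    } else if (httpStatus == 409) {
      // Caveat from this thread: a 409 seen after a low-level retry can be a
      // self-conflict, so it is only a "clean" failure if no retry happened.
      return CommitOutcome.FAILED;
    }
    return CommitOutcome.FAILED; // other 4xx: rejected before any state change
  }

  public static void main(String[] args) {
    for (int status : new int[] {200, 409, 429, 500, 502, 503, 504}) {
      System.out.println(status + " -> " + classify(status));
    }
  }
}

If we go this route, the spec text for 500, 502, 503, and 504 could presumably all point at the same "commit state unknown" semantics instead of describing each code separately.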
On Thu, Jun 26, 2025 at 3:03 PM Dennis Huo <huoi...@gmail.com> wrote: > Discussed offline with various folks who were interested in more details > about what kinds of standard infra components use 503 for things that > should be interpreted as "CommitStateUnknown", so I dug into some common > open-source components to track down common sources of such 503s. > > Two in particular seem to provide the best indication as to the > non-retriability for 503s from mutation requests: > > 1. Istio itself added a flag for its own retry policy default behaviors on > 503: > https://github.com/istio/istio/blob/8132a36e67380f5a171937f203f31ff27ebae86d/releasenotes/notes/50506.yaml#L9 > > **Fixed** default retry policy to exclude retries on 503 which is >> potentially unsafe for idempotent requests. >> It can be reverted by setting "EXCLUDE_UNSAFE_503_FROM_DEFAULT_RETRY" >> to false. Please note that this flag is >> only available for backward compatibility reasons and will be removed >> in future releases > > > 2. Based on some logs that showed 503 coming from > "upstream_reset_before_response_started", this is at least one of the > places Envoy does it: > > > https://github.com/envoyproxy/envoy/blob/87d4c6c372f535ee3e2077526638a696b67bde84/envoy/stream_info/stream_info.h#L216 > > defines EarlyUpstreamReset: > > // The upstream connection was reset before a response was started. This >> // will generally be accompanied by details about why the reset >> occurred. >> const std::string EarlyUpstreamReset = >> "upstream_reset_before_response_started"; > > > This is used here: > https://github.com/envoyproxy/envoy/blob/87d4c6c372f535ee3e2077526638a696b67bde84/source/common/router/router.cc#L996 > > >> // not buffering any data for retry, shadow, and internal redirect, >> and there will be >> // no more upstream request, abort the request and clean up. >> cleanup(); >> callbacks_->sendLocalReply( >> Http::Code::ServiceUnavailable, >> "upstream is closed prematurely during decoding data from >> downstream", modify_headers_, >> absl::nullopt, >> StreamInfo::ResponseCodeDetails::get().EarlyUpstreamReset); >> return Http::FilterDataStatus::StopIterationNoBuffer; > > > On Wed, Jun 18, 2025 at 9:43 PM Dennis Huo <huoi...@gmail.com> wrote: > >> Sounds like the confusion is indeed stemming from anchoring on behaviors >> of 502 and 504 as established prior art while not applying the same initial >> decision factors to 503. >> >> If we don't trust 5xx error codes, then wouldn't it make the most sense >>> to throw CommitStateUnknown for any 5xx response? >> >> >> I think this is the crux of the issue, and I agree with this statement >> entirely. It appears most prudent to treat any 5xx response as something >> that requires higher-level reconciliation of state before attempting any >> retries, so in general: >> >> 1. Any low-level client retries should not happen in response to 5xx >> errors >> 2. These should propagate out as CommitStateUnknown consistently >> >> I think there's some variation in how folks are interpreting the meaning >> of being unable to "trust 5xx error codes" though -- certainly, if 503 is >> more universally understood to mean a *guarantee* that the backend *did >> not* and *will not* process the contents of the request, then we're >> talking about "not trusting that the server returned the correct 5xx error >> code".
>> >> On the other hand, if the 503 RFC is understood to *not make guarantees* >> about whether the backend did or did not start processing contents, then >> we're talking about "don't trust your assumptions about server state after >> receiving a 503". >> >> Based on just a cursory search from different service domains, it looks >> like 503 does not generally provide guarantees about no processing. For >> example: >> >> 1. At least one managed service observes the Istio layer returning 503s >> during a maintenance event when the underlying k8s application did not yet >> produce a response at all >> 2. >> https://www.hostwinds.com/blog/503-error-common-causes-and-how-to-fix-it >> lists "3. Backend Fetch Failed", "4. Connection Timeout", and "7. Out of >> Server Resources" as 503 causes, all of which are semantically closer to a >> 502 than a 429 (where 429 is more similar to the listed "Server Overloaded" >> or "Server Under Maintenance" items) >> 3. >> https://support.google.com/business/thread/10553916/started-getting-this-error-today-googlejsonresponseexception-503-service-unavailable?hl=en >> shows some APIs explicitly stating 503 as a "partially processed" state: >> >> "code" : 503, >> "errors" : [ { >> "domain" : "global", >> "message" : "The service encountered a transient error. Part of the >> request may have succeeeded. Refer to details for which fields failed.", >> "reason" : "backendError" >> } ], >> >> My two cents would be to go ahead with consistent handling of 500, 502, >> 503, and 504 at least. >> >> 501 and 505 might be universally understood to be fast-rejections, but I >> don't feel strongly about those. >> >> On Wed, Jun 18, 2025 at 6:03 PM Prashant Singh <prashant010...@gmail.com> >> wrote: >> >>> EDIT: corrected some typos: >>> >>> Hey Ryan, >>> My reasoning was the following: we retry on 502 / 504 as well here >>> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/ExponentialHttpRequestRetryStrategy.java#L86>, >>> and we have a spec definition here >>> <https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L1070> >>> stating that the result can be unknown; the RFC >>> <https://datatracker.ietf.org/doc/html/rfc9110#name-502-bad-gateway> >>> says the same, since something went wrong in the middle of processing the request. >>> So if we retry on 502 and 504, we can conflict with ourselves: when we receive a 429 or 409, >>> we don't know whether it's because we retried after a 502 and then got the 429 / 409, >>> or because something else happened. So I thought it's better to throw CommitStateUnknown >>> if we know the commit has been retried. >>> >>> The Iceberg user report that accompanied this involved a 504: slack thread - >>> https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1747992294134219 >>> >>> We added similar handling for Glue here >>> https://github.com/apache/iceberg/pull/7198 since there are internal HTTP >>> retries inside the Glue SDK. >>> >>> And hence we did it this way. I do agree that wrapping 429 / 503 is unnecessary >>> and that we can miss the opportunity to clean up, but it doesn't hurt; the alternative >>> is to capture all the codes we got and inspect every permutation and combination of >>> status codes, which I wanted to avoid. Plus, I admit there was a bug in our service, >>> which we are chasing and fixing as well, where it throws 503 instead of 502... but even >>> with that fixed, the issue will still exist if retries are not handled properly.
>>> >>> That was my thought process while doing this, and I'm happy to >>> incorporate your feedback: >>> >>> Maybe another alternative is that we take 502 / 504 out of the retry >>> code, retry only on 429 / 503, and clean up only when 409 or 503 or 501 >>> is the final status code? Everything else is unknown. >>> >>> Best, >>> Prashant Singh >>> >>>> On Wed, Jun 18, 2025 at 12:22 PM Ryan Blue <rdb...@gmail.com> wrote: >>>> >>>>> Are we confident that the REST client fix is the right path? >>>>> >>>>> If I understand correctly, the issue is that there is a load balancer >>>>> that is converting a 500 response to a 503 that the client will >>>>> automatically retry, even though the 500 response is not retry-able. Then >>>>> the retry results in a 409 because the commit conflicts with itself. >>>>> >>>>> I'm fairly concerned about a fix like this that is operating on the >>>>> assumption that we cannot trust the 5xx error code that was sent to the >>>>> client. If we don't trust 5xx error codes, then wouldn't it make the most >>>>> sense to throw CommitStateUnknown for any 5xx response?
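The sequence Ryan describes can be sketched end to end as a toy model. All of the names here (ToyCatalogServer, FlakyProxy, RetrySelfConflictDemo) are hypothetical stand-ins, not Iceberg, Envoy, or load balancer code; the sketch only assumes the intermediary reports 503 after the backend has already applied the commit:

// Toy model of the retry self-conflict: the backend applies the commit, the proxy
// reports 503, a naive client retries, the retry returns 409, and treating that 409
// as a clean failure would trigger cleanup of files that are actually committed.
import java.util.HashSet;
import java.util.Set;

public class RetrySelfConflictDemo {

  // Stand-in for the catalog backend: remembers which snapshots were committed.
  static class ToyCatalogServer {
    private final Set<String> committed = new HashSet<>();

    int commit(String snapshotId) {
      // 409 if this snapshot was already committed by an earlier attempt.
      return committed.add(snapshotId) ? 200 : 409;
    }

    boolean isCommitted(String snapshotId) {
      return committed.contains(snapshotId);
    }
  }

  // Stand-in for a load balancer / sidecar that loses the first response and
  // reports 503 even though the backend already applied the change.
  static class FlakyProxy {
    private final ToyCatalogServer backend;
    private boolean firstAttempt = true;

    FlakyProxy(ToyCatalogServer backend) {
      this.backend = backend;
    }

    int commit(String snapshotId) {
      int backendStatus = backend.commit(snapshotId);
      if (firstAttempt) {
        firstAttempt = false;
        return 503; // the client never learns that the backend actually returned 200
      }
      return backendStatus;
    }
  }

  public static void main(String[] args) {
    ToyCatalogServer server = new ToyCatalogServer();
    FlakyProxy proxy = new FlakyProxy(server);
    String snapshot = "snap-1";

    int status = proxy.commit(snapshot); // attempt 1: applied server-side, client sees 503
    if (status == 503) {
      status = proxy.commit(snapshot);   // naive low-level retry on 503
    }

    // status is now 409, yet the snapshot is committed. Treating this 409 as a normal
    // commit failure (and cleaning up "orphaned" files) corrupts the table.
    System.out.println("final status = " + status + ", committed = " + server.isCommitted(snapshot));
  }
}

Under the "any 5xx is CommitStateUnknown" handling discussed above, the client would stop at the first 503, surface the unknown state, and skip both the low-level retry and any cleanup of files that might belong to a committed snapshot.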
>>>>> >>>>> On Tue, Jun 17, 2025 at 1:00 PM Russell Spitzer < >>>>> russell.spit...@gmail.com> wrote: >>>>> >>>>>> Sorry I didn't get to reply here. I think the fix Ajantha is >>>>>> contributing is extremely important but probably out of scope for a patch >>>>>> release. Because it will take a bit of manual intervention to fix after >>>>>> jumping to the next version, I think we should save this for 1.10.0, which >>>>>> should also come out pretty soon. >>>>>> >>>>>> On Tue, Jun 17, 2025 at 1:16 PM Prashant Singh < >>>>>> prashant010...@gmail.com> wrote: >>>>>> >>>>>>> Thank you for confirming over Slack, Ajantha. I also double-checked >>>>>>> with Russell offline; this seems to be a behaviour change which can't go in >>>>>>> a patch release, so maybe 1.10 then. So I think we should be good for now. >>>>>>> That being said, I will start working on getting an RC out with just >>>>>>> this commit >>>>>>> <https://github.com/singhpk234/iceberg/commit/14ba08e672df5061f8ac0978ea81f10535da2246> >>>>>>> on top of 1.9.1, and will start a thread for that accordingly! >>>>>>> >>>>>>> Best, >>>>>>> Prashant Singh >>>>>>> >>>>>>> On Tue, Jun 17, 2025 at 8:44 AM Prashant Singh < >>>>>>> prashant010...@gmail.com> wrote: >>>>>>> >>>>>>>> Hey Ajantha, >>>>>>>> >>>>>>>> I was going to wait for 2-3 days before cutting an RC anyway :), >>>>>>>> to see if anyone has an objection or some more *critical* changes >>>>>>>> to get in. Thank you for the heads up! >>>>>>>> >>>>>>>> Best, >>>>>>>> Prashant Singh >>>>>>>> >>>>>>>> >>>>>>>> On Mon, Jun 16, 2025 at 11:02 AM Ajantha Bhat < >>>>>>>> ajanthab...@gmail.com> wrote: >>>>>>>> >>>>>>>>> I have a PR that needs to be included for 1.9.2 as well! >>>>>>>>> >>>>>>>>> I was about to start a release discussion for 1.9.2 tomorrow. >>>>>>>>> >>>>>>>>> It stems from an oversight during the partition stats refactoring (in >>>>>>>>> 1.9.0): we missed that field ids are tracked in the spec, so schema field >>>>>>>>> ids are not as per the partition stats spec. The solution is to implicitly >>>>>>>>> recompute stats when old stats are detected with an invalid field id. >>>>>>>>> Including the fix in 1.9.2 can reduce the impact of the issue, as new stats >>>>>>>>> will have valid schema ids and corrupted stats will also be handled. The >>>>>>>>> feature is not widely integrated yet and engine integrations are in >>>>>>>>> progress for 1.10.0, but we still want to reduce the impact of the issue. I >>>>>>>>> have chatted with Peter and Russell about this. >>>>>>>>> >>>>>>>>> The PR will be shared tomorrow. >>>>>>>>> >>>>>>>>> Just giving a heads up to wait a day or two before cutting the release. >>>>>>>>> >>>>>>>>> - Ajantha >>>>>>>>> >>>>>>>>> On Mon, Jun 16, 2025 at 11:25 PM Steven Wu <stevenz...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> +1 for a 1.9.2 release >>>>>>>>>> >>>>>>>>>> On Mon, Jun 16, 2025 at 10:53 AM Prashant Singh < >>>>>>>>>> prashant010...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Hey Kevin, >>>>>>>>>>> This goes back well before 1.8; if you look at the issue that my PR >>>>>>>>>>> refers to, it was reported on Iceberg 1.7, and it has been there since the >>>>>>>>>>> beginning of the IRC client. >>>>>>>>>>> We were having similar debates on whether we should patch all the >>>>>>>>>>> releases, but I think that requires a wider discussion; 1.9.2 >>>>>>>>>>> sounds like the immediate next step!
>>>>>>>>>>> >>>>>>>>>>> Best, >>>>>>>>>>> Prashant Singh >>>>>>>>>>> >>>>>>>>>>> On Mon, Jun 16, 2025 at 10:42 AM Kevin Liu < >>>>>>>>>>> kevinjq...@apache.org> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Prashant, >>>>>>>>>>>> >>>>>>>>>>>> This sounds like a good reason to do a patch release. I'm +1 >>>>>>>>>>>> >>>>>>>>>>>> Do you know if this is also affecting other minor versions? Do >>>>>>>>>>>> we also need to patch 1.8.x? >>>>>>>>>>>> >>>>>>>>>>>> Best, >>>>>>>>>>>> Kevin Liu >>>>>>>>>>>> >>>>>>>>>>>> On Mon, Jun 16, 2025 at 10:30 AM Prashant Singh < >>>>>>>>>>>> prashant010...@gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hey all, >>>>>>>>>>>>> >>>>>>>>>>>>> A couple of users have recently reported a *critical table >>>>>>>>>>>>> corruption issue* when using the Iceberg REST catalog client. >>>>>>>>>>>>> This issue has existed from the beginning and can result in >>>>>>>>>>>>> *committed >>>>>>>>>>>>> snapshot data being incorrectly deleted*, effectively >>>>>>>>>>>>> rendering the table unusable (i.e., corrupted). The root cause is >>>>>>>>>>>>> *self-conflict >>>>>>>>>>>>> during internal HTTP client retries*. >>>>>>>>>>>>> >>>>>>>>>>>>> More context can be found here: >>>>>>>>>>>>> [1] >>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12818#issue-3000325526 >>>>>>>>>>>>> [2] >>>>>>>>>>>>> https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1747992294134219 >>>>>>>>>>>>> >>>>>>>>>>>>> We’ve seen this issue hit by users in the wider community and >>>>>>>>>>>>> also internally on our end. I’ve submitted a fix here (now >>>>>>>>>>>>> merged) : >>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12818 >>>>>>>>>>>>> >>>>>>>>>>>>> Given the *severity* and the *broad version impact*, I’d like >>>>>>>>>>>>> to propose cutting a *1.9.2 patch release* to include this >>>>>>>>>>>>> fix. >>>>>>>>>>>>> >>>>>>>>>>>>> I'm also happy to *volunteer as the release manager* for this >>>>>>>>>>>>> release. >>>>>>>>>>>>> >>>>>>>>>>>>> Looking forward to hearing the community's thoughts. >>>>>>>>>>>>> >>>>>>>>>>>>> Best regards, >>>>>>>>>>>>> Prashant Singh >>>>>>>>>>>>> >>>>>>>>>>>>