If the 1.9.2 release doesn't fix the issue that users were hitting, which, if I understand correctly, was a 503, do we need to do a patch release? Are users hitting 502s and 504s and need the retry to stop?
On Mon, Jul 7, 2025 at 3:11 PM Prashant Singh <prashant010...@gmail.com> wrote:

> Thanks for the background, Dennis! Thanks, Ryan. It seems like we are OK with treating all 5xx errors as CommitStateUnknown and hence not retrying. Considering what the spec says about 502 <https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L1070> and 504 <https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L1084> vs 503 <https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L4648>, I think this requires spec changes, since we would now require 503 to throw CommitStateUnknown.
>
> As 1.9.2 is a patch release, from my understanding it should not introduce behaviour changes, and treating 503 the same as 502 / 504 is a behaviour change, IMHO.
>
> Given that PR #13352 <https://github.com/apache/iceberg/pull/13352> got in, I started a vote thread for it.
>
> Best,
> Prashant Singh
>
> On Thu, Jun 26, 2025 at 3:12 PM Ryan Blue <rdb...@gmail.com> wrote:
>
>> Thanks for all the background, Dennis! I think I agree that we should treat all 5xx errors as CommitStateUnknown because we don't necessarily know what was passed to the service. We would need to update the spec, right?
>>
>> On Thu, Jun 26, 2025 at 3:03 PM Dennis Huo <huoi...@gmail.com> wrote:
>>
>>> I discussed offline with various folks who were interested in more detail about which standard infra components use 503 for things that should be interpreted as "CommitStateUnknown", so I dug into some common open-source components to track down common sources of such 503s.
>>>
>>> Two in particular give the best indication that 503s from mutation requests are not safely retriable:
>>>
>>> 1. Istio itself added a flag for its own retry policy default behaviors on 503:
>>> https://github.com/istio/istio/blob/8132a36e67380f5a171937f203f31ff27ebae86d/releasenotes/notes/50506.yaml#L9
>>>
>>>> **Fixed** default retry policy to exclude retries on 503 which is potentially unsafe for idempotent requets. It can be reverted by setting "EXCLUDE_UNSAFE_503_FROM_DEFAULT_RETRY" to false. Please note that this flag is only available for backward compatibility reasons and will be removed in future releases
>>>
>>> 2. Based on some logs that showed 503 coming from "upstream_reset_before_response_started", this is at least one of the places Envoy does it:
>>>
>>> https://github.com/envoyproxy/envoy/blob/87d4c6c372f535ee3e2077526638a696b67bde84/envoy/stream_info/stream_info.h#L216
>>>
>>> defines EarlyUpstreamReset:
>>>
>>>> // The upstream connection was reset before a response was started. This
>>>> // will generally be accompanied by details about why the reset occurred.
>>>> const std::string EarlyUpstreamReset = "upstream_reset_before_response_started";
>>>
>>> This is used here:
>>> https://github.com/envoyproxy/envoy/blob/87d4c6c372f535ee3e2077526638a696b67bde84/source/common/router/router.cc#L996
>>>
>>>> // not buffering any data for retry, shadow, and internal redirect, and there will be
>>>> // no more upstream request, abort the request and clean up.
>>>> cleanup();
>>>> callbacks_->sendLocalReply(
>>>>     Http::Code::ServiceUnavailable,
>>>>     "upstream is closed prematurely during decoding data from downstream", modify_headers_,
>>>>     absl::nullopt, StreamInfo::ResponseCodeDetails::get().EarlyUpstreamReset);
>>>> return Http::FilterDataStatus::StopIterationNoBuffer;
>>>
>>> On Wed, Jun 18, 2025 at 9:43 PM Dennis Huo <huoi...@gmail.com> wrote:
>>>
>>>> Sounds like the confusion is indeed stemming from anchoring on the behaviors of 502 and 504 as established prior art while not applying the same initial decision factors to 503.
>>>>
>>>>> If we don't trust 5xx error codes, then wouldn't it make the most sense to throw CommitStateUnknown for any 5xx response?
>>>>
>>>> I think this is the crux of the issue, and I agree with this statement entirely. It appears most prudent to treat any 5xx response as something that requires higher-level reconciliation of state before attempting any retries, so in general:
>>>>
>>>> 1. Any low-level client retries should not happen in response to 5xx errors
>>>> 2. These should propagate out as CommitStateUnknown consistently
>>>>
>>>> I think there's some variation in how folks are interpreting the meaning of being unable to "trust 5xx error codes", though -- certainly, if 503 is universally understood to mean a *guarantee* that the backend *did not* and *will not* process the contents of the request, then we're talking about "not trusting that the server returned the correct 5xx error code".
>>>>
>>>> On the other hand, if the 503 RFC is understood to *not make guarantees* about whether the backend did or did not start processing the contents, then we're talking about "don't trust your assumptions about server state after receiving a 503".
>>>>
>>>> Based on just a cursory search across different service domains, it looks like 503 does not generally provide guarantees about no processing. For example:
>>>>
>>>> 1. At least one managed service observes the Istio layer returning 503s during a maintenance event when the underlying k8s application did not yet produce a response at all
>>>>
>>>> 2. https://www.hostwinds.com/blog/503-error-common-causes-and-how-to-fix-it lists "3. Backend Fetch Failed", "4. Connection Timeout", and "7. Out of Server Resources" as 503 causes, all of which are semantically closer to a 502 than a 429 (where 429 is more similar to the listed "Server Overloaded" or "Server Under Maintenance" items)
>>>>
>>>> 3. https://support.google.com/business/thread/10553916/started-getting-this-error-today-googlejsonresponseexception-503-service-unavailable?hl=en shows some APIs explicitly stating that 503 can be a "partially processed" state:
>>>>
>>>> "code" : 503,
>>>> "errors" : [ {
>>>>   "domain" : "global",
>>>>   "message" : "The service encountered a transient error. Part of the request may have succeeeded. Refer to details for which fields failed.",
>>>>   "reason" : "backendError"
>>>> } ],
>>>>
>>>> My two cents would be to go ahead with consistent handling of 500, 502, 503, and 504 at least.
>>>>
>>>> 501 and 505 might be universally understood to be fast rejections, but I don't feel strongly about those.
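A minimal, illustrative sketch of the classification Dennis argues for above. The names here (CommitOutcome, CommitStatusClassifier) are made up for this sketch and are not the actual Iceberg REST client code:

    // Illustrative only: made-up names, not the actual Iceberg REST client code.
    // Classifies the final HTTP status of a commit request, treating every 5xx
    // as "commit state unknown" as proposed in this thread.
    enum CommitOutcome { SUCCESS, RETRYABLE, FAILED_CLEAN, STATE_UNKNOWN }

    final class CommitStatusClassifier {
      static CommitOutcome classify(int httpStatus) {
        if (httpStatus >= 200 && httpStatus < 300) {
          return CommitOutcome.SUCCESS;
        }
        if (httpStatus == 429) {
          // Throttled before the change was applied; safe to retry with backoff.
          return CommitOutcome.RETRYABLE;
        }
        if (httpStatus >= 500) {
          // 500 / 502 / 503 / 504 (and any other 5xx): proxies such as Envoy/Istio can
          // return these after the request reached the backend, so the commit may or
          // may not have been applied. Neither retry nor clean up.
          return CommitOutcome.STATE_UNKNOWN;
        }
        // Remaining 4xx (e.g. 409 conflict): a definitive rejection, so the files
        // written for the attempted snapshot can be cleaned up.
        return CommitOutcome.FAILED_CLEAN;
      }
    }

The point of routing every 5xx to STATE_UNKNOWN is that the caller neither retries nor deletes the files written for the attempted snapshot until it can reconcile the actual table state.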
>>> On Wed, Jun 18, 2025 at 6:03 PM Prashant Singh <prashant010...@gmail.com> wrote:
>>>
>>>>> EDIT: corrected some typos:
>>>>>
>>>>> Hey Ryan,
>>>>>
>>>>> My reasoning was the following: we retry on 502 / 504 as well here <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/ExponentialHttpRequestRetryStrategy.java#L86>, we have a spec definition here <https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L1070> stating that it can be unknown, and the RFC <https://datatracker.ietf.org/doc/html/rfc9110#name-502-bad-gateway> likewise says that something went wrong in the middle of processing the request. So if we retry on 502 and 504 we can conflict with ourselves: when we receive a 409 we don't know whether it is because we retried on a 502 and then got the 409, or because something else happened. So I thought it's better to throw CommitStateUnknown if we know the commit has been retried.
>>>>>
>>>>> The issue Iceberg users hit involved a 504; Slack thread: https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1747992294134219
>>>>>
>>>>> We have similar handling for Glue, as we did in https://github.com/apache/iceberg/pull/7198, since there are internal HTTP retries inside the Glue SDK.
>>>>>
>>>>> That is why we did it this way. I do agree that wrapping 429 / 503 is unnecessary and that we can miss the opportunity to clean up, but it doesn't hurt; the alternative is to capture all the status codes we received and inspect every permutation and combination of them, which I wanted to avoid. I'll also admit there was a bug in our service, which we are chasing and fixing, where it throws 503 instead of 502; but even with that fixed, the issue will still exist if retries are not handled properly.
>>>>>
>>>>> That was my thought process, and I'm happy to incorporate your feedback.
>>>>>
>>>>> Maybe another alternative is to take 502 / 504 out of the retry code, retry only on 429 / 503, and clean up only when 409, 503, or 501 is the final status code; everything else is unknown.
>>>>>
>>>>> Best,
>>>>> Prashant Singh
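A small sketch of the guard Prashant describes above, reusing the made-up CommitOutcome enum from the earlier sketch; this is not the code of the merged PR. Once the HTTP layer has retried a commit request, a final 409 can no longer be trusted as a clean failure, so it is surfaced as unknown instead of triggering cleanup:

    // Illustrative only: builds on the CommitStatusClassifier sketch above.
    final class RetriedCommitClassifier {
      static CommitOutcome classifyFinal(int finalStatus, boolean requestWasRetried) {
        if (finalStatus == 409 && requestWasRetried) {
          // The conflict may be with our own earlier attempt that actually landed,
          // so deleting the "failed" snapshot's files here could corrupt the table.
          return CommitOutcome.STATE_UNKNOWN;
        }
        return CommitStatusClassifier.classify(finalStatus);
      }
    }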
>>>>> On Wed, Jun 18, 2025 at 4:21 PM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>
>>>>>> Hey Ryan,
>>>>>>
>>>>>> My reasoning was the following: we retry on 502 / 504 as well here <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/ExponentialHttpRequestRetryStrategy.java#L86>, we have a spec definition here <https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L1070> stating that it can be unknown, and the RFC <https://datatracker.ietf.org/doc/html/rfc9110#name-502-bad-gateway> likewise says that something went wrong in the middle of processing the request. So if we retry on 502 and 504 we can conflict with ourselves: when we receive a 429 we don't know whether it is because we retried on a 502 and then got the 429, or because something else happened. So I thought it's better to throw CommitStateUnknown if we know the commit has been retried.
>>>>>>
>>>>>> The issue Iceberg users hit involved a 504; Slack thread: https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1747992294134219
>>>>>>
>>>>>> We have similar handling for Glue, as we did in https://github.com/apache/iceberg/pull/7198, since there are internal HTTP retries inside the Glue SDK.
>>>>>>
>>>>>> That is why we did it this way. I do agree that wrapping 429 / 503 is unnecessary and that we can miss the opportunity to clean up, but it doesn't hurt; the alternative is to capture all the status codes we received and inspect every permutation and combination of them, which I wanted to avoid. I'll also admit there was a bug in our service, which we are chasing and fixing, where it throws 503 instead of 502; but even with that fixed, the issue will still exist if retries are not handled properly.
>>>>>>
>>>>>> That was my thought process, and I'm happy to incorporate your feedback.
>>>>>>
>>>>>> Maybe another alternative is to take 502 / 504 out of the retry code, retry only on 429 / 503, and clean up only when 409, 503, or 501 is the final status code; everything else is unknown.
>>>>>>
>>>>>> Best,
>>>>>> Prashant Singh
>>>>>>
>>>>>> On Wed, Jun 18, 2025 at 12:22 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>>>
>>>>>>> Are we confident that the REST client fix is the right path?
>>>>>>>
>>>>>>> If I understand correctly, the issue is that there is a load balancer that is converting a 500 response to a 503 that the client will automatically retry, even though the 500 response is not retryable. Then the retry results in a 409 because the commit conflicts with itself.
>>>>>>>
>>>>>>> I'm fairly concerned about a fix like this that is operating on the assumption that we cannot trust the 5xx error code that was sent to the client. If we don't trust 5xx error codes, then wouldn't it make the most sense to throw CommitStateUnknown for any 5xx response?
>>>>>>>
>>>>>>> On Tue, Jun 17, 2025 at 1:00 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Sorry I didn't get to reply here. I think the fix Ajantha is contributing is extremely important but probably out of scope for a patch release. Because it will take a bit of manual intervention to fix after jumping to the next version, I think we should save it for 1.10.0, which should also come out pretty soon.
>>>>>>>>
>>>>>>>> On Tue, Jun 17, 2025 at 1:16 PM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thank you for confirming over Slack, Ajantha. I also double-checked with Russell offline; this seems to be a behaviour change which can't go in a patch release, so maybe 1.10 then. I think we should be good for now. That being said, I will start working on getting an RC out with just this commit <https://github.com/singhpk234/iceberg/commit/14ba08e672df5061f8ac0978ea81f10535da2246> on top of 1.9.1, and will start a thread for that accordingly.
>>>>>>>>> Best,
>>>>>>>>> Prashant Singh
>>>>>>>>>
>>>>>>>>> On Tue, Jun 17, 2025 at 8:44 AM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Ajantha,
>>>>>>>>>>
>>>>>>>>>> I was going to wait 2-3 days before cutting an RC anyway :), to see if anyone has objections or some more *critical* changes to get in. Thank you for the heads-up!
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Prashant Singh
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 16, 2025 at 11:02 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I have a PR that needs to be included in 1.9.2 as well!
>>>>>>>>>>>
>>>>>>>>>>> I was about to start a release discussion for 1.9.2 tomorrow.
>>>>>>>>>>>
>>>>>>>>>>> It stems from an oversight during the partition stats refactoring (in 1.9.0): we missed that field ids are tracked in the spec, so the schema field ids are not as per the partition stats spec. The solution is to implicitly recompute stats when old stats with invalid field ids are detected. Including the fix in 1.9.2 reduces the impact of the issue, as new stats will have valid schema ids and corrupted stats will be handled. The feature is not widely integrated yet and engine integrations are in progress for 1.10.0, but we still want to reduce the impact of the issue. I have chatted with Peter and Russell about this.
>>>>>>>>>>>
>>>>>>>>>>> The PR will be shared tomorrow.
>>>>>>>>>>>
>>>>>>>>>>> Just giving a heads-up to wait a day or two before the release cut.
>>>>>>>>>>>
>>>>>>>>>>> - Ajantha
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jun 16, 2025 at 11:25 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1 for a 1.9.2 release
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jun 16, 2025 at 10:53 AM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey Kevin,
>>>>>>>>>>>>>
>>>>>>>>>>>>> This goes back well before 1.8; if you look at the issue my PR refers to, it was reported on Iceberg 1.7. It has been there since the beginning of the IRC client.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We were having similar debates about whether we should patch all the releases, but I think that requires a wider discussion; 1.9.2 sounds like the immediate next step!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Prashant Singh
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jun 16, 2025 at 10:42 AM Kevin Liu <kevinjq...@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Prashant,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This sounds like a good reason to do a patch release. I'm +1.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you know if this is also affecting other minor versions? Do we also need to patch 1.8.x?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Kevin Liu
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Jun 16, 2025 at 10:30 AM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hey all,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> A couple of users have recently reported a *critical table corruption issue* when using the Iceberg REST catalog client. This issue has existed from the beginning and can result in *committed snapshot data being incorrectly deleted*, effectively rendering the table unusable (i.e., corrupted). The root cause is *self-conflict during internal HTTP client retries*.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> More context can be found here:
>>>>>>>>>>>>>>> [1] https://github.com/apache/iceberg/pull/12818#issue-3000325526
>>>>>>>>>>>>>>> [2] https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1747992294134219
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We’ve seen this issue hit users in the wider community and also internally on our end. I’ve submitted a fix here (now merged): https://github.com/apache/iceberg/pull/12818
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Given the *severity* and the *broad version impact*, I’d like to propose cutting a *1.9.2 patch release* to include this fix.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm also happy to *volunteer as the release manager* for this release.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looking forward to hearing the community's thoughts.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>> Prashant Singh
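To make the retry-layer side of the discussion concrete, here is a hedged, illustrative sketch; the class name and policy are made up, and this is neither Iceberg's ExponentialHttpRequestRetryStrategy nor the merged fix. It shows an Apache HttpClient 5 retry strategy that retries only throttling (429) responses, so every 5xx propagates to the commit path, which can then classify it as CommitStateUnknown as discussed above:

    import java.io.IOException;

    import org.apache.hc.client5.http.HttpRequestRetryStrategy;
    import org.apache.hc.core5.http.HttpRequest;
    import org.apache.hc.core5.http.HttpResponse;
    import org.apache.hc.core5.http.protocol.HttpContext;
    import org.apache.hc.core5.util.TimeValue;

    // Illustrative only: retries are limited to 429 responses; every 5xx is passed
    // through so the commit path can decide whether the commit state is unknown.
    public class ThrottleOnlyRetryStrategy implements HttpRequestRetryStrategy {
      private final int maxRetries;

      public ThrottleOnlyRetryStrategy(int maxRetries) {
        this.maxRetries = maxRetries;
      }

      @Override
      public boolean retryRequest(HttpRequest request, IOException exception, int execCount, HttpContext context) {
        // No transport-level retries: a commit request that may have reached the
        // server is not safe to replay blindly.
        return false;
      }

      @Override
      public boolean retryRequest(HttpResponse response, int execCount, HttpContext context) {
        // 429 means the service throttled the request before applying it; anything
        // else, including 500 / 502 / 503 / 504, is not retried here.
        return execCount <= maxRetries && response.getCode() == 429;
      }

      @Override
      public TimeValue getRetryInterval(HttpResponse response, int execCount, HttpContext context) {
        // Simple exponential backoff: 1s, 2s, 4s, ...
        return TimeValue.ofSeconds(1L << Math.min(execCount - 1, 4));
      }
    }

Wiring it in would look something like HttpClients.custom().setRetryStrategy(new ThrottleOnlyRetryStrategy(5)).build(); whether retries belong at this layer at all, versus only in the commit logic, is exactly the trade-off debated in this thread.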