Discussed offline with several folks who wanted more detail on which kinds of
standard infra components return 503 for conditions that should be interpreted
as "CommitStateUnknown", so I dug into a few common open-source components to
track down typical sources of such 503s.

Two in particular seem to provide the clearest indication that 503s from
mutation requests are not safely retriable:

1. Istio itself changed its default retry policy behavior for 503 and added a
flag to control it:
https://github.com/istio/istio/blob/8132a36e67380f5a171937f203f31ff27ebae86d/releasenotes/notes/50506.yaml#L9

> **Fixed** default retry policy to exclude retries on 503 which is
> potentially unsafe for idempotent requests.
> It can be reverted by setting "EXCLUDE_UNSAFE_503_FROM_DEFAULT_RETRY" to
> false. Please note that this flag is only available for backward
> compatibility reasons and will be removed in future releases


2. Based on logs that showed 503s with the response detail
"upstream_reset_before_response_started", this is at least one of the places
where Envoy generates them:

https://github.com/envoyproxy/envoy/blob/87d4c6c372f535ee3e2077526638a696b67bde84/envoy/stream_info/stream_info.h#L216

defines EarlyUpstreamReset:

  // The upstream connection was reset before a response was started. This
  // will generally be accompanied by details about why the reset occurred.
  const std::string EarlyUpstreamReset = "upstream_reset_before_response_started";


This is used here:
https://github.com/envoyproxy/envoy/blob/87d4c6c372f535ee3e2077526638a696b67bde84/source/common/router/router.cc#L996


      // not buffering any data for retry, shadow, and internal redirect, and
      // there will be no more upstream request, abort the request and clean up.
      cleanup();
      callbacks_->sendLocalReply(
          Http::Code::ServiceUnavailable,
          "upstream is closed prematurely during decoding data from downstream", modify_headers_,
          absl::nullopt, StreamInfo::ResponseCodeDetails::get().EarlyUpstreamReset);
      return Http::FilterDataStatus::StopIterationNoBuffer;

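To make the takeaway concrete, here is a minimal, purely illustrative Java
sketch (not Iceberg's actual retry strategy and not any real HttpClient API;
the class, method, and policy are mine) of the Istio-style stance above:
exclude 503 from automatic retries when the request is a mutation, since the
503 may have come from a proxy after the backend already started work.

// Illustrative sketch only; class name, method, and policy are hypothetical.
public final class UnsafeStatusRetryCheck {

  /** Decide whether the HTTP layer may transparently retry a request. */
  public static boolean shouldRetry(String method, int statusCode) {
    boolean idempotent = "GET".equals(method) || "HEAD".equals(method);
    if (statusCode == 429) {
      return true; // throttled before processing; safe to retry with backoff
    }
    if (statusCode == 503) {
      return idempotent; // unsafe for mutations: the backend may have started work
    }
    return false; // other 5xx: never auto-retry a mutation; surface the outcome instead
  }

  public static void main(String[] args) {
    System.out.println(shouldRetry("GET", 503));  // true
    System.out.println(shouldRetry("POST", 503)); // false -> report commit state as unknown
  }
}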

On Wed, Jun 18, 2025 at 9:43 PM Dennis Huo <huoi...@gmail.com> wrote:

> Sounds like the confusion is indeed stemming from anchoring on behaviors
> of 502 and 504 as established prior art while not applying the same initial
> decision factors to 503.
>
>> If we don't trust 5xx error codes, then wouldn't it make the most sense to
>> throw CommitStateUnknown for any 5xx response?
>
>
> I think this is the crux of the issue, and I agree with this statement
> entirely. It appears most prudent to treat any 5xx response as something
> that requires higher-level reconciliation of state before attempting any
> retries, so in general:
>
> 1. Any low-level client retries should not happen in response to 5xx errors
> 2. These should propagate out as CommitStateUnknown consistently
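>
> To make (2) concrete, here is a rough sketch of the mapping I have in mind
> (the enum and class names are made up for illustration; this is not the
> actual client code):
>
> // Illustrative only: once low-level retries are disabled for 5xx, the
> // commit path can map any terminal 5xx to an "unknown" outcome.
> enum CommitOutcome { SUCCESS, FAILED_CLEAN, STATE_UNKNOWN }
>
> final class CommitResponseClassifier {
>   static CommitOutcome classify(int statusCode) {
>     if (statusCode >= 200 && statusCode < 300) {
>       return CommitOutcome.SUCCESS;
>     }
>     if (statusCode >= 500 && statusCode < 600) {
>       // The backend, or a proxy in front of it, may or may not have applied
>       // the commit; require higher-level reconciliation before any retry.
>       return CommitOutcome.STATE_UNKNOWN;
>     }
>     return CommitOutcome.FAILED_CLEAN; // e.g. 4xx validation/conflict errors
>   }
> }
>
> With that mapping, classify(502), classify(503), and classify(504) all come
> back STATE_UNKNOWN, while a plain 409 with no retries involved stays
> FAILED_CLEAN.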
>
> I think there's some variation in how folks are interpreting the meaning
> of being unable to "trust 5xx error codes" though -- certainly, if 503 is
> more universally understood to mean a *guarantee* that the backend *did
> not* and *will not* process the contents of the request, then we're
> talking about "not trusting that the server returned the correct 5xx error
> code".
>
> On the other hand, if the 503 RFC is understood to *not make guarantees*
> about whether the backend did or did not start processing contents, then
> we're talking about "don't trust your assumptions about server state after
> receiving a 503".
>
> Based on just a cursory search from different service domains, it looks
> like 503 does not generally provide guarantees about no processing. For
> example:
>
> 1. At least one managed service observes the Istio layer returning 503s
> during a maintenance event when the underlying k8s application did not yet
> produce a response at all
> 2.
> https://www.hostwinds.com/blog/503-error-common-causes-and-how-to-fix-it
> lists "3. Backend Fetch Failed", "4. Connection Timeout", and "7. Out of
> Server Resources" as 503 causes, all of which are semantically closer to a
> 502 than a 429 (where 429 is more similar to the listed "Server Overloaded"
> or "Server Under Maintenance" items)
> 3.
> https://support.google.com/business/thread/10553916/started-getting-this-error-today-googlejsonresponseexception-503-service-unavailable?hl=en
> shows some APIs explicitly describing 503 as a possibly "partially
> processed" state:
>
>   "code" : 503,
>   "errors" : [ {
>     "domain" : "global",
>     "message" : "The service encountered a transient error. Part of the
> request may have succeeeded. Refer to details for which fields failed.",
>     "reason" : "backendError"
>   } ],
>
> My two cents would be to go ahead with consistent handling of 500, 502,
> 503, and 504 at least.
>
> 501 and 505 might be universally understood to be fast-rejections, but I
> don't feel strongly about those.
>
> On Wed, Jun 18, 2025 at 6:03 PM Prashant Singh <prashant010...@gmail.com>
> wrote:
>
>> EDIT: corrected some typos:
>>
>> Hey Ryan,
>> My reasoning was the following: we also retry on 502 / 504 here
>> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/ExponentialHttpRequestRetryStrategy.java#L86>,
>> we have a spec definition here
>> <https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L1070>
>> stating that the commit state can be unknown, and so does the RFC
>> <https://datatracker.ietf.org/doc/html/rfc9110#name-502-bad-gateway>, since
>> something went wrong in the middle of processing the request. So if we retry
>> on 502 and 504 we can conflict with ourselves: when we receive a 429 / 409,
>> we don't know whether it's because we retried on a 502 and then got the
>> 429 / 409, or because something else happened. That's why I thought it's
>> better to throw the commit state unknown if we know the commit has been retried.
>>
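>> Roughly what I mean, as an illustration only (this is not the actual patch;
>> the class and method are hypothetical):
>>
>> // Timeline that makes the final 409 / 429 ambiguous:
>> //   attempt 1: commit request -> the backend applies it, but the response is
>> //              lost and an intermediary returns 502 / 504 to the client
>> //   attempt 2: transparent HTTP-layer retry of the same commit -> 409,
>> //              because it conflicts with the change attempt 1 already applied
>> final class RetriedCommitCheck {
>>   static boolean commitStateUnknown(boolean wasRetried, int finalStatus) {
>>     if (finalStatus >= 500 && finalStatus < 600) {
>>       return true; // terminal 5xx: never assume the commit was dropped
>>     }
>>     // After a transport-level retry, a 409 / 429 may be caused by our own
>>     // earlier attempt, so the commit state cannot be trusted either way.
>>     return wasRetried && (finalStatus == 409 || finalStatus == 429);
>>   }
>> }
>>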
>> The Iceberg user report that accompanied this involved a 504: slack thread -
>> https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1747992294134219
>>
>> We have similar handling for Glue, as we did in
>> https://github.com/apache/iceberg/pull/7198, since there are internal HTTP
>> retries in the Glue SDK.
>>
>> Hence we did it this way. I do agree that wrapping 429 / 503 is unnecessary
>> and that we can miss the opportunity to clean up, but it doesn't hurt. The
>> alternative is to capture all the status codes we received and inspect every
>> permutation and combination of them, which I wanted to avoid. I'll also admit
>> there was a bug in our service, which we are chasing and fixing, where it
>> throws 503 instead of 502... but even with that fixed, the problem will still
>> exist if the retries are not handled properly.
>>
>> That was my thought process while doing this, and I'm happy to incorporate
>> your feedback.
>>
>> Maybe another alternative is to take 502 / 504 out of the retry code, retry
>> only on 429 / 503, and clean up only when 409, 503, or 501 is the final
>> status code? Everything else would be unknown.
>>
>> Best,
>> Prashant Singh
>>
>> On Wed, Jun 18, 2025 at 4:21 PM Prashant Singh <prashant010...@gmail.com>
>> wrote:
>>
>>>
>>> On Wed, Jun 18, 2025 at 12:22 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>
>>>> Are we confident that the REST client fix is the right path?
>>>>
>>>> If I understand correctly, the issue is that there is a load balancer
>>>> that is converting a 500 response to a 503 that the client will
>>>> automatically retry, even though the 500 response is not retry-able. Then
>>>> the retry results in a 409 because the commit conflicts with itself.
>>>>
>>>> I'm fairly concerned about a fix like this that is operating on the
>>>> assumption that we cannot trust the 5xx error code that was sent to the
>>>> client. If we don't trust 5xx error codes, then wouldn't it make the most
>>>> sense to throw CommitStateUnknown for any 5xx response?
>>>>
>>>> On Tue, Jun 17, 2025 at 1:00 PM Russell Spitzer <
>>>> russell.spit...@gmail.com> wrote:
>>>>
>>>>> Sorry I didn't get to reply here, I think the fix Ajantha is
>>>>> contributing is extremely important but probably out of scope for a patch
>>>>> release. Because it will take a bit of manual intervention to fix after
>>>>> jumping to the next version I think we should save this for 1.10.0 which
>>>>> also should come out pretty soon.
>>>>>
>>>>> On Tue, Jun 17, 2025 at 1:16 PM Prashant Singh <
>>>>> prashant010...@gmail.com> wrote:
>>>>>
>>>>>> Thank you for confirming over Slack, Ajantha. I also double-checked with
>>>>>> Russell offline; this seems to be a behaviour change that can't go in a
>>>>>> patch release, so maybe 1.10 then. I think we should be good for now.
>>>>>> That being said, I will start working on getting an RC out with just this
>>>>>> commit
>>>>>> <https://github.com/singhpk234/iceberg/commit/14ba08e672df5061f8ac0978ea81f10535da2246>
>>>>>> on top of 1.9.1, and will start a thread for that accordingly!
>>>>>>
>>>>>> Best,
>>>>>> Prashant Singh
>>>>>>
>>>>>> On Tue, Jun 17, 2025 at 8:44 AM Prashant Singh <
>>>>>> prashant010...@gmail.com> wrote:
>>>>>>
>>>>>>> Hey Ajantha,
>>>>>>>
>>>>>>> I was going to wait for 2-3 days before cutting an RC anyway :), to see if
>>>>>>> anyone has an objection or some more *critical* changes to get in. Thank
>>>>>>> you for the heads up!
>>>>>>>
>>>>>>> Best,
>>>>>>> Prashant Singh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jun 16, 2025 at 11:02 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I have a PR that needs to be included for 1.9.2 as well!
>>>>>>>>
>>>>>>>> I was about to start a release discussion for 1.9.2 tomorrow.
>>>>>>>>
>>>>>>>> It stems from an oversight during the partition stats refactoring (in
>>>>>>>> 1.9.0): we missed that field ids are tracked in the spec, so the schema
>>>>>>>> field ids are not as per the partition stats spec. The solution is to
>>>>>>>> implicitly recompute stats when old stats with an invalid field id are
>>>>>>>> detected. Including the fix in 1.9.2 can reduce the impact of the issue,
>>>>>>>> since new stats will have valid schema ids and corrupted stats will also
>>>>>>>> be handled. The feature is not widely integrated yet and engine
>>>>>>>> integrations are in progress for 1.10.0, but we still want to reduce the
>>>>>>>> impact of the issue. I have chatted with Peter and Russell about this.
>>>>>>>>
>>>>>>>> PR will be shared tomorrow.
>>>>>>>>
>>>>>>>> Just giving a heads up to wait a day or two for the release cut.
>>>>>>>>
>>>>>>>> - Ajantha
>>>>>>>>
>>>>>>>> On Mon, Jun 16, 2025 at 11:25 PM Steven Wu <stevenz...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1 for a 1.9.2 release
>>>>>>>>>
>>>>>>>>> On Mon, Jun 16, 2025 at 10:53 AM Prashant Singh <
>>>>>>>>> prashant010...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Kevin,
>>>>>>>>>> This goes back well before 1.8; if you look at the issue my PR refers to,
>>>>>>>>>> it was reported on Iceberg 1.7, and it has been there since the beginning
>>>>>>>>>> of the IRC client.
>>>>>>>>>> We were having similar debates on whether we should patch all the releases,
>>>>>>>>>> but I think that requires a wider discussion; 1.9.2 sounds like the
>>>>>>>>>> immediate next step!
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Prashant Singh
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 16, 2025 at 10:42 AM Kevin Liu <kevinjq...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Prashant,
>>>>>>>>>>>
>>>>>>>>>>> This sounds like a good reason to do a patch release. I'm +1
>>>>>>>>>>>
>>>>>>>>>>> Do you know if this is also affecting other minor versions? Do
>>>>>>>>>>> we also need to patch 1.8.x?
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Kevin Liu
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jun 16, 2025 at 10:30 AM Prashant Singh <
>>>>>>>>>>> prashant010...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey all,
>>>>>>>>>>>>
>>>>>>>>>>>> A couple of users have recently reported a *critical table
>>>>>>>>>>>> corruption issue* when using the Iceberg REST catalog client.
>>>>>>>>>>>> This issue has existed from the beginning and can result in 
>>>>>>>>>>>> *committed
>>>>>>>>>>>> snapshot data being incorrectly deleted*, effectively
>>>>>>>>>>>> rendering the table unusable (i.e., corrupted). The root cause is 
>>>>>>>>>>>> *self-conflict
>>>>>>>>>>>> during internal HTTP client retries*.
>>>>>>>>>>>>
>>>>>>>>>>>> More context can be found here:
>>>>>>>>>>>> [1]
>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12818#issue-3000325526
>>>>>>>>>>>> [2]
>>>>>>>>>>>> https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1747992294134219
>>>>>>>>>>>>
>>>>>>>>>>>> We’ve seen this issue hit by users in the wider community and
>>>>>>>>>>>> also internally on our end. I’ve submitted a fix here (now merged):
>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12818
>>>>>>>>>>>>
>>>>>>>>>>>> Given the *severity* and the *broad version impact*, I’d like
>>>>>>>>>>>> to propose cutting a *1.9.2 patch release* to include this fix.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm also happy to *volunteer as the release manager* for this
>>>>>>>>>>>> release.
>>>>>>>>>>>>
>>>>>>>>>>>> Looking forward to hearing the community's thoughts.
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Prashant Singh
>>>>>>>>>>>>
>>>>>>>>>>>
