Thanks for the background, Dennis! Thanks, Ryan. It seems like we are OK with treating all 5xx errors as CommitStateUnknown and hence not retrying. Considering what the spec says about 502 <https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L1070> and 504 <https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L1084> vs. 503 <https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L4648>, I think this requires spec changes, since we would now require 503 to throw CommitStateUnknown.
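For illustration only, here is a minimal sketch of that rule: any 5xx on a commit surfaces as commit-state-unknown instead of being retried. All names in it are hypothetical; it is not the actual Iceberg REST client code (ErrorHandlers / HTTPClient), it only shows the decision being discussed.

// Sketch only: "any 5xx on a commit means the commit state is unknown, so do not retry".
// CommitHttpHandler, CommitUnknownException, and HttpResult are hypothetical names, not
// the actual Iceberg REST client classes.
public class CommitHttpHandler {

  /** Hypothetical stand-in for org.apache.iceberg.exceptions.CommitStateUnknownException. */
  public static class CommitUnknownException extends RuntimeException {
    public CommitUnknownException(String message) {
      super(message);
    }
  }

  /** Hypothetical holder for an HTTP response to a commit request. */
  public record HttpResult(int status, String body) {}

  public HttpResult handleCommitResponse(HttpResult response) {
    int status = response.status();
    if (status / 100 == 2) {
      return response; // commit confirmed by the service
    }
    if (status / 100 == 5) {
      // 500/502/503/504 give no guarantee about whether the backend applied the change,
      // so surface the uncertainty instead of retrying (a retry can conflict with itself).
      throw new CommitUnknownException("Commit state unknown, server returned HTTP " + status);
    }
    // 4xx responses are treated as definitive (e.g. 409 conflict) and handled separately.
    throw new IllegalStateException("Commit rejected with HTTP " + status);
  }
}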
Since 1.9.2 is a patch release, it should not, as I understand it, introduce behaviour changes, and treating 503 the same as 502 / 504 would be a behaviour change, IMHO. Given that PR-13352 <https://github.com/apache/iceberg/pull/13352> got in, I started a vote thread for it.

Best,
Prashant Singh

On Thu, Jun 26, 2025 at 3:12 PM Ryan Blue <rdb...@gmail.com> wrote:

> Thanks for all the background, Dennis! I think I agree that we should treat all 5xx errors as CommitStateUnknown because we don't necessarily know what was passed to the service. We would need to update the spec, right?
>
> On Thu, Jun 26, 2025 at 3:03 PM Dennis Huo <huoi...@gmail.com> wrote:
>
>> Discussed offline with various folks who were interested in more details about what kinds of standard infra components use 503 for things that should be interpreted as "CommitStateUnknown", so I dug into some common open-source components to track down common sources of such 503s.
>>
>> Two in particular seem to provide the best indication as to the non-retriability of 503s from mutation requests:
>>
>> 1. Istio itself added a flag for its own retry policy default behaviors on 503:
>> https://github.com/istio/istio/blob/8132a36e67380f5a171937f203f31ff27ebae86d/releasenotes/notes/50506.yaml#L9
>>
>>> **Fixed** default retry policy to exclude retries on 503 which is potentially unsafe for idempotent requets.
>>> It can be reverted by setting "EXCLUDE_UNSAFE_503_FROM_DEFAULT_RETRY" to false. Please note that this flag is only available for backward compatibility reasons and will be removed in future releases
>>
>> 2. Based on some logs that showed 503 coming from "upstream_reset_before_response_started", this is at least one of the places Envoy does it:
>>
>> https://github.com/envoyproxy/envoy/blob/87d4c6c372f535ee3e2077526638a696b67bde84/envoy/stream_info/stream_info.h#L216 defines EarlyUpstreamReset:
>>
>>> // The upstream connection was reset before a response was started. This
>>> // will generally be accompanied by details about why the reset occurred.
>>> const std::string EarlyUpstreamReset = "upstream_reset_before_response_started";
>>
>> This is used here:
>> https://github.com/envoyproxy/envoy/blob/87d4c6c372f535ee3e2077526638a696b67bde84/source/common/router/router.cc#L996
>>
>>> // not buffering any data for retry, shadow, and internal redirect, and there will be
>>> // no more upstream request, abort the request and clean up.
>>> cleanup();
>>> callbacks_->sendLocalReply(
>>>     Http::Code::ServiceUnavailable,
>>>     "upstream is closed prematurely during decoding data from downstream", modify_headers_,
>>>     absl::nullopt, StreamInfo::ResponseCodeDetails::get().EarlyUpstreamReset);
>>> return Http::FilterDataStatus::StopIterationNoBuffer;
>>
>> On Wed, Jun 18, 2025 at 9:43 PM Dennis Huo <huoi...@gmail.com> wrote:
>>
>>> Sounds like the confusion is indeed stemming from anchoring on behaviors of 502 and 504 as established prior art while not applying the same initial decision factors to 503.
>>>
>>>> If we don't trust 5xx error codes, then wouldn't it make the most sense to throw CommitStateUnknown for any 5xx response?
>>>
>>> I think this is the crux of the issue, and I agree with this statement entirely. It appears most prudent to treat any 5xx response as something that requires higher-level reconciliation of state before attempting any retries, so in general:
>>>
>>> 1. Any low-level client retries should not happen in response to 5xx errors
>>> 2. These should propagate out as CommitStateUnknown consistently
>>>
>>> I think there's some variation in how folks are interpreting the meaning of being unable to "trust 5xx error codes" though -- certainly, if 503 is more universally understood to mean a *guarantee* that the backend *did not* and *will not* process the contents of the request, then we're talking about "not trusting that the server returned the correct 5xx error code".
>>>
>>> On the other hand, if the 503 RFC is understood to *not make guarantees* about whether the backend did or did not start processing contents, then we're talking about "don't trust your assumptions about server state after receiving a 503".
>>>
>>> Based on just a cursory search from different service domains, it looks like 503 does not generally provide guarantees about no processing. For example:
>>>
>>> 1. At least one managed service observes the Istio layer returning 503s during a maintenance event when the underlying k8s application did not yet produce a response at all
>>> 2. https://www.hostwinds.com/blog/503-error-common-causes-and-how-to-fix-it lists "3. Backend Fetch Failed", "4. Connection Timeout", and "7. Out of Server Resources" as 503 causes, all of which are semantically closer to a 502 than a 429 (where 429 is more similar to the listed "Server Overloaded" or "Server Under Maintenance" items)
>>> 3. https://support.google.com/business/thread/10553916/started-getting-this-error-today-googlejsonresponseexception-503-service-unavailable?hl=en shows some APIs explicitly stating 503 as a "partially processed" state:
>>>
>>> "code" : 503,
>>> "errors" : [ {
>>>   "domain" : "global",
>>>   "message" : "The service encountered a transient error. Part of the request may have succeeeded. Refer to details for which fields failed.",
>>>   "reason" : "backendError"
>>> } ],
>>>
>>> My two cents would be to go ahead with consistent handling of 500, 502, 503, and 504 at least.
>>>
>>> 501 and 505 might be universally understood to be fast-rejections, but I don't feel strongly about those.
>>>
>>> On Wed, Jun 18, 2025 at 6:03 PM Prashant Singh <prashant010...@gmail.com> wrote:
>>>
>>>> EDIT: corrected some typos:
>>>>
>>>> Hey Ryan,
>>>> My reasoning was the following: we already retry on 502 / 504 here <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/ExponentialHttpRequestRetryStrategy.java#L86>, we have a spec definition here <https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L1070> stating that the commit state can be unknown, and so says the RFC <https://datatracker.ietf.org/doc/html/rfc9110#name-502-bad-gateway>, since something went wrong in the middle of processing the request. So when we retry on 502 and 504, we can conflict with ourselves: when we receive a 429 / 409, we don't know whether it is because we retried after a 502 and then got the 429 / 409, or because something else happened. So I thought it's better to throw CommitStateUnknown if we know the commit has been retried.
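To illustrate the scenario described above, here is a rough sketch under stated assumptions: if a 409 arrives only after the client has already retried a 5xx, it may be the commit conflicting with its own earlier attempt, so it is surfaced as commit-state-unknown rather than a clean failure. All names below (CommitRetrySketch, Transport, sendOnce) are hypothetical; this is not the actual ExponentialHttpRequestRetryStrategy or ErrorHandlers code.

// Sketch of the decision rule only; the merged fix may differ in detail.
import java.util.ArrayList;
import java.util.List;

public class CommitRetrySketch {

  /** Hypothetical transport: performs one HTTP attempt and returns the status code. */
  public interface Transport {
    int sendOnce(String commitRequest);
  }

  public static String commit(Transport transport, String request, int maxRetries) {
    List<Integer> statuses = new ArrayList<>();
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      // Did any earlier attempt fail with a server error before this one?
      boolean retriedAfterServerError = statuses.stream().anyMatch(s -> s >= 500);
      int status = transport.sendOnce(request);
      statuses.add(status);

      if (status == 200) {
        return "committed";
      }
      if (status == 409) {
        // A 409 right after a 5xx retry may be a self-conflict: the earlier attempt could
        // have been applied even though the gateway reported 502/503/504.
        return retriedAfterServerError ? "commit state unknown" : "commit failed";
      }
      if (status != 429 && status < 500) {
        return "commit failed"; // other definitive client errors, safe to clean up
      }
      // 429 and 5xx fall through to another attempt in this illustration.
    }
    return "commit state unknown"; // retries exhausted without a definitive answer
  }
}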
>>>>
>>>> The Iceberg user issue that accompanied this involved a 504: slack thread - https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1747992294134219
>>>>
>>>> We have similar handling for Glue, as we did here https://github.com/apache/iceberg/pull/7198, since there are internal HTTP retries inside the Glue SDK.
>>>>
>>>> And hence we did it this way. I do agree that 429 / 503 is unnecessarily being wrapped and that we can miss the opportunity to clean up, but it doesn't hurt much; the alternative is to capture all the status codes we got and inspect their permutations and combinations, which I wanted to avoid. Plus, I admit there was a bug in our service, which we are chasing and fixing as well, where it throws 503 instead of 502. But even with that fixed, the issue will still exist due to retries if they are not handled properly.
>>>>
>>>> That was my thought process while doing this, and I'm happy to incorporate your feedback:
>>>>
>>>> Maybe another alternative is that we take out 502 / 504 from the retry code, retry only on 429 / 503, and clean up only when 409 or 503 or 501 is the final status code? The rest are all unknown.
>>>>
>>>> Best,
>>>> Prashant Singh
>>>>
>>>> On Wed, Jun 18, 2025 at 4:21 PM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>
>>>>> Hey Ryan,
>>>>> My reasoning was the following: we retry on 502 / 504 as well here <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/ExponentialHttpRequestRetryStrategy.java#L86>, we have a spec definition stating here <https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L1070> that it can be unknown, so says the RFC <https://datatracker.ietf.org/doc/html/rfc9110#name-502-bad-gateway> as something went wrong in the middle of processing the request. So we retry with 502 and 504 we can conflict with ourselves, as we don't know when we receive the status of 429 is it because we retried on 502 and then got 429 or something else happened, so I thought it's better to throw the commit state unknown if we know the commit has been retried.
>>>>>
>>>>> The Iceberg users who accompanied this was with 504 : slack thread - https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1747992294134219
>>>>>
>>>>> We have similar handling for glue as we did here https://github.com/apache/iceberg/pull/7198 as there are internal http retries over the glue SDK.
>>>>>
>>>>> and hence we did it this way. I do agree 429 / 503 is unnecessary being wrapped and we can miss the opportunity to clean, but it doesn't hurt as alternative is we can capture all the codes we got and inspect the permutation and combination of the status code, just wanted to avoid that, plus i admit there was a bug in our service which we are chasing and fixing as well where its throwing 503 instead of 502.... but the thing is even with that the problem issue will still exist due to retries if they are not handled properly.
>>>>>
>>>>> That was the thought process of mine while doing that, and happy to incorporate your feedback :
>>>>>
>>>>> Maybe another alternative is that we take out 502 / 504 from the retry code and just retry on 429 / 503 and clean up only when 409 or 503 or 501 are the final status code ?
>>>>> rest all are unknown
>>>>>
>>>>> Best,
>>>>> Prashant Singh
>>>>>
>>>>> On Wed, Jun 18, 2025 at 12:22 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>>
>>>>>> Are we confident that the REST client fix is the right path?
>>>>>>
>>>>>> If I understand correctly, the issue is that there is a load balancer that is converting a 500 response to a 503 that the client will automatically retry, even though the 500 response is not retry-able. Then the retry results in a 409 because the commit conflicts with itself.
>>>>>>
>>>>>> I'm fairly concerned about a fix like this that is operating on the assumption that we cannot trust the 5xx error code that was sent to the client. If we don't trust 5xx error codes, then wouldn't it make the most sense to throw CommitStateUnknown for any 5xx response?
>>>>>>
>>>>>> On Tue, Jun 17, 2025 at 1:00 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> Sorry I didn't get to reply here. I think the fix Ajantha is contributing is extremely important, but probably out of scope for a patch release. Because it will take a bit of manual intervention to fix after jumping to the next version, I think we should save this for 1.10.0, which should also come out pretty soon.
>>>>>>>
>>>>>>> On Tue, Jun 17, 2025 at 1:16 PM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thank you for confirming over Slack, Ajantha. I also double-checked with Russell offline; this seems to be a behaviour change which can't go in a patch release, so maybe 1.10 then. So I think we should be good for now. That being said, I will start working on getting an RC out with just this commit <https://github.com/singhpk234/iceberg/commit/14ba08e672df5061f8ac0978ea81f10535da2246> on top of 1.9.1, and will start a thread for that accordingly!
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Prashant Singh
>>>>>>>>
>>>>>>>> On Tue, Jun 17, 2025 at 8:44 AM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hey Ajantha,
>>>>>>>>>
>>>>>>>>> I was going to wait for 2-3 days before cutting an RC anyway :), to see if anyone has an objection or some more *critical* changes to get in. Thank you for the heads-up!
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Prashant Singh
>>>>>>>>>
>>>>>>>>> On Mon, Jun 16, 2025 at 11:02 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I have a PR that needs to be included for 1.9.2 as well!
>>>>>>>>>>
>>>>>>>>>> I was about to start a release discussion for 1.9.2 tomorrow.
>>>>>>>>>>
>>>>>>>>>> It stems from an oversight during the partition stats refactoring (in 1.9.0): we missed that field ids are tracked in the spec, so the schema field ids are not as per the partition stats spec. The solution is to implicitly recompute stats when old stats with invalid field ids are detected. Including the fix in 1.9.2 can reduce the issue's impact, as new stats will have valid schema ids and corrupted stats will also be handled. The feature is not widely integrated yet and engine integrations are in progress for 1.10.0, but we still want to reduce the impact of the issue. I have chatted with Peter and Russell about this.
>>>>>>>>>>
>>>>>>>>>> PR will be shared tomorrow.
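As a rough illustration of that mitigation only: stats whose schema field ids do not match the expected spec ids are recomputed instead of being trusted. All names and field id values below are hypothetical assumptions; this is not the actual Iceberg partition stats code.

// Sketch under assumptions: StatsFile, EXPECTED_FIELD_IDS, and recompute() are invented
// for illustration and do not reflect the real partition stats field id layout.
import java.util.List;

public class PartitionStatsRepairSketch {

  /** Hypothetical view of a stats file: just the field ids of its schema, in order. */
  public record StatsFile(List<Integer> schemaFieldIds) {}

  /** Field ids this sketch assumes the spec mandates (illustrative values only). */
  private static final List<Integer> EXPECTED_FIELD_IDS = List.of(1, 2, 3, 4, 5);

  public static StatsFile readOrRecompute(StatsFile existing) {
    if (existing != null && EXPECTED_FIELD_IDS.equals(existing.schemaFieldIds())) {
      return existing; // stats were written with spec-compliant field ids
    }
    // Old stats with invalid field ids: fall back to recomputing rather than serving
    // potentially corrupted stats.
    return recompute();
  }

  private static StatsFile recompute() {
    // Placeholder: a real implementation would rescan the table and rewrite the stats file.
    return new StatsFile(EXPECTED_FIELD_IDS);
  }
}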
>>>>>>>>>>
>>>>>>>>>> Just giving a heads-up to wait a day or two for the release cut.
>>>>>>>>>>
>>>>>>>>>> - Ajantha
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 16, 2025 at 11:25 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 for a 1.9.2 release
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jun 16, 2025 at 10:53 AM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Kevin,
>>>>>>>>>>>> This goes back well before 1.8; if you look at the issue that my PR refers to, it was reported on Iceberg 1.7, and it has been there since the beginning of the IRC client.
>>>>>>>>>>>> We were having similar debates on whether we should patch all the releases, but I think that requires a wider discussion; 1.9.2 sounds like the immediate next step!
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Prashant Singh
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jun 16, 2025 at 10:42 AM Kevin Liu <kevinjq...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Prashant,
>>>>>>>>>>>>>
>>>>>>>>>>>>> This sounds like a good reason to do a patch release. I'm +1.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do you know if this is also affecting other minor versions? Do we also need to patch 1.8.x?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Kevin Liu
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jun 16, 2025 at 10:30 AM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A couple of users have recently reported a *critical table corruption issue* when using the Iceberg REST catalog client. This issue has existed from the beginning and can result in *committed snapshot data being incorrectly deleted*, effectively rendering the table unusable (i.e., corrupted). The root cause is *self-conflict during internal HTTP client retries*.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> More context can be found here:
>>>>>>>>>>>>>> [1] https://github.com/apache/iceberg/pull/12818#issue-3000325526
>>>>>>>>>>>>>> [2] https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1747992294134219
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We’ve seen this issue hit by users in the wider community and also internally on our end. I’ve submitted a fix here (now merged): https://github.com/apache/iceberg/pull/12818
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Given the *severity* and the *broad version impact*, I’d like to propose cutting a *1.9.2 patch release* to include this fix.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm also happy to *volunteer as the release manager* for this release.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Looking forward to hearing the community's thoughts.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>> Prashant Singh
>>>>>>>>>>>>>