ecerulm commented on issue #21087:
URL: https://github.com/apache/airflow/issues/21087#issuecomment-1121560242
> 410 does not happen due to network issues or connectivity issue between
> airflow and kubeapi server.
I have reproduced this locally with minikube and the kubernetes python client, so
I can assure you it CAN happen due to network issues between airflow and the
k8s api. Let me explain the setup:
1. Minikube locally
2. Python script that performs a watch in a while loop just like airflow does
3. Python script connects to minikube via toxiproxy so that I can simulate a
network disconnection (context: minikube2)
4. As soon as I simulate the disconnection, the watch will exit with an
exception `("Connection broken: InvalidChunkLength(got length b'', 0 bytes
read)", InvalidChunkLength(got length b'', 0 bytes read))`
5. The python script remembers the last resource version just like airflow
does
6. The python script continuously retries the watch with the last known
resource_version=5659, failing each time with
`HTTPSConnectionPool(host='127.0.0.1', port=2222): Max retries exceeded with
url: /api/v1/namespaces/testns2/pods?resourceVersion=5659&watch=True (Caused by
NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x107b02c40>:
Failed to establish a new connection: [Errno 61] Connection refused'))`
7. In another window, I create/delete deployments in that namespace (via
context minikube, which does not go through the toxiproxy). I do this to create
new events.
8. After 6 minutes, I re-enable the toxiproxy.
9. The python script retries the watch with resource_version=5659; it
connects and I get a `(410) Reason: Expired: too old resource version: 5659
(6859)`
> it happens when no event of type ADDED, MODIFIED, DELETED happens on the
> watched resource for a long time [ ~ 5 mins]
I've been running a watch with the kubernetes python client on a namespace
where there are no new events at all for 1 hour, and I did not get an
ApiException(410). So, are you sure of this? Have you ever seen this yourself
in your kubernetes environment?
> @ecerulm please go through the docs. it will help you understand why 410
> occurs and how BOOKMARK will prevent it.

> it would really help the conversation if you would take the time to go
> through the KEP and other docs.
I did read all the documents, and I think I understand them well enough. I have
also actually tested this, and I try to back up what I say by doing it.
I think you mean something else by "prevent".
I hope the scenario I included in this comment will help you understand why
410 occurs in the event of network issues and how BOOKMARK **can't prevent**
that. In principle, BOOKMARK will help to get a better "last known resource
version" at step 5, but by the time step 8 is reached that resource version
won't be valid (if enough time has passed). And this is not theory; it's
something that you can actually test and reproduce yourself, like I did.
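
To make the argument in steps 5 to 9 concrete, here is a minimal toy model in plain Python. It is not the kubernetes client and not how the API server is implemented: `ToyApiServer`, `Expired`, `reconnect_after_outage` and the `history=1000` window are all hypothetical names and numbers chosen for illustration. The model assumes that only a bounded window of recent resource versions can be resumed from; the real server's expiry rules are more involved. It gives the client the best case a BOOKMARK could offer, namely a remembered version equal to the server's latest version at the moment of disconnection.

```python
class Expired(Exception):
    """Stand-in for the API server's HTTP 410 'too old resource version'."""


class ToyApiServer:
    """Toy model (an assumption): only the newest `history` versions stay watchable."""

    def __init__(self, latest, history):
        self.latest = latest      # newest resource version issued so far
        self.history = history    # how many versions back a watch may resume from

    def write(self, count):
        """Other clients create/delete objects, advancing the resource version."""
        self.latest += count

    def watch(self, resource_version):
        """Return the versions newer than resource_version, or raise Expired."""
        oldest_resumable = self.latest - self.history
        if resource_version < oldest_resumable:
            raise Expired(f"too old resource version: {resource_version}")
        return list(range(resource_version + 1, self.latest + 1))


def reconnect_after_outage(writes_during_outage, history=1000):
    """Resume a watch from the version remembered before the outage."""
    server = ToyApiServer(latest=5659, history=history)
    # Best case for the client: a BOOKMARK just before the disconnect means
    # the remembered version equals the server's latest at that moment.
    last_known = server.latest
    # While disconnected, neither events nor bookmarks reach the client.
    server.write(writes_during_outage)
    try:
        return len(server.watch(last_known))
    except Expired:
        return "410"


print(reconnect_after_outage(200))   # short outage: the watch resumes
print(reconnect_after_outage(1200))  # long outage: the version has expired
```

In this sketch the bookmark only improves `last_known` while the connection is up. Once the outage outlasts the resumable window, the retry with the remembered version fails no matter how fresh that version was when the connection dropped.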