Nataneljpwd commented on PR #61110: URL: https://github.com/apache/airflow/pull/61110#issuecomment-3833405080
Hello, thank you for the comment!

> @Nataneljpwd
>
> I still have two blocking concerns here:
>
> 1. I don’t think the quorum-read argument holds. Not setting `resourceVersion` may require a quorum read, but that only guarantees freshness of what the API server returns. It doesn’t mean we’re reading directly from etcd, and more importantly it doesn’t enforce any stronger invariants (like uniqueness or lifecycle ordering).

This is a valid concern. However, per the Kubernetes API documentation, not setting a `resourceVersion` [_requires_](https://kubernetes.io/docs/reference/using-api/api-concepts/#:~:text=Unless%20you%20have,to%20be%20served.) a quorum read to be served, which means the API server validates against etcd before returning a result, so it achieves the same outcome. Regarding the invariants, I understand the concern; I have explained in a code comment why the ordering does not matter here and why we will not get more than one result.

> 2. Even if the API response reflects the latest etcd state, Kubernetes does not guarantee that there can’t be multiple pods with the same labels in `Pending` or `Running` at the same time. I understand the SparkApplication controller intends there to be a single driver, but that guarantee is eventual. During restarts or reconciliation, overlapping driver pods are possible and expected. Your filter for deletion_timestamp partially mitigates against this by excluding terminating pods, but I suspect there can still be windows where the replacement pod is created before the deletion for the existing pod is triggered.

Yes, but the Spark Kubernetes operator, which handles the Spark CRD, ensures there is only one driver up at any given time, so that transition does not happen: before a new driver pod is created, the old pods are deleted, as described [here](https://www.kubeflow.org/docs/components/spark-operator/overview/#:~:text=When%20the%20operator%20decides%20to,create%20a%20new%20driver%20pod.).

> Because of that, removing the existing phase-based prioritization is risky. I see that you are now prioritizing 'Succeeded' pods but since you have removed the ordering for `Running` and `Pending` pods, this does not mitigate the concerns I explained above.

Don't we want to prioritize `Succeeded` pods? There is no case in which a driver for an application is still running while a pod for it is already in the `Succeeded` state. I will add back the creation time check; it is something I most likely deleted by accident.

> I would advise you to implement something like this:
>
> ```
> pod = max(
>     pod_list,
>     key=lambda p: (
>         p.metadata.deletion_timestamp is None,
>         p.status.phase == PodPhase.RUNNING,
>         p.status.phase == PodPhase.SUCCEEDED,
>         p.metadata.creation_timestamp or datetime.min.replace(tzinfo=timezone.utc),
>         p.metadata.name or "",
>     ),
> )
> ```
>
> This handles the edge cases revealed by this PR without accidentally returning multiple pods. I would advise removing the field selector as well because we are already filtering in the lambda expression.

We can leave the field selector in place: we do not need to implicitly choose `Running` pods, because a `Running` pod can only occur here if it is terminating, meaning its deletion timestamp is not None. For the same reason I would not add the `RUNNING` preference, as it could make us select a terminating pod before a newly created one.
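For concreteness, here is a minimal sketch of the selection I am describing: essentially the key you suggested with the `RUNNING` term dropped and the field selector kept. It assumes the `kubernetes` Python client and the provider's `PodPhase` constants; the function name, the import path, and the exact field selector string are illustrative, not the code in this PR.

```python
from datetime import datetime, timezone
from typing import Optional

from kubernetes import client

# Assumed import path for the phase constants; the PR may reference them differently.
from airflow.providers.cncf.kubernetes.utils.pod_manager import PodPhase


def find_driver_pod(
    core_v1: client.CoreV1Api, namespace: str, label_selector: str
) -> Optional[client.V1Pod]:
    """Sketch: select the driver pod, keeping the field selector and no RUNNING preference."""
    # Omitting resource_version forces a quorum read, so the list reflects the latest etcd state.
    pod_list = core_v1.list_namespaced_pod(
        namespace=namespace,
        label_selector=label_selector,
        # Illustrative field selector: keep only non-terminated phases.
        field_selector="status.phase!=Failed,status.phase!=Unknown",
    ).items
    if not pod_list:
        return None
    return max(
        pod_list,
        key=lambda p: (
            p.metadata.deletion_timestamp is None,  # prefer pods that are not terminating
            p.status.phase == PodPhase.SUCCEEDED,  # a Succeeded driver always wins
            p.metadata.creation_timestamp
            or datetime.min.replace(tzinfo=timezone.utc),  # then the newest pod
            p.metadata.name or "",  # stable tie-breaker
        ),
    )
```

The key compares left to right, so a non-terminating pod always beats a terminating one, a `Succeeded` pod beats a `Pending` one, and only then does recency decide, which avoids preferring a terminating `Running` pod over its freshly created replacement.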
