Hi all! I am using Airflow to schedule my Spark jobs on a Kubernetes cluster.
However, Kubernetes often throws a 'too old resource version' exception, which interrupts the Spark watcher. Airflow then loses the log stream and can never retrieve the 'Exit Code', so it marks the job as failed even though the job is actually still running. Is there any way to avoid this and make sure Airflow always gets the correct status of a job? Any suggestions? I have created an issue about this: https://github.com/apache/airflow/issues/8963.

In the meantime, I am thinking about a simple retry mechanism: when the log stream is interrupted, Airflow tries to get the 'Exit Code' by running 'kubectl describe pod xxxx-driver'. Here is the pull request: https://github.com/apache/airflow/pull/8964. Comments are welcome.

Thanks,
Dylan
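P.S. To make the fallback idea more concrete, here is a rough sketch of what the retry could look like. This is not the code from the PR; the function names are mine, and it assumes the pod status JSON shape that 'kubectl get pod <name> -o json' returns (parsing the structured JSON is easier than scraping 'kubectl describe' output):

```python
import json
import subprocess

def extract_exit_code(pod):
    """Pull the terminated exit code out of a pod dict shaped like the
    output of `kubectl get pod <name> -o json`. Returns None while the
    container is still running or no status is available yet."""
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = cs.get("state", {}).get("terminated")
        if terminated is not None:
            return terminated.get("exitCode")
    return None

def get_driver_exit_code(pod_name, namespace="default"):
    """Fallback when the watch stream dies with 'too old resource version':
    ask the API server directly via kubectl instead of relying on the
    interrupted watcher."""
    out = subprocess.check_output(
        ["kubectl", "get", "pod", pod_name, "-n", namespace, "-o", "json"]
    )
    return extract_exit_code(json.loads(out))
```

The idea would be to poll this after the watcher drops, instead of failing the task immediately.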