Hi all,

I am using Airflow to schedule my Spark jobs on a Kubernetes cluster.

But for some reason, Kubernetes often throws a 'too old resource version'
exception, which interrupts the Spark watcher. Airflow then loses the log
stream and never receives the 'Exit Code', so it marks the job as failed
as soon as the log stream is lost, even though the job is still running.
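
For anyone not familiar with the error, below is a rough, hedged
illustration (not the actual Airflow/Spark watcher code) of how the
'too old resource version' error surfaces when watching pods with the
official kubernetes Python client, and one common way a watch can be
resumed after it. The namespace and label selector are just placeholders.

from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

resource_version = None
while True:
    try:
        stream = watch.Watch().stream(
            v1.list_namespaced_pod,
            namespace="default",                 # placeholder namespace
            label_selector="spark-role=driver",  # placeholder selector for the driver pod
            resource_version=resource_version,
        )
        for event in stream:
            pod = event["object"]
            # remember the latest resource version seen so far
            resource_version = pod.metadata.resource_version
            print(event["type"], pod.metadata.name, pod.status.phase)
    except ApiException as e:
        if e.status == 410:  # HTTP 410 Gone: 'too old resource version'
            # the stored version is stale; drop it and re-list before watching again
            resource_version = None
            continue
        raise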

Is there any way to avoid this and make sure Airflow can always get the
right status of the jobs?
Any suggestions?

I created an issue [https://github.com/apache/airflow/issues/8963] about this.
In the meantime, I am thinking about a simple retry mechanism: when the log
stream is interrupted, Airflow tries to get the 'Exit Code' via the
'kubectl describe pod xxxx-driver' command. Here is the pull request [
https://github.com/apache/airflow/pull/8964]; comments are welcome.
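
To make the idea concrete, here is a minimal sketch of the fallback
(not the code in the PR): instead of shelling out to kubectl, it reads the
driver pod's status with the kubernetes Python client and pulls the exit
code from the terminated container state. The pod name and namespace are
assumptions for illustration.

from kubernetes import client, config

def get_driver_exit_code(pod_name, namespace="default"):
    # pod_name / namespace are placeholders for the Spark driver pod details
    config.load_kube_config()  # or load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()
    pod = v1.read_namespaced_pod(name=pod_name, namespace=namespace)
    for cs in pod.status.container_statuses or []:
        if cs.state.terminated is not None:
            return cs.state.terminated.exit_code
    return None  # container still running; the caller should retry / keep polling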

Thanks,
Dylan
