meetri opened a new issue, #24538: URL: https://github.com/apache/airflow/issues/24538
### Apache Airflow version

2.3.2 (latest released)

### What happened

The scheduler crashes with the following exception. Once the scheduler has crashed, restarting it causes it to crash again immediately. To get the scheduler working again, all DAGs must be paused and every running task must have its state changed to up for retry. We only started noticing this after switching to the CeleryKubernetesExecutor.

```
[2022-06-16 20:12:04,535] {scheduler_job.py:1350} WARNING - Failing (3) jobs without heartbeat after 2022-06-16 20:07:04.512590+00:00
[2022-06-16 20:12:04,535] {scheduler_job.py:1358} ERROR - Detected zombie job: {'full_filepath': '/airflow-efs/dags/Scanner.py', 'msg': 'Detected <TaskInstance: lmnop-domain-scanner.Macadocious manual__2022-06-16T02:27:36.281445+00:00 [running]> as zombie', 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7f96de2fc890>, 'is_failure_callback': True}
[2022-06-16 20:12:04,537] {scheduler_job.py:756} ERROR - Exception when executing SchedulerJob._run_scheduler_loop
Traceback (most recent call last):
  File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 739, in _execute
    self._run_scheduler_loop()
  File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 839, in _run_scheduler_loop
    next_event = timers.run(blocking=False)
  File "/usr/local/lib/python3.7/sched.py", line 151, in run
    action(*argument, **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/utils/event_scheduler.py", line 36, in repeat
    action(*args, **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/utils/session.py", line 71, in wrapper
    return func(*args, session=session, **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 1359, in _find_zombies
    self.executor.send_callback(request)
  File "/pyroot/lib/python3.7/site-packages/airflow/executors/celery_kubernetes_executor.py", line 218, in send_callback
    self.callback_sink.send(request)
  File "/pyroot/lib/python3.7/site-packages/airflow/utils/session.py", line 71, in wrapper
    return func(*args, session=session, **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/callbacks/database_callback_sink.py", line 34, in send
    db_callback = DbCallbackRequest(callback=callback, priority_weight=10)
  File "<string>", line 4, in __init__
  File "/pyroot/lib/python3.7/site-packages/sqlalchemy/orm/state.py", line 437, in _initialize_instance
    manager.dispatch.init_failure(self, args, kwargs)
  File "/pyroot/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 72, in __exit__
    with_traceback=exc_tb,
  File "/pyroot/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/pyroot/lib/python3.7/site-packages/sqlalchemy/orm/state.py", line 434, in _initialize_instance
    return manager.original_init(*mixed[1:], **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/models/db_callback_request.py", line 44, in __init__
    self.callback_data = callback.to_json()
  File "/pyroot/lib/python3.7/site-packages/airflow/callbacks/callback_requests.py", line 79, in to_json
    return json.dumps(dict_obj)
  File "/usr/local/lib/python3.7/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/local/lib/python3.7/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/lib/python3.7/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/local/lib/python3.7/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type datetime is not JSON serializable
[2022-06-16 20:12:04,573] {kubernetes_executor.py:813} INFO - Shutting down Kubernetes executor
[2022-06-16 20:12:04,574] {kubernetes_executor.py:773} WARNING - Executor shutting down, will NOT run task=(TaskInstanceKey(dag_id='lmnop-processor', task_id='launch-xyz-pod', run_id='manual__2022-06-16T19:53:04.707461+00:00', try_number=1, map_index=-1), ['airflow', 'tasks', 'run', 'lmnop-processor', 'launch-xyz-pod', 'manual__2022-06-16T19:53:04.707461+00:00', '--local', '--subdir', 'DAGS_FOLDER/lmnop.py'], None, None)
[2022-06-16 20:12:04,574] {kubernetes_executor.py:773} WARNING - Executor shutting down, will NOT run task=(TaskInstanceKey(dag_id='lmnop-processor', task_id='launch-xyz-pod', run_id='manual__2022-06-16T19:53:04.831929+00:00', try_number=1, map_index=-1), ['airflow', 'tasks', 'run', 'lmnop-processor', 'launch-xyz-pod', 'manual__2022-06-16T19:53:04.831929+00:00', '--local', '--subdir', 'DAGS_FOLDER/lmnop.py'], None, None)
[2022-06-16 20:12:04,601] {scheduler_job.py:768} INFO - Exited execute loop
Traceback (most recent call last):
  File "/pyroot/bin/airflow", line 8, in <module>
    sys.exit(main())
  File "/pyroot/lib/python3.7/site-packages/airflow/__main__.py", line 38, in main
    args.func(args)
  File "/pyroot/lib/python3.7/site-packages/airflow/cli/cli_parser.py", line 51, in command
    return func(*args, **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/utils/cli.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/cli/commands/scheduler_command.py", line 75, in scheduler
    _run_scheduler_job(args=args)
  File "/pyroot/lib/python3.7/site-packages/airflow/cli/commands/scheduler_command.py", line 46, in _run_scheduler_job
    job.run()
  File "/pyroot/lib/python3.7/site-packages/airflow/jobs/base_job.py", line 244, in run
    self._execute()
  File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 739, in _execute
    self._run_scheduler_loop()
  File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 839, in _run_scheduler_loop
    next_event = timers.run(blocking=False)
  File "/usr/local/lib/python3.7/sched.py", line 151, in run
    action(*argument, **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/utils/event_scheduler.py", line 36, in repeat
    action(*args, **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/utils/session.py", line 71, in wrapper
    return func(*args, session=session, **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 1359, in _find_zombies
    self.executor.send_callback(request)
  File "/pyroot/lib/python3.7/site-packages/airflow/executors/celery_kubernetes_executor.py", line 218, in send_callback
    self.callback_sink.send(request)
  File "/pyroot/lib/python3.7/site-packages/airflow/utils/session.py", line 71, in wrapper
    return func(*args, session=session, **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/callbacks/database_callback_sink.py", line 34, in send
    db_callback = DbCallbackRequest(callback=callback, priority_weight=10)
  File "<string>", line 4, in __init__
  File "/pyroot/lib/python3.7/site-packages/sqlalchemy/orm/state.py", line 437, in _initialize_instance
    manager.dispatch.init_failure(self, args, kwargs)
  File "/pyroot/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 72, in __exit__
    with_traceback=exc_tb,
  File "/pyroot/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/pyroot/lib/python3.7/site-packages/sqlalchemy/orm/state.py", line 434, in _initialize_instance
    return manager.original_init(*mixed[1:], **kwargs)
  File "/pyroot/lib/python3.7/site-packages/airflow/models/db_callback_request.py", line 44, in __init__
    self.callback_data = callback.to_json()
  File "/pyroot/lib/python3.7/site-packages/airflow/callbacks/callback_requests.py", line 79, in to_json
    return json.dumps(dict_obj)
  File "/usr/local/lib/python3.7/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/local/lib/python3.7/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/lib/python3.7/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/local/lib/python3.7/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type datetime is not JSON serializable
```

### What you think should happen instead

The error itself seems minor: it should not happen and looks easy to fix. The bigger issue is that the scheduler could not recover on its own and was stuck in an endless restart loop.

### How to reproduce

I'm not sure of the simplest step-by-step way to reproduce this. My Airflow deployment had about 4 active DAGs running, each with about 50 max active runs and a concurrency of 50, plus one DAG set to 150 max active runs and a concurrency of 50 (not really that much). The DAG with 150 max active runs uses the KubernetesExecutor to create a pod in the local Kubernetes environment. I think this is why we are suddenly seeing the issue. Hopefully this helps in reproducing it.

### Operating System

Debian GNU/Linux 10 (buster)

### Versions of Apache Airflow Providers

apache-airflow-providers-amazon==3.4.0
apache-airflow-providers-celery==2.1.4
apache-airflow-providers-cncf-kubernetes==4.0.2
apache-airflow-providers-ftp==2.1.2
apache-airflow-providers-http==2.1.2
apache-airflow-providers-imap==2.2.3
apache-airflow-providers-postgres==4.1.0
apache-airflow-providers-redis==2.0.4
apache-airflow-providers-sqlite==2.1.3

### Deployment

Other Docker-based deployment

### Deployment details

We build our own Airflow base images using the instructions provided on your site. Here is a snippet of the code we use to install Airflow:

```
RUN pip3 install "apache-airflow[statsd,aws,kubernetes,celery,redis,postgres,sentry]==${AIRFLOW_VERSION}" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-$AIRFLOW_VERSION/constraints-$PYTHON_VERSION.txt"
```

We then use this Docker image for all of our Airflow workers, the scheduler, the DAG processor and the Airflow web server. This is managed through a custom Helm script.
We also use PgBouncer to manage DB connections, similar to the publicly available Helm charts.

### Anything else

The problem occurs quite frequently and makes the system completely unusable.

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
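
For reference, the `TypeError` at the bottom of the traceback can be reproduced outside Airflow. The sketch below is not Airflow code: `payload`, its `start_date` key, `encode_datetime` and `dumps_plain` are hypothetical names chosen for illustration, and the report does not say which field actually held the `datetime`. It only shows that `json.dumps` rejects a `datetime` value unless a `default` hook converts it, which is one possible direction for a fix in `to_json`.

```python
import json
from datetime import datetime, timezone

# Hypothetical stand-in for the dict that to_json() serializes; the real
# callback request fields are not shown in this report.
payload = {
    "full_filepath": "/airflow-efs/dags/Scanner.py",
    "start_date": datetime(2022, 6, 16, 2, 27, 36, 281445, tzinfo=timezone.utc),
}


def encode_datetime(obj):
    """Fallback hook for json.dumps: render datetimes as ISO 8601 strings."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    # Anything else is still unsupported, so keep the standard error.
    raise TypeError(
        f"Object of type {obj.__class__.__name__} is not JSON serializable"
    )


def dumps_plain(data):
    """Mimic the failing call; return the error message, or None on success."""
    try:
        json.dumps(data)
    except TypeError as exc:
        return str(exc)
    return None


# Without a default hook the datetime value makes json.dumps raise.
error_message = dumps_plain(payload)

# With the hook, the same payload serializes and round-trips as a string.
encoded = json.dumps(payload, default=encode_datetime)

print(error_message)
print(encoded)
```

Converting to an ISO 8601 string keeps the timezone offset, so the value can be parsed back with `datetime.fromisoformat` on the reading side. Whether that is the right fix for Airflow depends on how the callback data is deserialized, which this report does not cover.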