cesar-vermeulen opened a new issue, #35841:
URL: https://github.com/apache/airflow/issues/35841

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### What happened
   
   We have a task with a hard restriction that it must not run more than 
once. However, when the Airflow scheduler crashes for whatever reason while a 
task is running, the task appears to be retried once the scheduler recovers, 
even though the first attempt succeeded:
   
   
   **SCHEDULER LOGS**
   
   ```
   2023-11-24T04:20:05.554652890Z {"asctime": "2023-11-24T05:20:05.554+0100", 
"filename": "scheduler_job_runner.py", "lineno": 248, "levelname": "INFO", 
"message": "Exiting gracefully upon receiving signal 15"}
   
   
   2023-11-24T04:20:06.801855078Z {"asctime": "2023-11-24T05:20:06.795+0100", 
"filename": "scheduler_job_runner.py", "lineno": 862, "levelname": "ERROR", 
"message": "Exception when executing SchedulerJob._run_scheduler_loop"}
   2023-11-24T04:20:06.801866656Z Traceback (most recent call last):
   2023-11-24T04:20:06.801869902Z   File 
"/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py",
 line 385, in sync
   2023-11-24T04:20:06.801872272Z     self.kube_scheduler.run_next(task)
   2023-11-24T04:20:06.801875214Z   File 
"/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py",
 line 406, in run_next
   2023-11-24T04:20:06.801877950Z     self.run_pod_async(pod, 
**self.kube_config.kube_client_request_args)
   2023-11-24T04:20:06.801880748Z   File 
"/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py",
 line 311, in run_pod_async
   2023-11-24T04:20:06.801883294Z     resp = 
self.kube_client.create_namespaced_pod(
   2023-11-24T04:20:06.801885724Z            
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   2023-11-24T04:20:06.801888109Z   File 
"/home/airflow/.local/lib/python3.11/site-packages/kubernetes/client/api/core_v1_api.py",
 line 7356, in create_namespaced_pod
   2023-11-24T04:20:06.801890595Z     return 
self.create_namespaced_pod_with_http_info(namespace, body, **kwargs)  # noqa: 
E501
   2023-11-24T04:20:06.801892953Z            
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   2023-11-24T04:20:06.801895349Z   File 
"/home/airflow/.local/lib/python3.11/site-packages/kubernetes/client/api/core_v1_api.py",
 line 7455, in create_namespaced_pod_with_http_info
   2023-11-24T04:20:06.801897690Z     return self.api_client.call_api(
   2023-11-24T04:20:06.801900014Z            ^^^^^^^^^^^^^^^^^^^^^^^^^
   2023-11-24T04:20:06.801902925Z   File 
"/home/airflow/.local/lib/python3.11/site-packages/kubernetes/client/api_client.py",
 line 348, in call_api
   2023-11-24T04:20:06.801905310Z     return self.__call_api(resource_path, 
method,
   2023-11-24T04:20:06.801907565Z            
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   2023-11-24T04:20:06.801910120Z   File 
"/home/airflow/.local/lib/python3.11/site-packages/kubernetes/client/api_client.py",
 line 180, in __call_api
   2023-11-24T04:20:06.801912995Z     response_data = self.request(
   2023-11-24T04:20:06.801916095Z                     ^^^^^^^^^^^^^
   2023-11-24T04:20:06.801919748Z   File 
"/home/airflow/.local/lib/python3.11/site-packages/kubernetes/client/api_client.py",
 line 391, in request
   2023-11-24T04:20:06.801923433Z     return self.rest_client.POST(url,
   2023-11-24T04:20:06.801926644Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^
   2023-11-24T04:20:06.801930019Z   File 
"/home/airflow/.local/lib/python3.11/site-packages/kubernetes/client/rest.py", 
line 275, in POST
   2023-11-24T04:20:06.801933241Z     return self.request("POST", url,
   2023-11-24T04:20:06.801936265Z            ^^^^^^^^^^^^^^^^^^^^^^^^^
   2023-11-24T04:20:06.801939589Z   File 
"/home/airflow/.local/lib/python3.11/site-packages/kubernetes/client/rest.py", 
line 168, in request
   2023-11-24T04:20:06.801955466Z     r = self.pool_manager.request(
   2023-11-24T04:20:06.801958232Z         ^^^^^^^^^^^^^^^^^^^^^^^^^^
   2023-11-24T04:20:06.801960514Z   File 
"/home/airflow/.local/lib/python3.11/site-packages/urllib3/request.py", line 
81, in request
   2023-11-24T04:20:06.801962900Z     return self.request_encode_body(
   2023-11-24T04:20:06.801965253Z            ^^^^^^^^^^^^^^^^^^^^^^^^^
   2023-11-24T04:20:06.801967510Z   File 
"/home/airflow/.local/lib/python3.11/site-packages/urllib3/request.py", line 
173, in request_encode_body
   2023-11-24T04:20:06.801969921Z     return self.urlopen(method, url, 
**extra_kw)
   2023-11-24T04:20:06.801972159Z            
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   2023-11-24T04:20:06.801974864Z   File 
"/home/airflow/.local/lib/python3.11/site-packages/urllib3/poolmanager.py", 
line 376, in urlopen
   2023-11-24T04:20:06.801977228Z     response = conn.urlopen(method, 
u.request_uri, **kw)
   2023-11-24T04:20:06.801979519Z                
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   2023-11-24T04:20:06.801981756Z   File 
"/home/airflow/.local/lib/python3.11/site-packages/urllib3/connectionpool.py", 
line 715, in urlopen
   2023-11-24T04:20:06.801984109Z     httplib_response = self._make_request(
   2023-11-24T04:20:06.801986431Z                        ^^^^^^^^^^^^^^^^^^^
   2023-11-24T04:20:06.801988730Z   File 
"/home/airflow/.local/lib/python3.11/site-packages/urllib3/connectionpool.py", 
line 467, in _make_request
   2023-11-24T04:20:06.801991136Z     six.raise_from(e, None)
   2023-11-24T04:20:06.801993395Z   File "<string>", line 3, in raise_from
   2023-11-24T04:20:06.801996129Z   File 
"/home/airflow/.local/lib/python3.11/site-packages/urllib3/connectionpool.py", 
line 462, in _make_request
   2023-11-24T04:20:06.801998489Z     httplib_response = conn.getresponse()
   2023-11-24T04:20:06.802000811Z                        ^^^^^^^^^^^^^^^^^^
   2023-11-24T04:20:06.802003046Z   File 
"/usr/local/lib/python3.11/http/client.py", line 1378, in getresponse
   2023-11-24T04:20:06.802005565Z     response.begin()
   2023-11-24T04:20:06.802008018Z   File 
"/usr/local/lib/python3.11/http/client.py", line 318, in begin
   2023-11-24T04:20:06.802010362Z     version, status, reason = 
self._read_status()
   2023-11-24T04:20:06.802012949Z                               
^^^^^^^^^^^^^^^^^^^
   2023-11-24T04:20:06.802015208Z   File 
"/usr/local/lib/python3.11/http/client.py", line 279, in _read_status
   2023-11-24T04:20:06.802017388Z     line = str(self.fp.readline(_MAXLINE + 
1), "iso-8859-1")
   2023-11-24T04:20:06.802019561Z                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   2023-11-24T04:20:06.802021792Z   File "/usr/local/lib/python3.11/socket.py", 
line 706, in readinto
   2023-11-24T04:20:06.802032537Z     return self._sock.recv_into(b)
   2023-11-24T04:20:06.802034849Z            ^^^^^^^^^^^^^^^^^^^^^^^
   2023-11-24T04:20:06.802037202Z   File "/usr/local/lib/python3.11/ssl.py", 
line 1311, in recv_into
   2023-11-24T04:20:06.802039465Z     return self.read(nbytes, buffer)
   2023-11-24T04:20:06.802041676Z            ^^^^^^^^^^^^^^^^^^^^^^^^^
   2023-11-24T04:20:06.802043835Z   File "/usr/local/lib/python3.11/ssl.py", 
line 1167, in read
   2023-11-24T04:20:06.802045996Z     return self._sslobj.read(len, buffer)
   2023-11-24T04:20:06.802048210Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   2023-11-24T04:20:06.802050459Z   File 
"/home/airflow/.local/lib/python3.11/site-packages/airflow/jobs/scheduler_job_runner.py",
 line 251, in _exit_gracefully
   2023-11-24T04:20:06.802052641Z     sys.exit(os.EX_OK)
   2023-11-24T04:20:06.802054965Z SystemExit: 0
   2023-11-24T04:20:06.802057156Z 
   2023-11-24T04:20:06.802059495Z During handling of the above exception, 
another exception occurred:
   2023-11-24T04:20:06.802061585Z 
   2023-11-24T04:20:06.802066599Z Traceback (most recent call last):
   2023-11-24T04:20:06.802068869Z   File 
"/home/airflow/.local/lib/python3.11/site-packages/airflow/jobs/scheduler_job_runner.py",
 line 845, in _execute
   2023-11-24T04:20:06.802071210Z     self._run_scheduler_loop()
   2023-11-24T04:20:06.802073476Z   File 
"/home/airflow/.local/lib/python3.11/site-packages/airflow/jobs/scheduler_job_runner.py",
 line 981, in _run_scheduler_loop
   2023-11-24T04:20:06.802075636Z     self.job.executor.heartbeat()
   2023-11-24T04:20:06.802077933Z   File 
"/home/airflow/.local/lib/python3.11/site-packages/airflow/executors/base_executor.py",
 line 237, in heartbeat
   2023-11-24T04:20:06.802080126Z     self.sync()
   2023-11-24T04:20:06.802082757Z   File 
"/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py",
 line 416, in sync
   2023-11-24T04:20:06.802084960Z     self.task_queue.task_done()
   2023-11-24T04:20:06.802087253Z   File "<string>", line 2, in task_done
   2023-11-24T04:20:06.802089463Z   File 
"/usr/local/lib/python3.11/multiprocessing/managers.py", line 821, in 
_callmethod
   2023-11-24T04:20:06.802091658Z     conn.send((self._id, methodname, args, 
kwds))
   2023-11-24T04:20:06.802093901Z   File 
"/usr/local/lib/python3.11/multiprocessing/connection.py", line 206, in send
   2023-11-24T04:20:06.802096154Z     
self._send_bytes(_ForkingPickler.dumps(obj))
   2023-11-24T04:20:06.802098712Z   File 
"/usr/local/lib/python3.11/multiprocessing/connection.py", line 427, in 
_send_bytes
   2023-11-24T04:20:06.802100955Z     self._send(header + buf)
   2023-11-24T04:20:06.802103273Z   File 
"/usr/local/lib/python3.11/multiprocessing/connection.py", line 384, in _send
   2023-11-24T04:20:06.802105612Z     n = write(self._handle, buf)
   2023-11-24T04:20:06.802107767Z         ^^^^^^^^^^^^^^^^^^^^^^^^
   2023-11-24T04:20:06.802112375Z BrokenPipeError: [Errno 32] Broken pipe
   2023-11-24T04:20:06.802120372Z {"asctime": "2023-11-24T05:20:06.801+0100", 
"filename": "kubernetes_executor.py", "lineno": 695, "levelname": "INFO", 
"message": "Shutting down Kubernetes executor"}
   ```
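
Reading the first traceback bottom-up: the SIGTERM handler (`_exit_gracefully`) called `sys.exit` while the main thread was still inside `create_namespaced_pod`, waiting on the socket for the API server's response. The sketch below reproduces only that mechanism in plain Python (3.8+ for `signal.raise_signal`). It is not Airflow code, and `exit_gracefully` and `interrupted_call` are hypothetical names chosen for illustration.

```python
import signal
import sys


def exit_gracefully(signum, frame):
    # Same shape as the handler in the traceback: exit with status 0.
    sys.exit(0)


def interrupted_call() -> str:
    """Raise SIGTERM mid-function and report how the function ended."""
    previous = signal.signal(signal.SIGTERM, exit_gracefully)
    try:
        # Stand-in for the signal arriving while a blocking call is in flight.
        # Python runs signal handlers in the main thread, so the SystemExit
        # surfaces at whatever statement the main thread is executing.
        signal.raise_signal(signal.SIGTERM)
        return "completed"
    except SystemExit as exc:
        return f"interrupted by SystemExit({exc.code})"
    finally:
        # Restore the previous handler so the demo has no side effects.
        signal.signal(signal.SIGTERM, previous)


if __name__ == "__main__":
    print(interrupted_call())
```

In the scheduler's case the interrupted statement was a network read, so the traceback alone does not show whether the API server had already accepted the pod creation request.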
   
   **TASK ATTEMPT 1**
   ```
   {"asctime": "2023-11-24, 05:17:30 CET", "filename": "taskinstance.py", 
"lineno": 1359, "levelname": "INFO", "message": "Starting attempt 1 of 1"}
   ...
   {"asctime": "2023-11-24, 05:18:02 CET", "filename": 
"local_task_job_runner.py", "lineno": 228, "levelname": "INFO", "message": 
"Task exited with return code 0"}
   ```
   **TASK ATTEMPT 2**
   ```
   {"asctime": "2023-11-24, 05:27:26 CET", "filename": "taskinstance.py", 
"lineno": 1157, "levelname": "INFO", "message": "Dependencies all met for 
dep_context=requeueable deps ti=<TaskInstance: _redacted_ 
scheduled__2023-11-23T00:00:00+00:00 [queued]>"}
   {"asctime": "2023-11-24, 05:27:26 CET", "filename": "taskinstance.py", 
"lineno": 1359, "levelname": "INFO", "message": "Starting attempt **2 of 1**"}
   {"asctime": "2023-11-24, 05:27:26 CET", "filename": "taskinstance.py", 
"lineno": 1380, "levelname": "INFO", "message": "Executing 
<Task(AzureDataFactoryRunPipelineOperator): _redacted_> on 2023-11-23 
00:00:00+00:00"}
   ```
   
   Retry configuration of task:
   
![image](https://github.com/apache/airflow/assets/94971679/118bbba0-c2b9-4f36-86dc-1d3ec2b1b2d3)
   
   
   ### What you think should happen instead
   
   Tasks should not be retried when `retries=0`.
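
One way to state the expectation concretely: in the task log's "Starting attempt N of M" message, N should never exceed M. The following is a minimal sketch for scanning JSON-formatted task logs for that condition. The helper name `over_limit_attempts` is hypothetical, the sample records are trimmed copies of the log lines quoted above, and the code assumes every line is a JSON object with a `message` field.

```python
import json
import re

# Matches the attempt counter in the task log message.
ATTEMPT_RE = re.compile(r"Starting attempt (\d+) of (\d+)")


def over_limit_attempts(log_lines):
    """Yield (attempt, limit) for each record whose attempt exceeds its limit."""
    for line in log_lines:
        record = json.loads(line)
        match = ATTEMPT_RE.search(record.get("message", ""))
        if match:
            attempt, limit = int(match.group(1)), int(match.group(2))
            if attempt > limit:
                yield attempt, limit


# Trimmed versions of the two "Starting attempt" records from this report.
sample = [
    '{"levelname": "INFO", "message": "Starting attempt 1 of 1"}',
    '{"levelname": "INFO", "message": "Starting attempt 2 of 1"}',
]

print(list(over_limit_attempts(sample)))  # [(2, 1)]
```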
   
   ### How to reproduce
   
   Not entirely sure. This happens every once in a while during our nightly 
loads. My assumption is that the health checks for the Airflow scheduler fail, 
the scheduler crashes, and it loses track of the tasks that were queued.
   
   ### Operating System
   
   Debian GNU/Linux 11 (bullseye)
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-cncf-kubernetes==7.8.0
   apache-airflow-providers-common-sql==1.8.0
   apache-airflow-providers-databricks==4.7.0
   apache-airflow-providers-docker==3.8.0
   apache-airflow-providers-elasticsearch==5.0.1
   apache-airflow-providers-ftp==3.6.0
   apache-airflow-providers-http==4.6.0
   apache-airflow-providers-imap==3.4.0
   apache-airflow-providers-microsoft-azure==8.1.0
   apache-airflow-providers-microsoft-mssql==3.5.0
   apache-airflow-providers-odbc==4.1.0
   apache-airflow-providers-postgres==5.7.1
   apache-airflow-providers-sqlite==3.5.0
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   Deployment via KubernetesExecutor, with the following configuration for the 
scheduler:
   ```
   scheduler:
     replicas: 3
     resources:
       limits:
         cpu: 3
       requests:
         cpu: 1
     livenessProbe:
       timeoutSeconds: 120
       failureThreshold: 8
   ```
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


