scaoupgrade commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2269606223

   I have been following this thread recently because we also hit this 
issue on Airflow `2.8.4`. We have been running this version for over two 
months, and this is the first time I have seen this error, which may suggest 
that the issue occurs less often on `2.8.X`.
   
   I see two issues being discussed in this thread:
   
   1. The airflow scheduler complains about:  `Executor reports task instance 
<TaskInstance: (...)>  finished (failed) although the task says it's queued. 
(Info: None) Was the task killed externally?`
   
   2. The Airflow worker throws this error:  
`airflow.exceptions.AirflowException: Celery command failed on host: xxxx with 
celery_task_id xxxxx`
   
   Based on the logs from when the issue occurred the other day, these two 
are not the same issue.
   
   Issue 2 happens frequently: I see about 1600 such error messages per day, 
and the daily count is stable.
   
   Thanks @potiuk for providing a fix. 
https://github.com/apache/airflow/pull/41260/files could address issue 2, but 
issue 1 appears to be something else.
   
   On the day the incident happened on our platform, I saw a burst of 
messages like:  `Executor reports task instance <TaskInstance: (...)>  finished 
(failed) although the task says it's queued. (Info: None) Was the task killed 
externally?`, while the number of "Celery command failed" errors in the worker 
log remained stable (around 1600 messages).
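
   One way to check that the two messages move independently is to count them 
per day and compare the trends. The following is only a minimal sketch: the 
`(day, line)` input shape and the `issue1`/`issue2` labels are assumptions for 
illustration, and the substrings are taken from the two messages quoted above.

```python
from collections import Counter

# Substrings taken from the two messages quoted in this comment.
SCHEDULER_MSG = "finished (failed) although the task says it's queued"
WORKER_MSG = "Celery command failed on host"


def classify(line):
    """Return 'issue1', 'issue2', or None for a single log line."""
    if SCHEDULER_MSG in line:
        return "issue1"
    if WORKER_MSG in line:
        return "issue2"
    return None


def count_by_day(records):
    """Count both messages per day.

    `records` is an iterable of (day, line) pairs, where `day` is any
    hashable label (hypothetical input shape; adapt to your log store).
    Returns {day: Counter({'issue1': n, 'issue2': m})}.
    """
    counts = {}
    for day, line in records:
        kind = classify(line)
        if kind is not None:
            counts.setdefault(day, Counter())[kind] += 1
    return counts
```

   If the `issue2` count stays flat while `issue1` spikes on the incident day, 
that matches what I observed.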
   
   Looking at the scheduler log from when the issue happened, I noticed this 
pattern repeated multiple times for the same task in a given DAG:
   ```
   {"log":"\t<TaskInstance: xxxxx scheduled__2024-07-30T20:43:00+00:00 [scheduled]>","stream":"stdout","timestamp":1722380423388}
   {"log":"\t<TaskInstance: xxxxx scheduled__2024-07-30T20:43:00+00:00 [scheduled]>","stream":"stdout","timestamp":1722380423388}
   {"log":"\t<TaskInstance: xxxxx scheduled__2024-07-30T20:43:00+00:00 [scheduled]>","stream":"stdout","timestamp":1722380423388}
   {"log":"\t<TaskInstance: xxxxx scheduled__2024-07-30T20:43:00+00:00 [scheduled]>","stream":"stdout","timestamp":1722380423388}
   {"log":"\t<TaskInstance: xxxxx scheduled__2024-07-30T20:43:00+00:00 [scheduled]>","stream":"stdout","timestamp":1722380423388}
   {"log":"\t<TaskInstance: xxxxx scheduled__2024-07-30T20:43:00+00:00 [scheduled]>","stream":"stdout","timestamp":1722380423388}
   ```
   The same line is repeated hundreds of times for each task in that DAG, 
which seems abnormal.
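
   To quantify the repetition, something like the sketch below could count how 
often each task instance line appears. It assumes each scheduler log line is a 
JSON object with a `log` field (with any quote doubling from a CSV export 
already undone); `count_repeats` and the regex are hypothetical helpers, not 
part of Airflow.

```python
import json
import re
from collections import Counter

# Matches e.g. "<TaskInstance: xxxxx scheduled__2024-... [scheduled]>"
TI_PATTERN = re.compile(r"<TaskInstance: (?P<ti>.+?) \[(?P<state>\w+)\]>")


def count_repeats(json_lines):
    """Count occurrences of each (task instance, state) in JSON log lines.

    Lines that are not valid JSON objects, or that have no string `log`
    field, are skipped.
    """
    counts = Counter()
    for raw in json_lines:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        message = record.get("log", "")
        if not isinstance(message, str):
            continue
        match = TI_PATTERN.search(message)
        if match:
            counts[(match.group("ti"), match.group("state"))] += 1
    return counts
```

   Calling `count_repeats(lines).most_common(10)` would then list the task 
instances that are repeated most often.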
   
   It looks like the scheduler DAG processor ran into a problem and something 
failed during the scheduling phase.
   
   


