potiuk commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2217496537

   I see 2 different problems in this issue:
   > 1 - the task is never executed (it is queued but the scheduler does not 
launch it), and this is the case where you have an external_task_id but you 
see no reference to it in the worker (celery/flower);
   
   This might indicate that somewhere along the way the task has been lost. I 
think there might be a small race condition between acknowledging the task and 
running it in celery, but that would be correlated with, for example, a 
celery worker being killed (by ephemeral machine eviction etc.), so there 
should be a correlated event somewhere in the deployment.
   
   > 2 - the task is executed, the worker "tries" or launches it but something 
in the execution ( either in fork or in new process ) messes up the return 
value in the os.waitpid(). The curious part here is that for Airflow the task 
was executed with success despite that we see the failure in celery/flower.
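
   To make the os.waitpid() part concrete, here is a minimal, generic sketch of how a parent process reads the outcome of a forked child. It is not Airflow's or Celery's code: `run_in_fork`, `succeed`, `fail` and `get_killed` are hypothetical names made up for illustration, and `os.waitstatus_to_exitcode` assumes Python 3.9+ on a POSIX system.

```python
import os
import signal


def run_in_fork(fn):
    """Run fn() in a forked child; return the exit code the parent observes."""
    pid = os.fork()
    if pid == 0:
        # Child process: translate the outcome of fn() into an exit code.
        # os._exit() skips cleanup handlers inherited from the parent.
        try:
            fn()
        except BaseException:
            os._exit(1)
        os._exit(0)
    # Parent process: block until the child ends and decode the raw status.
    _, status = os.waitpid(pid, 0)
    # 0 = clean exit, N > 0 = exit code N, -N = killed by signal N.
    return os.waitstatus_to_exitcode(status)


def succeed():
    pass


def fail():
    raise RuntimeError("simulated task failure")


def get_killed():
    # Simulates an eviction-style kill that the child cannot intercept.
    os.kill(os.getpid(), signal.SIGKILL)
```

   The point of the sketch is that the parent only sees what the child's exit status says: a child that did its work and was then killed before exiting cleanly looks like a failure to the parent, even if the work itself was recorded as done elsewhere.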
   
   And here there is the "mini-scheduler" that runs after the state of the task 
is set, if "schedule-after-task-execution" is enabled (which is easy to turn 
off). That is another thing that might be checked. But if you can't disable it, 
then it's a good idea to see if there is any log in the celery task or the 
deployment that would indicate that THIS is the reason, because so far it's 
only a hypothesis. Again, it could for example be a celery task being evicted 
for whatever reason after saving the state but before returning with "success" 
and the celery master being able to record the success. In this case MAYBE a 
solution to avoid the failed celery status is to ignore errors coming from the 
"mini-scheduler", but in order to do that we need to have some indication of 
the error. Then we can ignore it, and users experiencing it may apply a patch 
and (following my bisection example) check whether it fixes the problem.
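
   As a rough illustration of what "ignore errors coming from the mini-scheduler" could look like, here is a minimal sketch. It is an assumption about the shape of such a patch, not Airflow code: `finish_task`, `mark_success` and `run_mini_scheduler` are hypothetical stand-ins.

```python
import logging

log = logging.getLogger("mini_scheduler_sketch")


def finish_task(mark_success, run_mini_scheduler):
    """Record task success first, then run the optional follow-up step.

    Both arguments are hypothetical callables, not Airflow functions.
    Returns True if the follow-up step ran cleanly, False if it failed.
    """
    mark_success()
    try:
        run_mini_scheduler()
    except Exception:
        # Log with traceback so the failure leaves an indication behind,
        # then swallow it so it cannot turn a successful task into a failure.
        log.exception("mini-scheduler step failed; ignoring")
        return False
    return True
```

   The logged traceback is the "indication of the error" mentioned above: without it, there would be nothing to confirm or reject the hypothesis.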
   
   So, overall, we need to combine findings and hypotheses coming from 
digging deeper (by our users) with attempts to apply hypothetical fixes, if we 
can come up with some. This is pretty much the only way to be able to address 
the issue.

