potiuk commented on issue #39717: URL: https://github.com/apache/airflow/issues/39717#issuecomment-2217496537
I see 2 different problems in this issue:

> 1 - the task is never executed ( it is queued but the scheduler does not launch it) and this is the case where you have an external_task_id but you have no reference of it see it in the worker ( celery/flower );

This might indicate that somewhere on the way, the task has been lost. I think there might be a small race condition between acknowledging the task and running the task by celery, but that would be correlated with, for example, the celery worker being killed (for example by ephemeral machine eviction etc.), so there should be a correlated event somewhere in the deployment.

> 2 - the task is executed, the worker "tries" or launches it but something in the execution ( either in fork or in new process ) messes up the return value in the os.waitpid().

The curious part here is that for Airflow the task was executed with success even though we see the failure in celery/flower. And here there is the "mini-scheduler" that happens after the state of the task is set to "true" (which is an easy thing - "schedule-after-task-execution"), which is another thing that might be checked. But if you `can't` disable it then it's a good idea to see if there is any log in the celery task or deployment that would indicate that THIS is the reason, because so far it's a hypothesis. Again, it could for example be a celery task being evicted for whatever reason before saving the state and returning with "success", and before the celery master is able to record the success.

In this case MAYBE a solution to avoid the failed celery status is to ignore errors coming from the "mini-scheduler". But in order to do it, we need to have some indication of the error; then we can ignore it, and users experiencing it may apply a patch and (following my bisection example) check whether it fixes the problem.

So ... overall, we need to combine findings and hypotheses coming from digging deeper (by our users) with attempts to apply hypothetical fixes if we can come up with some.
This is pretty much the only way to be able to address the issue.
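
To make the "ignore errors coming from the mini-scheduler" idea a bit more concrete, here is a minimal Python sketch of the general pattern. It is not Airflow code and not a proposed patch: `run_task_with_guarded_follow_up`, `run_task` and `follow_up` are hypothetical names made up for illustration, and the assumption is only that some best-effort step runs after the task itself has already succeeded. The point is that the error gets logged with a traceback (so we have an indication of it) but does not change the outcome of the task.

```python
import logging

log = logging.getLogger("mini_scheduler_sketch")


def run_task_with_guarded_follow_up(run_task, follow_up):
    """Run `run_task`, then `follow_up`; a failing follow-up is logged, not raised.

    Both callables are hypothetical placeholders, not Airflow APIs.
    Returns (task_result, follow_up_error); follow_up_error is None when
    the follow-up step succeeded.
    """
    # Errors raised by the task itself still propagate to the caller.
    result = run_task()
    try:
        follow_up()
    except Exception as exc:  # deliberately broad: this step is best-effort
        # Log with traceback so the failure leaves evidence behind.
        log.exception("follow-up step failed, ignoring")
        return result, exc
    return result, None


def flaky_follow_up():
    # Stand-in for a post-task step that blows up.
    raise RuntimeError("simulated follow-up failure")


result, error = run_task_with_guarded_follow_up(lambda: "success", flaky_follow_up)
```

With something like this in place, a user hitting the problem could look for the logged traceback to confirm or rule out the hypothesis, since the task result is returned unchanged either way.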