trlopes1974 commented on issue #39717: URL: https://github.com/apache/airflow/issues/39717#issuecomment-2217421431
Hmm. I'm not sure I agree with the "configuration specific" framing of this problem. It is now clear that this happens across several setups: Kubernetes, Celery/Redis (we use RabbitMQ). Some users have clearly stated that raising `task_adoption_timeout` (to 2 hours or so) fixed their issue, and this gives me migraines, as it makes no sense (in my mind) how a timeout value can interact with the scheduling/execution of tasks. In the logs I last provided, you can see that after 10 minutes the task is marked as failed, yet there is no evidence that it ever left the queued state. Could this be a logic failure in the scheduler/worker? (I see no concurrency or resource-exhaustion issues in our setup.)

I see two different problems in this issue:

1. The task is never executed: it is queued, but the scheduler never launches it. This is the case where the task has an `external_task_id` but there is no trace of it in the worker (Celery/Flower).
2. The task is executed: the worker launches (or "tries") it, but something in the execution (either in the fork or in the new process) corrupts the return value from `os.waitpid()`. The curious part here is that, as far as Airflow is concerned, the task executed successfully, even though we see the failure in Celery/Flower.

Yes, it seems this is one of those bugs that keeps hiding in several places and will be hard to pin down. The good news (in our case) is that it keeps happening from time to time, randomly, on different tasks. One curious detail: in our case it affects only a few DAGs and not others.
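For anyone wanting to try the workaround other reporters described: assuming I understood their setups correctly, the setting lives in the `[celery]` section of `airflow.cfg` (or the matching environment variable). The 2-hour value below is just the figure mentioned in this thread, not a recommendation; note the default is 600 seconds, which lines up suspiciously with the 10-minute mark at which our queued task is declared failed.

```ini
# airflow.cfg -- workaround some reporters in this thread used (value is illustrative)
[celery]
# Default is 600 (seconds). The failures described above appear right at that mark.
task_adoption_timeout = 7200
```

The same option can be set via the environment variable `AIRFLOW__CELERY__TASK_ADOPTION_TIMEOUT=7200`.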
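On problem 2: a minimal sketch of why a raw `os.waitpid()` status is easy to misread if it isn't decoded before use. This is a plain-Python illustration of the POSIX semantics, not Airflow or Celery code; `run_and_report` is a hypothetical helper (Unix-only, since it uses `os.fork`).

```python
import os

def run_and_report():
    """Fork a child that fails, then decode what os.waitpid() returns."""
    pid = os.fork()
    if pid == 0:
        # Child: simulate a task process exiting with a failure code.
        os._exit(3)
    # Parent: waitpid returns (pid, raw_status). The raw status packs
    # exit code and signal info together -- on Linux, exit code 3
    # comes back as 3 << 8 == 768, so comparing it to 0 or to the
    # exit code directly gives the wrong answer.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)   # the real exit code
    if os.WIFSIGNALED(status):
        return -os.WTERMSIG(status)     # negative = killed by a signal
    return None
```

If any layer in the fork/exec path returns the raw status (or a truncated version of it) where a decoded exit code is expected, you get exactly the kind of success/failure disagreement described above.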