amoghrajesh commented on issue #65011: URL: https://github.com/apache/airflow/issues/65011#issuecomment-4257842253
Thanks for the detailed logs. The trigger completing successfully rules out the trigger error situation I mentioned earlier. Looking at this again and one the thing really stands out is that `glue_job_run_details` gets a 409 within 4 seconds of try 1 starting on a fresh run. That key is written at the very beginning of `execute()` (deferrable or non deferrable), so for it to already exist in the DB at that point, something must have written it just before try 1. The most plausible explanation is a prior execution of the same task that the scheduler might have killed for missing a heartbeat or such, the worker was still alive and committing its xcoms while the newer try cleanup query had already run, so the cleanup missed those keys. To confirm, could you grep the scheduler logs for "Detected a task instance without a heartbeat" in that window? If there is such a find, that is the prior execution whose in the flight xcom write collided with try 1. Also useful, what is the value of `task_instance_heartbeat_timeout` and how many celery workers are you running? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
