I really like this idea as well! One of _the most common_ questions I get from people managing an Airflow env is "Why is my task stuck in state X?". Anything we can do to make that more discoverable and user-friendly, especially in the UI instead of (or in addition to) logs, would be fantastic!
Thanks to Jens for having a think and pointing out a lot of the implications. I agree a quick AIP might be nice for this one.

Cheers,
Niko

________________________________
From: Scheffler Jens (XC-DX/ETV5) <jens.scheff...@de.bosch.com.INVALID>
Sent: Thursday, September 28, 2023 10:36:00 PM
To: dev@airflow.apache.org
Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] The "no_status" state

Hi Ryan,

I really like the idea of exposing some more scheduler details. More transparency about scheduling, also in the UI, would help users (1) see and understand what is going on and (2) reduce the need to crawl through logs and raise support tickets when a status looks “strange”. I often see this as a problem too. It also sometimes generates a bit of mistrust in the scheduler's stability.

Regarding scheduler “overhead”: as long as we are not doing a “full scan” just to ensure that each and every task is always up to date (today the scheduler stops processing once enough tasks have been processed in a loop or when scheduling limits are reached), this is OK for me, and on the code side it does not seem to be much overhead.

On the other hand, I am a bit worried that many frequent updates would need to happen on the DB, as another state would need to be written, so more DB round trips are needed. This might hit performance for large DAGs or cases where DAGs are scheduled. So at the very least the state would need to be written to the DB only if it changed, to keep the performance impact minimal.
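A minimal sketch of the "write only if changed" idea, in plain Python rather than Airflow's actual scheduler or ORM code. `StateStore`, `set_state_if_changed` and the state names are hypothetical placeholders; the dictionary stands in for the metadata DB and the counter stands in for round trips.

```python
from typing import Dict, Optional


class StateStore:
    """Hypothetical stand-in for the metadata DB; counts writes to make round trips visible."""

    def __init__(self) -> None:
        self.states: Dict[str, Optional[str]] = {}
        self.writes = 0

    def set_state_if_changed(self, task_id: str, new_state: Optional[str]) -> bool:
        """Persist new_state only when it differs from what is already stored."""
        if task_id in self.states and self.states[task_id] == new_state:
            return False  # unchanged: skip the DB round trip
        self.states[task_id] = new_state
        self.writes += 1
        return True


store = StateStore()
# 100 scheduler loops re-evaluate the same blocked task and reach the same verdict.
for _ in range(100):
    store.set_state_if_changed("task_a", "not_ready")
# The task finally becomes schedulable, so the state really changes once more.
store.set_state_if_changed("task_a", "scheduled")
print(store.writes)  # prints 2
```

With this filter, the number of writes follows the number of real state transitions rather than the number of scheduler loops. In a real deployment the comparison would also need the current state to be available without an extra query, which this sketch does not address.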
Regarding naming, I still think “no status” is good for indicating that the scheduler has not digested anything yet; maybe the task was never looked at because the scheduler is actually stuck or too busy to get there. I would propose that if the scheduler passes over a task and decides it is not ready to be scheduled, there be an additional state, called e.g. “not_ready”, in the state model between “none” and “scheduled”.

Finally, on the other hand, I am not sure whether adding another state to the model will 100% help in the use case you describe. You might still need to scratch your head for a while when the UI shows that a DAG is “stuck”, until you work through all the options you have configured. Exposing “why is it stuck” in a user-friendly manner might be another level of complexity in this case.

As the state model might touch a lot of code and a longer discussion might be needed, would we need to raise an AIP for this? Might there be a lot more (external, provider??) dependencies affected by adjusting the state model?

Best regards,
Jens Scheffler
Robert Bosch GmbH

From: Ryan Hatter <ryan.hat...@astronomer.io.INVALID>
Sent: Thursday, 28 September 2023 23:59
To: dev@airflow.apache.org
Subject: The "no_status" state

Over the last couple of weeks I've come across a rather tricky problem a few times. One DAG run gets "stuck" in the queued state, while subsequent DAG runs are stuck in running (screenshot below).
One of these issues was caused by `max_active_runs` being met when a task instance from a previous DAG run was cleared, and one of the tasks had `depends_on_past=True`. This caused the DAG run to be stuck in queued in perpetuity, until it was realized that the task that wasn't getting scheduled needed the failed task in the preceding DAG run to be re-run. That in turn caused the subsequent DAG runs to be stuck in running, which caused quite a bit of confusion and stress.

Given that Airflow is pretty burnt out on task instance states and colors, I propose replacing "no_status" with "dependencies_not_met" and surfacing dependencies in the grid view instead of forcing users to already know where to look (i.e. "more details" task instance details). Now that I've typed it out, I'm not sure there should be a reason for the "more details" button rather than just laying out all of a task instance's details in the grid view, similar to how the graph and code views are now included in the grid view.

Anyway, I wanted to solicit feedback before I open an issue / start work on this.

[cid:ii_ln3phzoe0]
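To make the "dependencies_not_met" proposal a bit more concrete, here is a rough sketch of how unmet dependencies could be turned into readable reasons for the grid view. It is plain Python, not Airflow's real dependency-checking code: `TaskSnapshot`, `unmet_dependencies` and all field names are hypothetical, and only the two conditions from the incident above plus upstream state are modelled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskSnapshot:
    """Hypothetical, simplified view of what the scheduler knows about one task instance."""
    depends_on_past: bool
    has_previous_run: bool
    previous_ti_state: Optional[str]  # same task in the preceding DAG run; None if cleared
    upstream_states: List[str]
    active_runs: int
    max_active_runs: int
    run_is_active: bool  # True if this task's own DAG run already counts as active


def unmet_dependencies(ti: TaskSnapshot) -> List[str]:
    """Return one human-readable reason per unmet dependency (empty list means ready)."""
    reasons: List[str] = []
    if (ti.depends_on_past and ti.has_previous_run
            and ti.previous_ti_state not in ("success", "skipped")):
        reasons.append(
            f"depends_on_past=True, but the previous run's task instance is "
            f"in state {ti.previous_ti_state!r}"
        )
    if not ti.run_is_active and ti.active_runs >= ti.max_active_runs:
        reasons.append(
            f"max_active_runs reached ({ti.active_runs}/{ti.max_active_runs})"
        )
    if any(state != "success" for state in ti.upstream_states):
        reasons.append("not all upstream tasks have succeeded")
    return reasons


# Roughly the situation described above: a failed task in the preceding run,
# and the run limit already used up by other runs.
stuck = TaskSnapshot(
    depends_on_past=True, has_previous_run=True, previous_ti_state="failed",
    upstream_states=["success"], active_runs=3, max_active_runs=3, run_is_active=False,
)
for reason in unmet_dependencies(stuck):
    print("dependencies_not_met:", reason)
```

Returning a list of reasons instead of a single state keeps the state model small (one "dependencies_not_met" state) while still letting the UI show why a task is waiting.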