I really like this idea as well! One of the _most common_ questions I get 
from people managing an Airflow env is "Why is my task stuck in state X?". 
Anything we can do to make that more discoverable and user friendly, especially 
in the UI instead of (or in addition to) logs, would be fantastic!

Thanks to Jens for having a think and pointing out a lot of the implications. I 
agree a quick AIP might be nice for this one.

Cheers,
Niko

________________________________
From: Scheffler Jens (XC-DX/ETV5) <jens.scheff...@de.bosch.com.INVALID>
Sent: Thursday, September 28, 2023 10:36:00 PM
To: dev@airflow.apache.org
Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] The "no_status" state

Hi Ryan,

I really like the idea of exposing some more scheduler details. More 
transparency about scheduling in the UI would help users (1) see and 
understand what is going on and (2) avoid crawling through logs and raising 
support tickets when a status looks “strange”. I often see this as a problem 
as well. It also sometimes generates a bit of mistrust in the scheduler's 
stability.

From the point of view of scheduler “overhead”, I assume that as long as we are 
not doing a “full scan” just to ensure that each and every task is always up to 
date (today the scheduler stops processing after enough tasks have been 
processed in a loop or when scheduling limits are reached), this is OK for me, 
and on the code side it does not seem to be much overhead.
On the other hand, I am a bit worried that many frequent updates would need to 
happen on the DB, as another state would need to be written, so more DB round 
trips are needed. This might hit performance for large DAGs or cases where DAGs 
are scheduled. So at the very least, the state should be written to the DB only 
if it has changed, to keep the performance impact minimal.
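
A minimal sketch of the "write only if changed" idea, not Airflow code. The 
`FakeTaskStateStore` class and its `set_state` method are hypothetical 
stand-ins for the metadata DB; the point is only that repeated identical 
verdicts from the scheduler loop cost one write instead of many.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FakeTaskStateStore:
    """Hypothetical stand-in for the metadata DB; it only counts writes."""

    states: Dict[str, Optional[str]]
    writes: int = 0

    def set_state(self, task_id: str, new_state: Optional[str]) -> bool:
        """Persist new_state only if it differs; return True if a write happened."""
        if self.states.get(task_id) == new_state:
            return False  # unchanged: skip the DB round trip
        self.states[task_id] = new_state
        self.writes += 1
        return True


store = FakeTaskStateStore(states={"t1": None})

# Simulate five scheduler loops that all reach the same verdict for the task.
for _ in range(5):
    store.set_state("t1", "not_ready")

print(store.writes)
```

In a real database the same effect could come from comparing against the 
already-loaded row before issuing an update; that detail is an assumption here 
and would need checking against how the scheduler actually loads task instances.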

On naming, I still think “no status” is good to indicate that the scheduler 
has not processed the task at all; maybe the task was never looked at because 
the scheduler really is stuck or too busy to get there. If the scheduler passes 
over a task and decides that it is not ready to be scheduled, I would propose 
an additional state, e.g. “not_ready”, in the state model between “none” and 
“scheduled”.
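
To make the proposed ordering concrete, here is a small sketch under stated 
assumptions. `SketchState` and `next_state` are hypothetical names, not 
Airflow's own state classes, and only the transitions discussed above are 
modelled: a task stays at "no status" (`None`) until the scheduler has 
actually examined it.

```python
from enum import Enum
from typing import Optional


class SketchState(Enum):
    """Hypothetical subset of task states; None stands for "no status"."""

    NOT_READY = "not_ready"  # proposed: scheduler looked, task cannot run yet
    SCHEDULED = "scheduled"


def next_state(
    current: Optional[SketchState], examined: bool, deps_met: bool
) -> Optional[SketchState]:
    """Return the state after one scheduler pass over a task."""
    if not examined:
        # The scheduler never got to this task, so nothing is recorded.
        return current
    if current is None or current is SketchState.NOT_READY:
        return SketchState.SCHEDULED if deps_met else SketchState.NOT_READY
    return current  # already scheduled: this sketch leaves it alone


# Never examined: stays at "no status".
print(next_state(None, examined=False, deps_met=False))
# Examined but blocked: moves to the proposed state.
print(next_state(None, examined=True, deps_met=False).value)
# Examined again once dependencies are met: moves on to scheduled.
print(next_state(SketchState.NOT_READY, examined=True, deps_met=True).value)
```

With this split, "no status" would mean the scheduler has not reached the task, 
while "not_ready" would mean it looked and decided to wait.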

Finally, on the other hand, I am not sure that adding another state to the 
model will 100% help in the use case you describe. You might still need to 
scratch your head for a while when the UI shows that a DAG is “stuck”, until 
you have gone through all the options you have configured. Exposing “why is it 
stuck” in a user-friendly manner might be another level of complexity in this 
case.

As a change to the state model might touch a lot of code and might need a 
longer discussion, would it be necessary to raise an AIP for this? There might 
also be a lot more (external, provider??) dependencies that need adjusting to 
the state model.

Mit freundlichen Grüßen / Best regards

Jens Scheffler

Deterministik open Loop (XC-DX/ETV5)
Robert Bosch GmbH | Hessbruehlstraße 21 | 70565 Stuttgart-Vaihingen | GERMANY | 
www.bosch.com<http://www.bosch.com>
Tel. +49 711 811-91508 | Mobil +49 160 90417410 | 
jens.scheff...@de.bosch.com<mailto:jens.scheff...@de.bosch.com>

From: Ryan Hatter <ryan.hat...@astronomer.io.INVALID>
Sent: Thursday, 28 September 2023 23:59
To: dev@airflow.apache.org
Subject: The "no_status" state

Over the last couple of weeks I've come across a rather tricky problem a few 
times. One DAG run gets "stuck" in the queued state, while subsequent DAG runs 
are stuck running (screenshot below). One of these issues was caused by 
`max_active_runs` being met when a task instance from a previous DAG run was 
cleared, and one of the tasks had `depends_on_past=True`. This caused the DAG 
run to be stuck in queued in perpetuity, until it was realized that the task 
that wasn't getting scheduled needed the failed task in the preceding DAG run 
to be re-run, which in turn caused the stuck running DAG runs to be stuck in 
running. This caused quite a bit of confusion and stress.
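
As a rough illustration of that deadlock, here is a toy model in plain Python. 
It is not Airflow code: `Run` and `explain` are hypothetical names, each run is 
assumed to have a single `depends_on_past` task, and the rules are simplified 
to the two conditions described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Run:
    run_id: str
    state: str  # "queued" or "running"
    task_state: Optional[str]  # state of the one depends_on_past task; None = no status


def explain(runs: List[Run], max_active_runs: int) -> List[str]:
    """Return one readable reason per blocked run (runs are in date order)."""
    running = sum(1 for r in runs if r.state == "running")
    reasons = []
    for i, run in enumerate(runs):
        if run.state == "queued":
            if running >= max_active_runs:
                reasons.append(
                    f"{run.run_id}: queued, max_active_runs={max_active_runs} reached"
                )
        elif run.task_state is None and i > 0:
            previous = runs[i - 1].task_state
            if previous not in ("success", "skipped"):
                reasons.append(
                    f"{run.run_id}: task waits on depends_on_past, "
                    f"previous task state is {previous}"
                )
    return reasons


runs = [
    Run("run_1", "queued", None),   # cleared run, cannot start again
    Run("run_2", "running", None),  # holds a slot, waits on run_1's task
    Run("run_3", "running", None),  # holds a slot, waits on run_2's task
]

for line in explain(runs, max_active_runs=2):
    print(line)
```

Each run is blocked by something only another blocked run can resolve, which is 
the kind of reason that is currently hard to spot from the UI.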

Given that Airflow is pretty burnt out on task instance states and colors, I 
propose replacing "no_status" with "dependencies_not_met" and surfacing 
dependencies in the grid view instead of forcing users to already know where to 
look (i.e. "more details" in the task instance details). Now that I've typed it 
out, I'm not sure there is a reason to keep the "more details" button rather 
than just laying out all of a task instance's details in the grid view, similar 
to how the graph and code views are now included in the grid view.

Anyway, I wanted to solicit feedback before I open an issue / start work on 
this.

[Screenshot: one DAG run stuck in queued, subsequent DAG runs stuck in running; image not preserved]
