I think a good first step that wouldn't require an AIP would be surfacing a task instance's dependencies in the task instance details "sub-view" of the grid view. I've created an issue here: https://github.com/apache/airflow/issues/35935
On Fri, Oct 20, 2023 at 9:35 AM Pierre Jeambrun <[email protected]> wrote:

> Seems like a good idea. Some kind of "task diagnosis" that gives users more context when the state is not settled.
>
> Happy to help on that one as well. I also think that a small AIP is required, since the scope of change could be substantial.
>
> Best regards,
> Pierre
>
> On Thu, Oct 19, 2023 at 17:05, Brent Bovenzi <[email protected]> wrote:
>
> > Like Jarek said, some of these dependencies might take a lot of work to surface correctly. But I am happy to improve the grid and graph to show more information, like integrating rendered_templates and more details into the Grid view. Would you mind opening a GitHub issue for some of those smaller tasks so I don't forget to do them?
> >
> > I am also playing with some ways to show datasets and other external dependencies better in the grid/graph view.
> >
> > On Thu, Oct 19, 2023 at 10:48 AM Jarek Potiuk <[email protected]> wrote:
> >
> > > I think it will be tricky to surface to the user all the reasons why a task is not run, but surfacing them is indeed a good idea. Currently this is only done by this FAQ entry, which lists possible reasons:
> > >
> > > https://airflow.apache.org/docs/apache-airflow/stable/faq.html#why-is-task-not-getting-scheduled
> > >
> > > I believe it is not a complete list, given the number of features implemented since the FAQ was written.
> > >
> > > The open question, I think (and I agree with Jens's comment that this should be at the level of a small "AIP"), is which of those reasons we are able to detect deterministically. A bit of a problem here (as Jens also mentioned) is that in many cases the task in the DB is simply skipped by the scheduler because of some of the reasons explained in the FAQ (and some not explained).
> > > Sometimes the task is simply not scheduled because the scheduler has not yet had a chance to look at it, for performance reasons. That's why I believe we do not really need a new status, but rather more automated analysis in the "more details" tab, run when the user specifically asks for it. That could give the user possible reasons for this particular task. It would be much better to do this at the "individual" task level, when a user asks "why is this particular task not scheduled", because then you could query the DB and figure it out. Recording and determining the information upfront might not be possible for performance reasons, simply because the scheduler never really looks at all possible tasks (that would be prohibitively expensive). Instead it effectively finds a subset of "good candidates to schedule", which is a much smaller set to run queries for.
> > >
> > > Some of that could be determined deterministically today, for example "upstream tasks are still running". Some of it might be a little "racy" though: the system is continuously running, so whatever caused the task not to be scheduled in the previous scheduler pass might no longer be valid (but there might still be other reasons). I think the difficult ones might require additional information recorded by the scheduler (for example, the scheduler recording that it completed the last pass with DAG runs still remaining to look at, or that the number of tasks seen in the last pass reached the global concurrency limits).
> > > But some of this might not even be possible for the scheduler to determine without major query changes. For example, the scheduler runs its query with the pool size included: the pool query simply selects "pool size" eligible tasks, so you have no idea whether more tasks were excluded from the result (nor which tasks they were). This is where looking at individual tasks and working "backwards" to guess the reason might be needed. But possibly it could be helped by some extra information stored by the scheduler.
> > >
> > > I think we will not have a complete and fully accurate picture, but I think we could iteratively make it better and better.
> > >
> > > J
> > >
> > > On Mon, Oct 16, 2023 at 11:55 PM Oliveira, Niko <[email protected]> wrote:
> > >
> > > > I really like this idea as well! One of _the most common_ questions I get from people managing an Airflow env is "Why is my task stuck in state X". Anything we can do to make that more discoverable and user friendly, especially in the UI instead of (or in addition to) logs, would be fantastic!
> > > >
> > > > Thanks to Jens for having a think and pointing out a lot of the implications. I agree a quick AIP might be nice for this one.
> > > >
> > > > Cheers,
> > > > Niko
> > > >
> > > > ________________________________
> > > > From: Scheffler Jens (XC-DX/ETV5) <[email protected]>
> > > > Sent: Thursday, September 28, 2023 10:36:00 PM
> > > > To: [email protected]
> > > > Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] The "no_status" state
> > > >
> > > > Hi Ryan,
> > > >
> > > > I really like the idea of exposing some more scheduler details. More transparency about scheduling in the UI would (1) help users see and understand what is going on and (2) reduce the need to crawl through logs and raise support tickets when a status looks "strange". I often see this as a problem too. It also sometimes generates a bit of "mistrust" in the scheduler's stability.
> > > >
> > > > Regarding scheduler "overhead": as long as we are not making a "full scan" just to ensure that each and every task is always up to date (today the scheduler stops processing after enough tasks have been processed in a loop or once scheduling limits are reached), this is OK for me, and on the code side it does not seem to be much overhead.
> > > > On the other hand, I am a bit afraid that very frequent updates would need to happen on the DB, as another state would need to be written, so more DB round trips would be needed. This might hurt performance for large DAGs or cases where DAGs are scheduled. So at the least, the state should be written to the DB only when it has changed, to keep the performance impact minimal.
> > > >
> > > > Regarding naming, I still think "no status" is good for indicating that the scheduler did not digest anything; maybe the task was never looked at because the scheduler really is stuck or too busy to get there.
> > > > I would propose that if the scheduler passes over a task and decides that it is not ready to be scheduled, an additional state, e.g. "not_ready", be added to the state model between "none" and "scheduled".
> > > >
> > > > Finally, on the other hand, I am not 100% sure that adding another state to the model will help in the use case you describe. You might still need to scratch your head for a while when the UI shows a DAG as "stuck", until you work through all the options you have configured. Exposing "why is it stuck" in a user-friendly manner might be another level of complexity in this case.
> > > >
> > > > As a change to the state model might touch a lot of code and a longer discussion might be needed, should an AIP be raised for this? There might be many more dependencies (external, providers?) that would need to adjust to the state model.
> > > >
> > > > Best regards
> > > >
> > > > Jens Scheffler
> > > >
> > > > From: Ryan Hatter <[email protected]>
> > > > Sent: Thursday, September 28, 2023 23:59
> > > > To: [email protected]
> > > > Subject: The "no_status" state
> > > >
> > > > Over the last couple of weeks I've come across a rather tricky problem a few times.
> > > > One DAG run gets "stuck" in the queued state, while subsequent DAG runs are stuck in running (screenshot below). One of these issues was caused by `max_active_runs` being met when a task instance from a previous DAG run was cleared, and one of the tasks had `depends_on_past=True`. The DAG run stayed stuck in queued in perpetuity, until it was realized that the task that wasn't getting scheduled needed the failed task in the preceding DAG run to be re-run; that in turn kept the running DAG runs stuck in running. All of this caused quite a bit of confusion and stress.
> > > >
> > > > Given that Airflow is pretty burnt out on task instance states and colors, I propose replacing "no_status" with "dependencies_not_met" and surfacing dependencies in the grid view instead of forcing users to already know where to look (i.e. "more details" task instance details). Now that I've typed it out, I'm not sure there is a reason to keep the "more details" button rather than just laying out all of a task instance's details in the grid view, similar to how the graph and code views are now included in the grid view.
> > > >
> > > > Anyway, I wanted to solicit feedback before I open an issue / start work on this.
> > > >
> > > > [cid:ii_ln3phzoe0]
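
For readers trying to picture the deadlock described in the original message, here is a minimal sketch in plain Python. It is a toy model and not Airflow code: `ToyRun` and `unmet_reasons` are hypothetical names invented for illustration, each DAG run is assumed to hold a single task, and only two rules are modelled (`max_active_runs` and `depends_on_past`). It works "backwards" from one task instance to a list of unmet dependencies, roughly the kind of information the thread suggests surfacing in the grid view.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyRun:
    """One DAG run holding a single task instance (toy model, not Airflow's API)."""
    run_id: int
    run_state: str             # "queued", "running" or "success"
    task_state: Optional[str]  # None stands in for "no_status"


def unmet_reasons(runs: List[ToyRun], run_id: int,
                  max_active_runs: int, depends_on_past: bool) -> List[str]:
    """Work backwards from one task instance and list why it cannot start."""
    by_id = {r.run_id: r for r in runs}
    run = by_id[run_id]
    reasons = []

    # A queued run cannot start while max_active_runs runs are already running.
    active = sum(1 for r in runs if r.run_state == "running")
    if run.run_state == "queued" and active >= max_active_runs:
        reasons.append(
            f"run {run_id} is queued: max_active_runs={max_active_runs} already reached"
        )

    # With depends_on_past, the same task in the previous run must have
    # finished as success or skipped before this one may start.
    prev = by_id.get(run_id - 1)
    if depends_on_past and prev is not None and prev.task_state not in ("success", "skipped"):
        reasons.append(
            f"depends_on_past: task in run {prev.run_id} is {prev.task_state or 'no_status'}"
        )
    return reasons


# Run 1 had its task cleared and is queued again; runs 2 and 3 hold both active slots.
runs = [
    ToyRun(1, "queued", None),
    ToyRun(2, "running", None),
    ToyRun(3, "running", None),
]
for r in runs:
    print(r.run_id, unmet_reasons(runs, r.run_id, max_active_runs=2, depends_on_past=True))
```

In this toy setup every run reports at least one unmet dependency: run 1 waits for a free run slot, while runs 2 and 3 occupy the slots and each wait on the task in the run before them. Nothing can progress until someone intervenes, which matches the "stuck in queued" and "stuck in running" symptoms described above.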
