Re: The "no_status" state

Ryan Hatter Tue, 28 Nov 2023 14:50:33 -0800

I think a good first step that wouldn't require an AIP would be surfacing a
task instance's dependencies in the task instance details "sub-view" of the
grid view. I've created an issue here:
https://github.com/apache/airflow/issues/35935
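
To make the idea concrete, here is a minimal sketch of what such a dependency readout might compute for a task instance with no state yet. It is plain Python, not Airflow code: `TaskSnapshot`, `unmet_dependencies`, and the field names are hypothetical, and it assumes the default "all upstream tasks succeeded" trigger rule plus a simplified `depends_on_past` check.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical, simplified model: none of these names are Airflow APIs.
@dataclass
class TaskSnapshot:
    task_id: str
    state: Optional[str]                       # None stands in for "no_status"
    depends_on_past: bool = False
    previous_run_state: Optional[str] = None   # same task in the preceding DAG run
    upstream_states: Dict[str, Optional[str]] = field(default_factory=dict)


def unmet_dependencies(ti: TaskSnapshot) -> List[str]:
    """Return human-readable reasons why a stateless task cannot be scheduled."""
    reasons: List[str] = []
    if ti.state is not None:
        return reasons  # only diagnose tasks that have no state yet
    # Assumption: default trigger rule, every upstream task must have succeeded.
    for upstream_id, state in sorted(ti.upstream_states.items()):
        if state != "success":
            reasons.append(f"upstream task '{upstream_id}' is in state '{state}'")
    # Assumption: depends_on_past is satisfied by a successful or skipped previous run.
    if ti.depends_on_past and ti.previous_run_state not in ("success", "skipped"):
        reasons.append(
            "depends_on_past=True and the previous run of this task "
            f"is in state '{ti.previous_run_state}'"
        )
    return reasons


# The situation from the original report: upstream is fine, but the same task
# failed in the preceding DAG run.
stuck = TaskSnapshot(
    task_id="load",
    state=None,
    depends_on_past=True,
    previous_run_state="failed",
    upstream_states={"extract": "success"},
)
for reason in unmet_dependencies(stuck):
    print(reason)
```

Showing a list like this next to the task instance would point straight at the failed task in the preceding run, rather than leaving the user to guess.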


On Fri, Oct 20, 2023 at 9:35 AM Pierre Jeambrun <[email protected]>
wrote:

> Seems like a good idea. Some kind of "task diagnosis" to give users more
> context when the state is not settled.
>
> Happy to help on that one as well. I also think that a small AIP is
> required, as the scope of change could be substantial.
>
> Best regards,
> Pierre
>
> On Thu, Oct 19, 2023 at 17:05, Brent Bovenzi <[email protected]> wrote:
>
> > As Jarek said, some of these dependencies might take a lot of work to
> > surface correctly. But I am happy to improve the grid and graph to show
> > more information, like integrating rendered_templates and more details
> > into the Grid view. Would you mind opening a GitHub issue for some of
> > those smaller tasks so I don't forget to do them?
> >
> > I am also playing with some ways to show datasets and other external
> > dependencies better in grid/graph view too.
> >
> > On Thu, Oct 19, 2023 at 10:48 AM Jarek Potiuk <[email protected]> wrote:
> >
> > > I think it will be tricky to surface to the user all the reasons why a
> > > task is not run. But surfacing them is indeed a good idea. Currently
> > > this is only done by this FAQ entry, which shows possible reasons:
> > >
> > > https://airflow.apache.org/docs/apache-airflow/stable/faq.html#why-is-task-not-getting-scheduled
> > >
> > > I believe this is not a complete list, given the number of features
> > > implemented since this FAQ was written.
> > >
> > > The open question, I think (and I agree with Jens' comment that this
> > > should be at a small "AIP" level), is which of those reasons we are able
> > > to detect deterministically. A bit of a problem here (as Jens also
> > > mentioned) is that in many cases the task in the DB is simply skipped
> > > during scheduling because of some of the reasons explained in the FAQ
> > > (and some not explained). Sometimes the task is simply not scheduled
> > > because the scheduler has not yet had a chance to look at it, for
> > > performance reasons. That's why I believe we do not really need a new
> > > status, but rather more automated analysis in the "more details" tab,
> > > run when the user specifically asks for it. That could give the user
> > > possible reasons for this particular task. It would be much better to do
> > > this at the "individual" task level, when a user asks "why is this
> > > particular task not scheduled", because then you could query the DB and
> > > figure it out. Recording and determining the information upfront might
> > > not be possible for performance reasons, simply because the scheduler
> > > never really looks at all possible tasks (that would be prohibitively
> > > expensive). Instead it effectively finds a subset, the "good candidates
> > > to schedule", which is a much smaller set to run queries for.
> > >
> > > Some of that could be determined deterministically today, for example
> > > "upstream tasks are still running". Some of it might be a little "racy"
> > > though: the system is continuously running, so what caused the task to
> > > not be scheduled in the previous pass of the scheduler might not be
> > > valid any more (but there might still be other reasons). I think the
> > > difficult ones might require additional information recorded by the
> > > scheduler (for example, the scheduler recording the fact that it
> > > completed the last pass with DAG runs still remaining to look at, or the
> > > fact that the number of tasks seen in the last pass reached the global
> > > concurrency limits). But some of this might not even be possible for the
> > > scheduler to determine without some major query changes. For example,
> > > the scheduler runs a query that includes pool sizes: the way the pool
> > > query is done, you simply select "pool size" eligible tasks, and you
> > > have no idea whether more tasks were excluded from the result (nor which
> > > tasks they were). This is where looking at individual tasks and working
> > > "backwards", guessing why, might be needed. But possibly it could be
> > > helped with some extra information stored by the scheduler.
> > >
> > > I think we will not have a complete and fully accurate picture, but I
> > > think iteratively we could get this better and better.
> > >
> > > J
> > >
> > >
> > > On Mon, Oct 16, 2023 at 11:55 PM Oliveira, Niko
> > > <[email protected]>
> > > wrote:
> > >
> > > > I really like this idea as well! One of _the most common_ questions I
> > > > get from people managing an Airflow env is "Why is my task stuck in
> > > > state X". Anything we can do to make that more discoverable and
> > > > user-friendly, especially in the UI instead of (or in addition to)
> > > > logs, would be fantastic!
> > > >
> > > > Thanks to Jens for having a think and pointing out a lot of the
> > > > implications, I agree a quick AIP might be nice for this one.
> > > >
> > > > Cheers,
> > > > Niko
> > > >
> > > > ________________________________
> > > > From: Scheffler Jens (XC-DX/ETV5) <[email protected]>
> > > > Sent: Thursday, September 28, 2023 10:36:00 PM
> > > > To: [email protected]
> > > > Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] The "no_status" state
> > > >
> > > > Hi Ryan,
> > > >
> > > > I really like the idea of exposing some more scheduler details. More
> > > > transparency in scheduling, also in the UI, would (1) help users see
> > > > and understand what is going on and (2) reduce the need to crawl
> > > > through logs and raise support tickets when a status looks “strange”.
> > > > I also often see this as a problem. It sometimes also generates a bit
> > > > of “mistrust” in the scheduler's stability.
> > > >
> > > > From the point of view of scheduler “overhead”, I assume that as long
> > > > as we are not making a “full scan” just to ensure that each and every
> > > > task is always up-to-date (the scheduler today stops processing after
> > > > enough tasks have been processed in a loop or if scheduling limits are
> > > > reached), this is OK for me, and on the code side it does not seem to
> > > > be much overhead.
> > > > On the other hand, I am a bit afraid that very frequent updates would
> > > > need to happen on the DB, as another state would need to be written,
> > > > so more DB round trips are needed. This might hit performance for
> > > > large DAGs or in cases where many DAGs are scheduled. So at the very
> > > > least it would need to update the state in the DB only if it changed,
> > > > to keep the performance impact minimal.
> > > >
> > > > On naming, I still think “no status” is good to indicate that the
> > > > scheduler did not digest anything; maybe the task was never looked at
> > > > because the scheduler is actually stuck or too busy to get there. I
> > > > would propose that if the scheduler passes over a task and decides
> > > > that it is not ready to schedule, there be an additional state, called
> > > > e.g. “not_ready”, in the state model between “none” and “scheduled”.
> > > >
> > > > Finally, on the other hand, I am not sure whether adding another
> > > > state to the model will 100% help in the use case you described. You
> > > > might still need to scratch your head for a while when looking at the
> > > > UI and seeing that a DAG is “stuck”, until you realize all the options
> > > > you have configured. Exposing “why is it stuck” in a user-friendly
> > > > manner might be another level of complexity in this case.
> > > >
> > > > As the state model might touch a lot of code and a longer discussion
> > > > might be needed, would we need to raise an AIP for this? There might
> > > > be many more (external, provider??) dependencies that need adjusting
> > > > when the state model changes.
> > > >
> > > > Best regards
> > > >
> > > > Jens Scheffler
> > > >
> > > > 
> > > > From: Ryan Hatter <[email protected]>
> > > > Sent: Thursday, September 28, 2023 23:59
> > > > To: [email protected]
> > > > Subject: The "no_status" state
> > > >
> > > > Over the last couple of weeks I've come across a rather tricky problem
> > > > a few times. One DAG run gets "stuck" in the queued state, while
> > > > subsequent DAG runs are stuck running (screenshot below). One of these
> > > > issues was caused by `max_active_runs` being met when a task instance
> > > > from a previous DAG run was cleared, and one of the tasks had
> > > > `depends_on_past=True`. This caused the DAG run to be stuck in queued
> > > > in perpetuity, until it was realized that the task that wasn't getting
> > > > scheduled needed the failed task in the preceding DAG run to be
> > > > re-run. That in turn caused the running DAG runs to stay stuck in
> > > > running, which caused quite a bit of confusion and stress.
> > > >
> > > > Given that Airflow is pretty burnt out on task instance states and
> > > > colors, I propose replacing "no_status" with "dependencies_not_met"
> > > > and surfacing dependencies in the grid view instead of forcing users
> > > > to already know where to look (i.e. the "more details" task instance
> > > > details). Now that I've typed it out, I'm not sure there is a reason
> > > > to keep the "more details" button rather than just laying out all of a
> > > > task instance's details in the grid view, similar to how the graph and
> > > > code views are now included in the grid view.
> > > >
> > > > Anyway, I wanted to solicit feedback before I open an issue / start
> > > > work on this.
> > > >
> > > > [Inline screenshot not preserved]
> > > >
> > >
> >
>
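
On Jarek's point about the pool query: a small sketch of why a size-limited selection loses the "why not" information, and how a per-task, on-demand check could work it out backwards. This is a toy model in plain Python, not the scheduler's actual query; `select_for_pool` and `explain_backwards` are hypothetical names, and the candidate list is assumed to be already ordered by priority.

```python
from typing import List, Optional

# Hypothetical toy model of a pool-limited selection; not Airflow's real query.
def select_for_pool(candidates: List[str], open_slots: int) -> List[str]:
    """Mimic a LIMIT-style query: only the first `open_slots` tasks come back.

    The caller never learns whether (or which) tasks were cut off.
    """
    return candidates[: max(open_slots, 0)]


def explain_backwards(
    task_id: str, candidates: List[str], open_slots: int
) -> Optional[str]:
    """On-demand check for one task: was it cut off by the pool limit?"""
    if task_id not in candidates:
        return None  # not eligible at all, so some other reason applies
    ahead = candidates.index(task_id)  # number of eligible tasks before this one
    if ahead >= open_slots:
        return (
            f"pool has {open_slots} open slot(s) and {ahead} eligible "
            f"task(s) are ahead of '{task_id}'"
        )
    return None  # the pool limit is not what is holding this task back


eligible = ["a", "b", "c"]
print(select_for_pool(eligible, 2))         # the scheduler only ever sees these
print(explain_backwards("c", eligible, 2))  # asked for one task, after the fact
```

The second call is only cheap because it is asked for one task at a time; doing it for every task on every pass is the expensive part Jarek warns about, and the answer can be stale by the time it is shown.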
