Yep. I am with Ash when it comes to workload on executors. This has
long been discussed and theorised that we should do it this way, that
would be great time to unify that and make it future-proof for other
types of workload we might and could have if this is
"first-class-citizen". That would simplify a lot of the subtle issues
around where and when things get executed.

J.

On Fri, Mar 17, 2023 at 12:10 PM Ash Berlin-Taylor <[email protected]> wrote:
>
> Hi Sung,
>
> Thanks for collecting all this one place, it's a great summary of the problem 
> with SLAs. After 2.6 was out one of the next things we (with my "one of the 
> Astronomer people who work on Airflow" hat on here) wanted to look at SLAs in 
> general, and specifically extending Dataset's to have SLAs too, not just 
> DAGs, so it's great to see we're not alone in wanting to fix this!
> We do also have a question to answer of if this is a breaking change or not? 
> I don't think firing the SLA miss "on time" (rather than later) is breaking, 
> but https://m.xkcd.com/1172/ -- I still vote we make this change.
> I agree that Idea 1 and 2 are't ideal for the reasons you mention.
> I think a single DAG-level SLA defintion makes sense, but then I've never 
> personally used SLAs (because of how broken I felt they were!)
> Speaking to the proposed/sample changes however I think we can do it 
> differently. So as I hinted at in some of the comments I left on the doc, I 
> think the existing callback mechanism in the Executor could be extended here. 
> For instance your CallbackParams class is very similar to the 
> DagCallbackRequest class. If we wanted a state on there we could add it to 
> that table, rather than needing a new DB table.
> As background: Previously (changed sometime around 2.3? 2.4?) DAG level 
> callbacks were sent directly to the dag processor from the scheduler. However 
> with the work on AIP-44 and the ability to separate out the dag processor 
> from the scheduler and run more than one of them that was changed.
> What changed along with the change was that the Scheduler now sends callback 
> events to the Executor, and (for all current executor classes) the executor 
> writes them to the DB such that the DagFileProcessor picks them up.
> I also think it's worth splitting the AIP up in to two -- one is to fix the 
> SLA behaviour, but to leave the mechanism callbacks are processed (via 
> Executor to DB to DagFileProcessor) unchanged. And then as a second/parallel 
> AIP would be to introduce the concept of "workloads" to executors.
> Right now the only workload that the Executor and workers have is "run this 
> task", but I have thought of extending it to have "run this 
> SLA/DAG-level-failure/success callback" as a new workload type, which would 
> address one of the points in your writeup about extra re-parse of DAG files 
> (though it's worth nothing that the DAG file is executed/parsed every time a 
> task runs, the result is just not stored in the DB for those)
> Thoughts?
> Ash
>
> On Mar 17 2023, at 5:30 am, Jarek Potiuk <[email protected]> wrote:
> > Thanks Sung Yun,
> >
> > Fantastic document!
> > This is one incredibly well thought out document and proposal. I like
> > the detailed analysis of the problem (finally it answers very clearly
> > and explicitly why the current SLA mechanism is essentially broken).
> > It also very well reflects the changes in our approach with
> > dataset-driven scheduling where we promote using "micro-pipeline"
> > approach, where you split more monolithic DAGs to a number of smaller
> > dags linked by dataset dependencies. Before that, DAG-level SLAs would
> > not be as complete. Simply without the "micro-pipeline" approach,
> > DAG-level SLAs would be far too coarse-grained and task-level SLAs
> > were too fine-grained. But when you pair dataset-driven scheduling and
> > DAG level SLA, you generally leave the decision to the user what
> > granularity they would like to have for those. And this is great.
> >
> > I think the doc is very close to be AIP-ready, I think what I miss are
> > few things only:
> >
> > * an example and proposal on how the definition of the "new" SLA would
> > be different from the current one and I think this proposal should
> > already include immediate deprecation of the old SLA mechanism (for
> > the reasons described in the docs).
> > * also you mention that the current SLA approach can be replaced with
> > time Triggers. Few examples of how you could do that, showing a path
> > for the users on how they could migrate their current sla approach
> > would be incredibly useful and it would show that the deprecation path
> > is viable (and easy) even if someone would like to keep task level SLA
> > implemented this way and justify the immediate deprecation.
> >
> > J.
> > On Fri, Mar 17, 2023 at 5:22 AM Sung Yun <[email protected]> wrote:
> > >
> > > Dear Airflow community,
> > >
> > > I've been looking to make a new SLA proposal by crowdsourcing all the 
> > > known
> > > issues and proposals that have been raised over the last couple of years.
> > > In summary, I believe that defining SLAs at the task level puts too much
> > > work on the scheduler, if we were to solve all of the known issues. Given
> > > this clear downside, task-level callbacks may not be strictly necessary,
> > > especially with tools like DateTimeTriggers that can substitute the
> > > function of task-level SLA callbacks.
> > >
> > > On the other hand, I believe that SLAs defined at the DAG level will be
> > > extremely useful as a 'catch-all' alert in case anything goes wrong in a
> > > DAG (e.g. significant delays in task execution, tasks stuck in queued
> > > state). In addition, SLAs defined at the DAG level will be incredibly
> > > lightweight to detect and execute callbacks for. Hence, I'd like to 
> > > propose
> > > that SLAs be defined as a DAG-level attribute. If you are interested in
> > > this feature, please take a look at my detailed proposal and example
> > > implementations in the below Google Doc. Please feel free to reply to this
> > > message, or simply leave comments in the Doc.
> > >
> > > https://docs.google.com/document/d/1drNaYmAy6GqC4WGGn4MNt6VqbOwVNm7jPfmr5Pc52AU/edit?usp=sharing
> > >
> > > Thank you!
> > >
> > > Github: syun64 <https://github.com/syun64>
> > > --
> > > Sung Yun
> > > Cornell '20
> > > Master of Engineering in Computer Science
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to