Hi Ash and Jarek,

Thank you both for taking a look at the proposal, and even more so for the 
supportive responses! I'm very excited to finalize the implementation details 
and formalize this proposal.

Ash - Thank you for the catch! I must have missed this detail because I'm not 
using standalone DAG processors. It looks like callbacks are sent through the 
PipeCallbackSink when we aren't running standalone DAG processors, and are 
materialized as DB objects when we are - but from what I can tell, the 
abstraction for sending and retrieving the callbacks is similar in both 
scenarios.
And you are right - I can do away with CallbackParams. I originally added it to 
organize the parameters, but I could instead create a DagCallbackRequest and 
update the existing attributes on the callback as we evaluate the SLA miss and 
the success/failure state of the DAG. (I've made this update in the proposal.)

Jarek - thank you for the suggestions; they really strengthen the proposal. 
Regarding your first point - how does deprecating an existing feature work in 
Airflow? Do we give a grace period of a minor or major release during which we 
continue to support the feature after marking it for deprecation, or do we 
remove it right away in the next major/minor release?
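
On your second point, about migrating existing task-level SLAs to time 
Triggers: I'll add concrete examples to the doc, but the rough shape I have in 
mind is something like the sketch below. It uses the existing 
DateTimeSensorAsync (which defers on a DateTimeTrigger, so no worker slot is 
held while waiting); the task names and the 30-minute deadline are made up, and 
the alerting action is just a placeholder for whatever the user's current 
sla_miss_callback does.

    import pendulum

    from airflow.decorators import dag, task
    from airflow.sensors.date_time import DateTimeSensorAsync
    from airflow.utils.state import TaskInstanceState

    @dag(schedule="@hourly", start_date=pendulum.datetime(2023, 1, 1, tz="UTC"), catchup=False)
    def sla_watch_example():

        @task
        def slow_business_task():
            ...  # previously declared with sla=timedelta(minutes=30)

        # Defer until the per-run deadline (data_interval_end + 30 minutes).
        wait_for_deadline = DateTimeSensorAsync(
            task_id="wait_for_deadline",
            target_time="{{ data_interval_end + macros.timedelta(minutes=30) }}",
        )

        @task
        def alert_if_not_done(**context):
            # Look up the watched task's state in the same DagRun and alert if
            # it hasn't succeeded by the deadline (stand-in for the old
            # sla_miss_callback logic).
            ti = context["ti"]
            watched = ti.get_dagrun().get_task_instance("slow_business_task")
            if watched is None or watched.state != TaskInstanceState.SUCCESS:
                raise RuntimeError(
                    f"slow_business_task missed its 30-minute SLA "
                    f"(state={watched.state if watched else 'unknown'})"
                )

        slow_business_task()
        wait_for_deadline >> alert_if_not_done()

    sla_watch_example()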

Also, do you folks have any thoughts on the last point listed in the 'Other 
Considerations' section? I've highlighted it to make it easier to find, and I 
think it's an edge case worth thinking about.

------------
Sung Yun

sent from my work email


From: [email protected] At: 03/17/23 07:26:04 UTC-4:00To:  [email protected]
Subject: Re: [DISCUSS] Airflow - New SLA AIP

Yep. I am with Ash when it comes to workloads on executors. It has long been
discussed and theorised that we should do it this way, and this would be a
great time to unify that and make it future-proof for other types of workload
we might and could have once this is a "first-class citizen". That would
simplify a lot of the subtle issues around where and when things get executed.

J.

On Fri, Mar 17, 2023 at 12:10 PM Ash Berlin-Taylor <[email protected]> wrote:
>
> Hi Sung,
>
> Thanks for collecting all this in one place, it's a great summary of the
> problem with SLAs. After 2.6 was out, one of the next things we (with my "one
> of the Astronomer people who work on Airflow" hat on here) wanted to look at
> was SLAs in general, and specifically extending Datasets to have SLAs too,
> not just DAGs, so it's great to see we're not alone in wanting to fix this!
> We do also have a question to answer of whether this is a breaking change or
> not. I don't think firing the SLA miss "on time" (rather than later) is
> breaking, but https://m.xkcd.com/1172/ -- I still vote we make this change.
> I agree that Ideas 1 and 2 aren't ideal for the reasons you mention.
> I think a single DAG-level SLA definition makes sense, but then I've never
> personally used SLAs (because of how broken I felt they were!)
> Speaking to the proposed/sample changes, however, I think we can do it
> differently. As I hinted at in some of the comments I left on the doc, I
> think the existing callback mechanism in the Executor could be extended here.
> For instance, your CallbackParams class is very similar to the
> DagCallbackRequest class. If we wanted a state on there, we could add it to
> that table rather than needing a new DB table.
> As background: previously (this changed sometime around 2.3? 2.4?), DAG-level
> callbacks were sent directly to the dag processor from the scheduler.
> However, with the work on AIP-44 and the ability to separate the dag
> processor from the scheduler and run more than one of them, that was changed.
> What changed is that the Scheduler now sends callback events to the Executor,
> and (for all current executor classes) the executor writes them to the DB so
> that the DagFileProcessor picks them up.
> I also think it's worth splitting the AIP up into two -- one to fix the SLA
> behaviour while leaving the mechanism by which callbacks are processed (via
> Executor to DB to DagFileProcessor) unchanged, and then a second/parallel AIP
> to introduce the concept of "workloads" to executors.
> Right now the only workload that the Executor and workers have is "run this
> task", but I have thought of extending it to have "run this
> SLA/DAG-level-failure/success callback" as a new workload type, which would
> address one of the points in your writeup about the extra re-parsing of DAG
> files (though it's worth noting that the DAG file is executed/parsed every
> time a task runs, the result is just not stored in the DB for those).
> Thoughts?
> Ash
>
> On Mar 17 2023, at 5:30 am, Jarek Potiuk <[email protected]> wrote:
> > Thanks Sung Yun,
> >
> > Fantastic document!
> > This is one incredibly well thought out document and proposal. I like
> > the detailed analysis of the problem (finally it answers very clearly
> > and explicitly why the current SLA mechanism is essentially broken).
> > It also very well reflects the changes in our approach with
> > dataset-driven scheduling, where we promote a "micro-pipeline"
> > approach: you split more monolithic DAGs into a number of smaller
> > DAGs linked by dataset dependencies. Before that, DAG-level SLAs would
> > not have been as complete. Simply put, without the "micro-pipeline"
> > approach, DAG-level SLAs would be far too coarse-grained and
> > task-level SLAs too fine-grained. But when you pair dataset-driven
> > scheduling and DAG-level SLAs, you generally leave it to the user to
> > decide what granularity they would like. And this is great.
> >
> > I think the doc is very close to being AIP-ready; there are only a few
> > things I still miss:
> >
> > * an example and proposal on how the definition of the "new" SLA would
> > be different from the current one and I think this proposal should
> > already include immediate deprecation of the old SLA mechanism (for
> > the reasons described in the docs).
> > * also, you mention that the current SLA approach can be replaced with
> > time Triggers. A few examples of how you could do that, showing users a
> > path to migrate their current SLA approach, would be incredibly useful;
> > it would show that the deprecation path is viable (and easy) even if
> > someone would like to keep task-level SLAs implemented this way, and it
> > would justify the immediate deprecation.
> >
> > J.
> > On Fri, Mar 17, 2023 at 5:22 AM Sung Yun <[email protected]> wrote:
> > >
> > > Dear Airflow community,
> > >
> > > I've been looking to make a new SLA proposal by crowdsourcing all the known
> > > issues and proposals that have been raised over the last couple of years.
> > > In summary, I believe that defining SLAs at the task level puts too much
> > > work on the scheduler, if we were to solve all of the known issues. Given
> > > this clear downside, task-level callbacks may not be strictly necessary,
> > > especially with tools like DateTimeTriggers that can substitute the
> > > function of task-level SLA callbacks.
> > >
> > > On the other hand, I believe that SLAs defined at the DAG level will be
> > > extremely useful as a 'catch-all' alert in case anything goes wrong in a
> > > DAG (e.g. significant delays in task execution, tasks stuck in queued
> > > state). In addition, SLAs defined at the DAG level will be incredibly
> > > lightweight to detect and execute callbacks for. Hence, I'd like to propose
> > > that SLAs be defined as a DAG-level attribute. If you are interested in
> > > this feature, please take a look at my detailed proposal and example
> > > implementations in the below Google Doc. Please feel free to reply to this
> > > message, or simply leave comments in the Doc.
> > >
> > > https://docs.google.com/document/d/1drNaYmAy6GqC4WGGn4MNt6VqbOwVNm7jPfmr5Pc52AU/edit?usp=sharing
> > >
> > > Thank you!
> > >
> > > Github: syun64 <https://github.com/syun64>
> > > --
> > > Sung Yun
> > > Cornell '20
> > > Master of Engineering in Computer Science
> >
> >
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

