Hi Sung,

Thanks for collecting all this one place, it's a great summary of the problem 
with SLAs. After 2.6 was out one of the next things we (with my "one of the 
Astronomer people who work on Airflow" hat on here) wanted to look at SLAs in 
general, and specifically extending Dataset's to have SLAs too, not just DAGs, 
so it's great to see we're not alone in wanting to fix this!
We do also have a question to answer of if this is a breaking change or not? I 
don't think firing the SLA miss "on time" (rather than later) is breaking, but 
https://m.xkcd.com/1172/ -- I still vote we make this change.
I agree that Idea 1 and 2 are't ideal for the reasons you mention.
I think a single DAG-level SLA defintion makes sense, but then I've never 
personally used SLAs (because of how broken I felt they were!)
Speaking to the proposed/sample changes however I think we can do it 
differently. So as I hinted at in some of the comments I left on the doc, I 
think the existing callback mechanism in the Executor could be extended here. 
For instance your CallbackParams class is very similar to the 
DagCallbackRequest class. If we wanted a state on there we could add it to that 
table, rather than needing a new DB table.
As background: Previously (changed sometime around 2.3? 2.4?) DAG level 
callbacks were sent directly to the dag processor from the scheduler. However 
with the work on AIP-44 and the ability to separate out the dag processor from 
the scheduler and run more than one of them that was changed.
What changed along with the change was that the Scheduler now sends callback 
events to the Executor, and (for all current executor classes) the executor 
writes them to the DB such that the DagFileProcessor picks them up.
I also think it's worth splitting the AIP up in to two -- one is to fix the SLA 
behaviour, but to leave the mechanism callbacks are processed (via Executor to 
DB to DagFileProcessor) unchanged. And then as a second/parallel AIP would be 
to introduce the concept of "workloads" to executors.
Right now the only workload that the Executor and workers have is "run this 
task", but I have thought of extending it to have "run this 
SLA/DAG-level-failure/success callback" as a new workload type, which would 
address one of the points in your writeup about extra re-parse of DAG files 
(though it's worth nothing that the DAG file is executed/parsed every time a 
task runs, the result is just not stored in the DB for those)
Thoughts?
Ash

On Mar 17 2023, at 5:30 am, Jarek Potiuk <[email protected]> wrote:
> Thanks Sung Yun,
>
> Fantastic document!
> This is one incredibly well thought out document and proposal. I like
> the detailed analysis of the problem (finally it answers very clearly
> and explicitly why the current SLA mechanism is essentially broken).
> It also very well reflects the changes in our approach with
> dataset-driven scheduling where we promote using "micro-pipeline"
> approach, where you split more monolithic DAGs to a number of smaller
> dags linked by dataset dependencies. Before that, DAG-level SLAs would
> not be as complete. Simply without the "micro-pipeline" approach,
> DAG-level SLAs would be far too coarse-grained and task-level SLAs
> were too fine-grained. But when you pair dataset-driven scheduling and
> DAG level SLA, you generally leave the decision to the user what
> granularity they would like to have for those. And this is great.
>
> I think the doc is very close to be AIP-ready, I think what I miss are
> few things only:
>
> * an example and proposal on how the definition of the "new" SLA would
> be different from the current one and I think this proposal should
> already include immediate deprecation of the old SLA mechanism (for
> the reasons described in the docs).
> * also you mention that the current SLA approach can be replaced with
> time Triggers. Few examples of how you could do that, showing a path
> for the users on how they could migrate their current sla approach
> would be incredibly useful and it would show that the deprecation path
> is viable (and easy) even if someone would like to keep task level SLA
> implemented this way and justify the immediate deprecation.
>
> J.
> On Fri, Mar 17, 2023 at 5:22 AM Sung Yun <[email protected]> wrote:
> >
> > Dear Airflow community,
> >
> > I've been looking to make a new SLA proposal by crowdsourcing all the known
> > issues and proposals that have been raised over the last couple of years.
> > In summary, I believe that defining SLAs at the task level puts too much
> > work on the scheduler, if we were to solve all of the known issues. Given
> > this clear downside, task-level callbacks may not be strictly necessary,
> > especially with tools like DateTimeTriggers that can substitute the
> > function of task-level SLA callbacks.
> >
> > On the other hand, I believe that SLAs defined at the DAG level will be
> > extremely useful as a 'catch-all' alert in case anything goes wrong in a
> > DAG (e.g. significant delays in task execution, tasks stuck in queued
> > state). In addition, SLAs defined at the DAG level will be incredibly
> > lightweight to detect and execute callbacks for. Hence, I'd like to propose
> > that SLAs be defined as a DAG-level attribute. If you are interested in
> > this feature, please take a look at my detailed proposal and example
> > implementations in the below Google Doc. Please feel free to reply to this
> > message, or simply leave comments in the Doc.
> >
> > https://docs.google.com/document/d/1drNaYmAy6GqC4WGGn4MNt6VqbOwVNm7jPfmr5Pc52AU/edit?usp=sharing
> >
> > Thank you!
> >
> > Github: syun64 <https://github.com/syun64>
> > --
> > Sung Yun
> > Cornell '20
> > Master of Engineering in Computer Science
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>

Reply via email to