Re: [DISCUSS] Mechanism of SLA

Sung Yun Tue, 19 Sep 2023 04:44:50 -0700

Hi Daniel,

Thank you for following up with the assessment. That’s an incredibly valuable 
data point.

I know we may have some opportunity to talk about this topic more at the summit 
this week, but just for the sake of offering a reference of the other 
perspective, I would like to share this blog post where another user describes 
the SLA feature with the following words: ‘In Airflow’s context, SLA can be 
seen as “for how long your DAG can run before you need to do something about 
it”’: https://poatek.com/2022/10/19/how-to-fix-airflow-sla/

This highlights the desire to use the SLA feature for delay detection: a soft 
timeout that does not kill the task and simply executes the defined callback.

If what you say is true, and there are indeed folks hoping to use the SLA 
defined in Airflow to keep an accurate count of SLA misses because they don’t 
want to track it outside of Airflow, then I think it will be important for us 
to discuss at length and decide which of these two motivations we are 
prioritizing when finalizing the design of the next ‘SLA Feature’. Again, I 
feel the design becomes much simpler if we drop the sense of urgency when 
designing for accuracy, and vice versa.

Or maybe we are better off having two separate implementations for the two - 
one that prioritizes urgency, and one that prioritizes accuracy in hindsight. 
We can also continue to discuss at length what makes it difficult to achieve 
both within a single feature.
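
To make the two motivations concrete, here is a minimal sketch in plain 
Python. It is not Airflow code: the function names and arguments are 
hypothetical and chosen only for illustration. The first check measures from 
the actual start time (the soft timeout reading), the second from a "should 
have run by" time (the accuracy reading).

```python
from datetime import datetime, timedelta
from typing import Optional


def soft_timeout_exceeded(actual_start: datetime, now: datetime,
                          limit: timedelta) -> bool:
    """Hypothetical check: has the run been going longer than `limit`?

    Measured from the actual start time, so a run that started late
    is not flagged until it has itself run for too long.
    """
    return now - actual_start > limit


def deadline_missed(expected_by: datetime, finished_at: Optional[datetime],
                    now: datetime) -> bool:
    """Hypothetical check: did the run miss its "should have run by" time?

    Measured against a fixed deadline, regardless of when (or whether)
    the run actually started.
    """
    if finished_at is not None:
        return finished_at > expected_by
    # Not finished yet: it is a miss as soon as the deadline has passed.
    return now > expected_by


# A run that started late (05:50), is still running at 06:10,
# with a 30 minute limit and a 06:00 deadline.
start = datetime(2023, 9, 19, 5, 50)
now = datetime(2023, 9, 19, 6, 10)
deadline = datetime(2023, 9, 19, 6, 0)

timeout_flag = soft_timeout_exceeded(start, now, timedelta(minutes=30))
deadline_flag = deadline_missed(deadline, None, now)
```

In this example the two checks disagree: the run has only been going for 20 
minutes, so the soft timeout has not fired, yet the 06:00 deadline has 
already passed. That disagreement is the gap between the two readings of 
"SLA".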


> On Sep 19, 2023, at 1:19 AM, Daniel Standish 
> <daniel.stand...@astronomer.io.invalid> wrote:
> 
> I was able to chat with a couple folks about this. Small sample, but the
> sentiment was, "this is just a timeout".  In other words, if we're going to
> call this SLA, we really ought to evaluate against the "this thing should
> have run by" time and not the actual start time.  And, ideally, we should
> also have a way to enforce "this should have run by X time daily" (for
> example) even when it's a dataset-triggered or API-triggered dag with *no*
> schedule.
> 
> Like I said, it's a small number of folks I've talked to, so I don't have
> overwhelming confidence about this assessment.  But I do think it's more
> likely than not that this would be the prevailing assessment were we somehow
> able to get better data on this.
