Re: [DISCUSS] Mechanism of SLA

Daniel Standish Tue, 12 Sep 2023 11:26:05 -0700

Some questions for you Sung.

I tried to understand why we need to remove behavior 3 as discussed in the
AIP:


*[remove]* Task-level SLA measured from DAG-run scheduled start time


I'm just a little concerned that removing this would be a mistake because,
in my mind, part of the essence of an SLA is an agreement about when your
data should "be there" (or the "thing" done, etc.).

The behavior replacing it seems a lot like the existing execution_timeout
behavior:

> *[add]* Finally, I propose that we implement a new Task-level SLA feature,
> that is contained within the bounds of the task's lifetime, and is measured
> within the task itself.


[1] Do I understand that correctly?  If so, the question I would have is,
if we can already rely on execution timeout to tell us when an individual
task is taking too long, why do we need an SLA to tell us the same thing?

[2] Can you also clarify how "dag slas" are defined?  (behavior 1 in the
AIP).  Is the duration to be applied relative to scheduled time or start
time?

[3] I saw some discussion of using the triggerer component somehow in
this.  Is that a thing?  Can you share notes on implementation?  I looked
in the AIP but did not find anything.

[4] The fact that task sla cannot be used with deferrable operators seems
problematic to me.  Users tend to expect (rightly or not) something like
"parity" between deferrable and non-deferrable tasks.  SLAs would seem to
fall into a category where this seems like a reasonable expectation.
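
To make the concern in [1] concrete, here is a rough illustration of the
two ways of measuring.  It is plain Python, not Airflow code; the function
names and the sample times are made up for the example.  It assumes a
one-hour limit and a task that starts three hours late because of upstream
delays but then only runs for 30 minutes.

```python
from datetime import datetime, timedelta


def deadline_from_schedule(scheduled_start: datetime, sla: timedelta) -> datetime:
    """Behavior 3: the clock starts at the dag run's scheduled start time."""
    return scheduled_start + sla


def deadline_from_task_start(task_start: datetime, limit: timedelta) -> datetime:
    """Task-lifetime measurement: the clock starts when the task itself starts."""
    return task_start + limit


# Hypothetical sample values, chosen only for illustration.
scheduled_start = datetime(2023, 9, 12, 0, 0)
task_start = scheduled_start + timedelta(hours=3)  # delayed by upstream work
task_end = task_start + timedelta(minutes=30)      # the task itself is quick
limit = timedelta(hours=1)

# Measured from the scheduled start, the data is late.
missed_schedule_sla = task_end > deadline_from_schedule(scheduled_start, limit)

# Measured within the task's lifetime, nothing looks wrong.
missed_lifetime_limit = task_end > deadline_from_task_start(task_start, limit)

print(missed_schedule_sla, missed_lifetime_limit)
```

In this scenario only the schedule-based measurement reports that the data
was not "there" on time, which is the case I'm worried about losing.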

Some discussion on solutions...

I gather that the justification for removing behavior 3 is that it's too
complex / expensive.  I feel like there's got to be a way.  Thinking about
how we could implement behavior 3, there are two problems that sort of push
us into this perceived need to query everything all the time, in a manner
that is prohibitive.

The first is that the dag may not get scheduled.  We need a way to raise if
the dag is not scheduled X time after its "schedule *at*" date.  For the
moment let's call that the dag scheduling timeout.  What if, for each dag
with such a scheduling timeout, we launch a trigger that sleeps until that
timeout and then, when it runs out, runs a single query to determine whether
the dag run was scheduled -- if not, raise.

The second is the task part of things.  If the dag is assumed to be reliably
scheduled (and this is protected by the first part here), then we can, at
the time of dag run scheduling, launch triggers for the individual tasks.
They would behave the same way: run until timeout and then query.

Perhaps better yet, when a task completes, it can delete its associated
"sla" triggers, so that its sla trigger would be cancelled and no query ever
has to run.  The same thing could be done for an overall dag SLA -- at dag
run completion we could kill any trigger that is monitoring its execution
duration.
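
A very rough sketch of the "sleep until the deadline, query once, cancel on
completion" idea follows.  It uses plain asyncio rather than Airflow's
actual trigger machinery; `SlaWatch`, `is_done`, `on_miss` and the in-memory
`finished` set are all hypothetical stand-ins (the set plays the role of the
single database query), and the timeouts are shortened to fractions of a
second so the example runs quickly.

```python
import asyncio


class SlaWatch:
    """Hypothetical stand-in for an "sla trigger"; not an Airflow class."""

    def __init__(self, name, timeout_s, is_done, on_miss):
        self.name = name
        self.timeout_s = timeout_s
        self.is_done = is_done  # the single query run at the deadline
        self.on_miss = on_miss  # called only if the deadline was missed
        self._task = None

    def start(self):
        # Must be called from within a running event loop.
        self._task = asyncio.create_task(self._run())

    def cancel(self):
        # Called on completion, so the deadline query never has to run.
        if self._task is not None:
            self._task.cancel()

    async def _run(self):
        await asyncio.sleep(self.timeout_s)  # sleep until the deadline
        if not self.is_done():               # one query, only at the deadline
            self.on_miss(self.name)


async def demo():
    misses, queries, finished = [], [], set()

    def make_query(name):
        def query():
            queries.append(name)  # record that a query actually ran
            return name in finished
        return query

    fast = SlaWatch("fast_task", 0.05, make_query("fast_task"), misses.append)
    slow = SlaWatch("slow_task", 0.05, make_query("slow_task"), misses.append)
    fast.start()
    slow.start()

    # fast_task completes before its deadline and cancels its own watch.
    await asyncio.sleep(0.01)
    finished.add("fast_task")
    fast.cancel()

    # slow_task never completes, so its watch fires and runs its one query.
    await asyncio.sleep(0.1)
    return misses, queries


misses, queries = asyncio.run(demo())
print(misses, queries)
```

The point of the sketch is just that the cancelled watch costs no query at
all, and the one that fires costs exactly one.  Whether the triggerer can
carry this many sleeping triggers cheaply is something I have not verified.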

Anyway, I understand that the AIP is already accepted, but just hoping we
can have the discussion, in case we can get to a place where the existing
"expected" sla behavior can perhaps be preserved in a performant manner.

Thanks
