Some questions for you, Sung. I tried to understand why we needed to remove behavior 3 discussed in the AIP:
> *[remove]* Task-level SLA measured from DAG-run scheduled start time

I'm just a little concerned that removing this would be a mistake because, in my mind, part of the essence of an SLA is an agreement about when your data should "be there" (or the "thing" done, etc.).

The behavior replacing it seems a lot like the existing execution_timeout behavior:

> *[add]* Finally, I propose that we implement a new Task-level SLA feature,
> that is contained within the bounds of the task's lifetime, and is measured
> within the task itself.

[1] Do I understand that correctly? If so, the question I would have is: if we can already rely on execution_timeout to tell us when an individual task is taking too long, why do we need an SLA to tell us the same thing?

[2] Can you also clarify how "dag slas" are defined (behavior 1 in the AIP)? Is the duration to be applied relative to scheduled time or start time?

[3] I saw some discussion of using the triggerer component somehow in this. Is that a thing? Can you share notes on implementation? I looked on the AIP but did not find anything.

[4] The fact that task sla cannot be used with deferrable operators seems problematic to me. Users tend to expect (rightly or not) something like "parity" between deferrable and non-deferrable tasks. SLAs would seem to fall into a category where this is a reasonable expectation.

Some discussion on solutions...

I gather that the justification for removing behavior 3 is that it's too complex / expensive. I feel like there's got to be a way.

Thinking about how we could implement behavior 3, there are two problems that sort of push us into this perceived need to query everything all the time, in a manner that is prohibitive.

One is the problem that the dag may not get scheduled. We need a way to raise if the dag is not scheduled X time after its "schedule *at*" date. For the moment let's call that the dag scheduling timeout.

What if, for each dag with such a scheduling timeout, we launch a trigger that sleeps until that timeout and then, when it runs out, runs a single query to determine whether the dag run was scheduled? If not, raise.

Then, for the task part of things... well, if the dag is assumed to be reliably scheduled (and this is protected by the first part here), then we can, at the time of dag run scheduling, launch triggers for the individual tasks. And they would behave the same way: run until timeout and then query.

Perhaps better yet, when a task completes, it can delete its associated "sla" triggers, so that its sla trigger would be cancelled and no query ever has to run. The same thing could be done for an overall dag SLA: at dag run completion we could kill any trigger that is monitoring its execution duration.

Anyway, I understand that the AIP is already accepted, but I'm just hoping we can have the discussion, in case we can get to a place where the existing "expected" sla behavior can be preserved in a performant manner.

Thanks
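
P.S. To make the "sleep until timeout, then query once, cancel on completion" idea a bit more concrete, here is a rough sketch in plain asyncio. It does not use the triggerer or any Airflow API; sla_watcher, is_done and on_miss are names I made up for illustration, and the "query" is just a callback standing in for whatever lookup we would really do.

```python
import asyncio


async def sla_watcher(timeout_s, is_done, on_miss):
    """Hypothetical watcher: sleep until the deadline, then check exactly once."""
    await asyncio.sleep(timeout_s)
    if not is_done():  # stands in for the single query in the proposal
        on_miss()      # stands in for "raise" / record an SLA miss


async def demo():
    misses = []   # names of tasks whose SLA was missed
    queries = []  # names of tasks for which the check actually ran

    def make_check(name, done):
        def check():
            queries.append(name)
            return done
        return check

    # Task "a" finishes before its deadline, so its watcher gets cancelled.
    a = asyncio.create_task(
        sla_watcher(0.2, make_check("a", True), lambda: misses.append("a"))
    )
    # Task "b" never finishes, so its watcher wakes up and checks once.
    b = asyncio.create_task(
        sla_watcher(0.05, make_check("b", False), lambda: misses.append("b"))
    )

    await asyncio.sleep(0.01)  # let both watchers start sleeping
    a.cancel()                 # "a" completed: delete its sla trigger
    await asyncio.gather(a, b, return_exceptions=True)
    return misses, queries


misses, queries = asyncio.run(demo())
```

The point of the sketch is only that the cancelled watcher never runs its check, so a task that completes on time costs no query at all, while a late task costs exactly one.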