Thank you for the clarification Jarek :) I’ve updated the AIP on the Confluence page with your suggestion - please let me know what you folks think!
In summary, I think it will serve as a great way to maintain some capacity to measure a soft-timeout within a task. Obvious pros of this approach are its reliability and scalability. The downside is that I think that making it work with Deferrable Operators in an expected way will prove to be difficult. https://cwiki.apache.org/confluence/plugins/servlet/mobile?contentId=247828059#content/view/247828059 Sent from my iPhone > On Jul 4, 2023, at 3:51 PM, Jarek Potiuk <ja...@potiuk.com> wrote: > > >> >> Which forking strategy are we exactly proposing? > > The important part is that you have a separate process that will run a > separate Python interpreter so that if the task runs "C" code without a > loop, the "timer" thread will be able to stop it regardless (for timeout) > and one that can run "in-parallel" SLA. So likely it is > > local task > | - timeout fork (kills both "children" if fired) > | - sla timer (runs in parallel to task) > | - task code > > > Then when the SLA timer fires, it will just notify - but let the task_code run. > When the timeout fires it will kill both child processes. > > J. > > > > > >> On Wed, Jun 21, 2023 at 9:22 PM Sung Yun <sy...@cornell.edu> wrote: >> >> Hi Jarek, I've been mulling over the implementation of (3) task: >> time_limit_sla, and I have some follow-up questions about the >> implementation. >> >> Which forking strategy are we exactly proposing? Currently, we invoke >> task.execute_callable within the taskinstance, which we can effectively >> think of as the parent process for the sake of this discussion. 
>> >> Are we proposing: >> Structure 1 >> parent: task.execute_callable >> └ child 1: sla timer >> └ child 2: execution_timeout timer >> >> Or: >> Structure 2 >> parent: looping process that parses signals from child Processes >> └ child 1: sla timer >> └ child 2: execution_timeout timer >> └ child 3: task.execute_callable >> >> And also, are we proposing that the callbacks be executed in the child >> processes (when the timers complete) or in the parent process? >> >> Pierre: great questions... >> >>> How hard would it be to spawn them when a task runs with SLA configured as >> a >> normal workload on the worker? >>> Maybe on a dedicated queue / worker? >> >> My current thought is that having a well-abstracted subclass implementation >> of Deferrable Operator may make the most sense for now. I worry that having >> a configuration-driven way of creating sla monitoring tasks, where they are >> created behind the scenes, would create confusion in the user base. >> Especially so, if there is no dedicated worker pool that will completely >> isolate the monitoring tasks from the resource pool of normal tasks. So I'm >> curious to hear what options we would have in setting up a dedicated worker >> pool to complement this idea. >> >> Sung >> >> On Tue, Jun 20, 2023 at 2:08 PM Pierre Jeambrun <pierrejb...@gmail.com> >> wrote: >> >>> This task_sla is more and more making me think of a ‘task’ on its own. It >>> would need to be run in parallel, non blocking, not overlap between each >>> other, etc… >>> >>> How hard would it be to spawn them when a task runs with SLA configured >> as a >>> normal workload on the worker? >>> Maybe on a dedicated queue / worker? >>> >>>> On Tue 20 Jun 2023 at 16:47, Sung Yun <sy...@cornell.edu> wrote: >>> >>>> Thank you all for your continued engagement and input! 
It looks like >>>> Iaroslav's layout of 3 different labels of SLAs is helping us group >> the >>>> implementation into different categories, so I will organize my own >>>> responses in those logical groupings as well. >>>> >>>> 1. dag_sla >>>> 2. task_sla >>>> 3. task: time_limit_sla >>>> >>>> 1. dag_sla >>>> I am going to lean in on Jarek's support in driving us to agree on the >>> fact >>>> that dag_sla seems like the only one that can stay within the >> scheduler >>>> without incurring an excessive burden on the core infrastructure. >>>> >>>>> So, I totally agree about dag level slas. It's very important to have >>> it >>>> and according to Sung Yun's proposal it should be implemented not on the >>>> scheduler job level. >>>> >>>> In response to this, I want to clarify that I am specifically >>> highlighting >>>> that dag_sla is the only one that can be supported by the scheduler >> job. >>>> Dag_sla isn't a feature that exists right now, and my submission >> proposes >>>> exactly this! >>>> >>>> 2. task_sla >>>> I think Utkarsh's response really helped highlight another compounding >>>> issue with SLAs in Airflow, which is that users have such varying >>>> definitions of SLAs, and of what they want to do when that SLA is breached. >>>> On a high level, task_sla relies on a relationship between the dag_run, >>> and >>>> a specific task within that specific dag_run: it is the time between a >>>> dag_run's scheduled start time, and the actual start or end time of an >>>> individual task within that run. >>>> Hence, it is impossible for it to be computed in a distributed way that >>>> addresses all of the issues highlighted in the AIP, and it needs to be >> managed >>>> by a central process that has access to the single source of truth. >>>> As Utkarsh suggests, I think this is perhaps doable as a separate >>> process, >>>> and it probably would be much safer to do it within a separate process. 
>>>> My only concern is that we would be introducing a separate Airflow >>> process, >>>> that is strictly optional, but one that requires quite a large amount >> of >>>> investment in designing the right abstractions to meet user >> satisfaction >>>> and reliability guarantees. >>>> It will also require us to review the database's dag/dag_run/task >> tables' >>>> indexing model to make sure that continuous queries to the database >> will >>>> not overload it. >>>> This isn't simple, because we will have to select tasks in any state >>>> (FAILED, SUCCESS or RUNNING) that has not yet had their SLA evaluated, >>> from >>>> any dagRun (FAILED, SUCCESS or RUNNING), in order to make sure we don't >>>> miss any tasks - because in this paradigm, the concept of SLA >> triggering >>> is >>>> decoupled from a dagrun or task execution. >>>> A query that selects tasks in ANY state in ANY state of dag_run is >> bound >>> to >>>> be incredibly expensive - and I discuss this challenge in the >> Confluence >>>> AIP and the Google Doc. >>>> This will possibly be even more difficult to achieve, because we should >>>> have the capacity to support multiple processes since we now support >> High >>>> Availability in Airflow. >>>> So although setting up a separate process decouples the SLA evaluation >>> from >>>> the scheduler, we need to acknowledge that we may be introducing a >> heavy >>>> dependency on the metadata database. >>>> >>>> My suggestion to leverage the existing Triggerer process to design >>>> monitoring Deferrable Operators to execute SLA callbacks has the >> benefit >>> of >>>> reducing the load on the database while achieving similar goals, >> because >>> it >>>> registers the SLA monitoring operator as a TASK to the dag_run that it >> is >>>> associated with, and prevents the dag_run from completing if the SLA >> has >>>> not yet computed. 
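[Editor's note] The query-scoping difference Sung describes can be made concrete with a toy illustration. This uses a throwaway sqlite schema with made-up column names, not Airflow's real metadata model: the decoupled-process design must scan task instances across dag_runs in any state, while attaching SLA monitoring as deferred tasks means only RUNNING dag_runs ever need to be polled.

```python
# Toy schema (NOT Airflow's real one) contrasting the two poll queries.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dag_run (id INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE task_instance (
    id INTEGER PRIMARY KEY, run_id INTEGER, state TEXT, sla_checked INTEGER);
CREATE INDEX idx_dr_state ON dag_run(state);
""")
# 10,000 historical runs plus 10 still running, 3 tasks each.
for run in range(10_010):
    state = "running" if run >= 10_000 else "success"
    conn.execute("INSERT INTO dag_run VALUES (?, ?)", (run, state))
    for _ in range(3):
        conn.execute("INSERT INTO task_instance VALUES (NULL, ?, ?, 0)",
                     (run, state))

# Decoupled process: any task, any dag_run state, SLA not yet evaluated.
wide = conn.execute("""
    SELECT COUNT(*) FROM task_instance ti
    JOIN dag_run dr ON ti.run_id = dr.id
    WHERE ti.sla_checked = 0
""").fetchone()[0]

# Deferred-task design: only RUNNING dag_runs are ever in scope.
narrow = conn.execute("""
    SELECT COUNT(*) FROM task_instance ti
    JOIN dag_run dr ON ti.run_id = dr.id
    WHERE dr.state = 'running'
""").fetchone()[0]
print(wide, narrow)   # 30030 vs 30
```

The row counts are arbitrary, but the shape of the difference is the point: the first query's candidate set grows with the whole history, the second with concurrent runs only.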
This means that our query will be strictly limited to >>>> just the dagRuns in RUNNING state - this is a HUGE difference from >> having >>>> to query dagruns in all states in a separate process, because we are >>> merely >>>> attaching a few additional tasks to be executed into existing dag_runs. >>>> >>>> In summary: I'm open to this idea, I just have not been able to think >> of >>> a >>>> way to manage this without overloading the scheduler, or the database. >>>> >>>> 3. task: time_limit_sla >>>> Jarek: That sounds like a great idea that we could group into this AIP >> - >>> I >>>> will make some time to add some code snippets into the AIP to make this >>>> idea a bit clearer to everyone reading it in preparation for the vote >>>> >>>> >>>> Sung >>>> >>>> On Sun, Jun 18, 2023 at 9:38 PM utkarsh sharma <utkarshar...@gmail.com >>> >>>> wrote: >>>> >>>>>> >>>>>> This can be IMHO implemented on the task level. We currently have >>>> timeout >>>>>> implemented this way - whenever we start the task, we can have a >>> signal >>>>>> handler registered with "real" time registered that will cancel the >>>> task. >>>>>> But I can imagine similar approach with signal and propagate the >>>>>> information that task exceeded the time it has been allocated but >>> would >>>>> not >>>>>> stop it, just propagate the information (in a form of current way >> we >>> do >>>>>> callbacks for example, or maybe (even better) only run it in the >>>> context >>>>> of >>>>>> task to signal "soft timeout" per task: >>>>>> >>>>>>> signal.signal(signal.SIGALRM, self.handle_timeout) >>>>>>> signal.setitimer(signal.ITIMER_REAL, self.seconds) >>>>>> >>>>>> This has an advantage that it is fully distributed - i.e. we do not >>>> need >>>>>> anything to monitor 1000s of tasks running to decide if SLA has >> been >>>>>> breached. 
It's the task itself that will get the "soft" timeout and >>>>>> propagate it (and then whoever receives the callback can decide >> what >>> to >>>>> do >>>>>> next - and this "callback" can happen in either the task context or >>> it >>>>>> could be done in a DagFileProcessor context as we do currently - >>> though >>>>> the >>>>>> in-task processing seems much more distributed and scalable in >>> nature. >>>>>> There is one watch-out here: this is not **guaranteed** to >> work, >>>>> there >>>>>> are cases, which we have already seen, where the SIGALRM is not going to be >>>>> handled >>>>>> locally, when the task uses a long-running C-level function that is >> not >>>>>> written to react to signals generated in Python (think a >>>>>> low-level long-running Pandas C-method call that does not check >>> signals >>>>> in a >>>>>> long-running loop). That, however, could probably be handled by one >> more >>>>>> process fork - a dedicated child process that would monitor >>>>> the running >>>>>> task from a separate process - and we could actually improve both >>>>> timeout >>>>>> and SLA handling by introducing such an extra forked process to handle >>>>>> timeout/task level time_limit_sla, so IMHO this is an opportunity >> to >>>>>> improve things. >>>>>> >>>>> >>>>> >>>>> Building on what Jarek mentioned, if we can enable the scheduler to >>> emit >>>>> events for DAGs with SLA configured in cases of >>>>> 1. DAG starts executing >>>>> 2. Task - start executing (for every task) >>>>> 3. Task - stop executing (for every task) >>>>> 4. DAG stops executing >>>>> >>>>> And have a separate process (per dag run) that can keep monitoring >> such >>>>> events and execute a callback in the following circumstances: >>>>> 1. DAG level SLA miss >>>>> - When the entire DAG didn't finish in a specific time >>>>> 2. Task-level SLA miss >>>>> - Counting time from the start of the DAG to the end of a task. >>>>> - Start of a task to end of a task. 
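[Editor's note] The SIGALRM snippet Jarek quotes can be fleshed out into a minimal, self-contained "soft timeout" sketch. The SoftTimeout class and its breached flag are illustrative names, not Airflow API; this is Unix-only and, as noted in the thread, not guaranteed to fire promptly inside signal-unaware C code.

```python
# Minimal "soft timeout" sketch: the SIGALRM handler only records the
# breach (in Airflow it would invoke a callback); the task keeps running.
import signal
import time

class SoftTimeout:
    def __init__(self, seconds):
        self.seconds = seconds
        self.breached = False

    def handle_timeout(self, signum, frame):
        self.breached = True   # notify only - do NOT raise or kill the task

    def __enter__(self):
        signal.signal(signal.SIGALRM, self.handle_timeout)
        signal.setitimer(signal.ITIMER_REAL, self.seconds)
        return self

    def __exit__(self, *exc):
        signal.setitimer(signal.ITIMER_REAL, 0)        # cancel the timer
        signal.signal(signal.SIGALRM, signal.SIG_DFL)
        return False

with SoftTimeout(0.05) as st:
    time.sleep(0.2)            # the "task" outlives its soft limit
print(st.breached)             # True: SLA missed, yet the task completed
```

Since Python 3.5 (PEP 475), a handler that does not raise lets the interrupted sleep resume, so the task genuinely runs to completion after the breach is recorded.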
>>>>> >>>>> I think the above approach should address the issues listed in AIP-57 >>>>> < >>>>> >>>> >>> >> https://cwiki.apache.org/confluence/display/AIRFLOW/%5BWIP%5D+AIP-57+SLA+as+a+DAG-Level+Feature >>>>>> >>>>> 1. Since we have a separate process we no longer have to wait for the >>>>> Tasks/DAG to be in the SUCCESS/SKIPPED state or any other terminal >>> state. >>>>> In this process, we can have a loop executing periodically in >> intervals >>>> of >>>>> e.g. 1 sec to monitor SLA misses by monitoring events data and the >>>> task_instance >>>>> table. >>>>> 2. For Manually/Dataset triggered dags, we no longer have a >> dependency >>>> on a >>>>> fixed schedule; everything we need to evaluate an SLA miss is already >>>> present >>>>> in Events for that specific DAG. >>>>> 3. This approach also enables us to run callbacks on the task level. >>>>> 4. We can remove calls to sla_miss_callbacks every time >>>>> *dag_run.update_state* is called >>>>> >>>>> A couple of things I'm not sure about are: >>>>> 1. Where to execute the callbacks. Executing a callback in the same >>>> process >>>>> as the monitoring process has a downside: if the callback takes a long >>>>> time to execute, it will probably cause other SLA callbacks to be >>> delayed. >>>>> 2. Context of execution of the callback: we have to maintain the same >>> context >>>>> in which the callback is defined. >>>>> >>>>> Would love to know other people's thoughts on this :) >>>>> >>>>> Thanks, >>>>> Utkarsh Sharma >>>>> >>>>> >>>>> On Sun, Jun 18, 2023 at 4:08 PM Iaroslav Poskriakov < >>>>> yaroslavposkrya...@gmail.com> wrote: >>>>> >>>>>> I want to say that Airflow is a very popular project and the ways >> of >>>>>> calculating SLA differ, because of different business cases. >>> And >>>>> if >>>>>> it's possible we should support most of them out of the box. >>>>>> >>>>>> On Sun, Jun 18, 2023 at 13:30, Iaroslav Poskriakov < >>>>>> yaroslavposkrya...@gmail.com> wrote: >>>>>> >>>>>>> So, I totally agree about dag level slas. 
It's very important to >>> have >>>>> it >>>>>>> and according to Sung Yun's proposal it should be implemented not >> on >>>> the >>>>>>> scheduler job level. >>>>>>> >>>>>>> Regarding the second way of determining SLA: <task state STARTED> >>> --> >>>>>>> ..<doesn't matter what happened>.. --> <task state SUCCESS>. >>>>>>> It's very helpful when we want to achieve not a >> technical >>>> SLA >>>>>> but >>>>>>> a business SLA for the team which is using that DAG. Because >> between >>>>> those >>>>>>> two states anything could happen and at the end we might want to >>>>>> understand >>>>>>> the high-level SLA for the task. Because for the business, I >>>>>> guess, >>>>>>> it doesn't matter that the path of states of the task was something like: STARTED -> >>>> RUNNING >>>>> -> >>>>>>> FAILED -> RUNNING -> FAILED -> RUNNING -> SUCCESS. And when >>>>>>> something similar happens, it can be helpful to have a way >>>> of >>>>>> automatically >>>>>>> recognizing that the task crossed its expected time >>> limit. >>>>>>> >>>>>>> I agree that for the scheduler it can be too heavy. And for >>> that >>>>>>> purpose we need some process which runs in parallel >>>> with >>>>>> the >>>>>>> task. It can be one more job, for example, running on the >>> same >>>>>>> machine as the Scheduler, or not on the same. >>>>>>> >>>>>>> >>>>>>> About the third part of my proposal - time for the task in the >>>>>>> RUNNING state. I agree with you, Jarek. We can implement it on >> the >>>> task >>>>>>> level. That seems good to me. >>>>>>> >>>>>>> Yaro1 >>>>>>> >>>>>>> On Sun, Jun 18, 2023 at 08:12, Jarek Potiuk <ja...@potiuk.com>: >>>>>>> >>>>>>>> I am also for DAG level SLA only (but maybe there are some >>> twists). >>>>>>>> >>>>>>>> And I hope (since Sung Yun has not given up on that) - maybe >> that >>> is >>>>> the >>>>>>>> right time that others here will chime in and maybe it will let >>> the >>>>> vote >>>>>>>> go >>>>>>>> on? 
I think it would be great to get the SLA feature sorted out >> so >>>>> that >>>>>> we >>>>>>>> have a chance to stop answering ("yeah, we know SLA is broken, >> it >>>> has >>>>>>>> always been"). It would be nice to say "yeah the old deprecated >>> SLA >>>> is >>>>>>>> broken, but we have this new mechanism(s) that replace it". The >>> one >>>>>>>> proposed by Sung has a good chance of being such a replacement. >>>>>>>> >>>>>>>> I think having a task-level SLA managed by the Airflow framework >>>> might >>>>>>>> indeed be too costly and does not fit well in the current >>>>> architecture. >>>>>> I >>>>>>>> think attempting to monitor how long a given task runs by the >>>>> scheduler >>>>>> is >>>>>>>> simply huge overkill. Generally speaking - the scheduler (as >>>> surprising >>>>> as it >>>>>>>> might be for anyone) does not monitor executing tasks (at least >>>>>>>> principally >>>>>>>> speaking). It merely submits the tasks to execute to the executor >> and >>>> lets >>>>>>>> the executor handle all kinds of monitoring of what is being >> executed >>>>> when, >>>>>>>> and >>>>>>>> then - depending on the different types of executors there are >>>> various >>>>>>>> conditions when and how a task is executed, and various ways >>> how >>>>> you >>>>>>>> can define different kinds of task SLAs. Or at least this is >> how I >>>>> think >>>>>>>> about the distributed nature of Airflow on a "logical" level. >> Once >>>>> a task >>>>>> is >>>>>>>> queued for execution, the scheduler takes its hands off and >> turns >>>> its >>>>>>>> attention to tasks that are not yet scheduled and should be, or >>> tasks >>>>>> that >>>>>>>> are scheduled but not queued yet. >>>>>>>> >>>>>>>> But maybe some of the SLA "task" expectations can be >> implemented >>>> in a >>>>>>>> limited version serving very limited cases on a task level? >>>>>>>> >>>>>>>> Referring to what Yaro1 wrote: >>>>>>>> >>>>>>>>> 1. 
It doesn't matter for us how long we are spending time on >>> some >>>>>>>> specific >>>>>>>> task. It's important to have an understanding of the lag between >>>>>>>> the dag's execution_date and the success state of the task. We can >> call >>> it >>>>>>>> dag_sla. It's similar to the current implementation of >>> manage_slas. >>>>>>>> >>>>>>>> This is basically what Sung proposes, I believe. >>>>>>>> >>>>>>>> >>>>>>>>> 2. It's important to have an understanding of, and a way of managing, how >> long >>>>> some >>>>>>>> specific task is working. In my opinion, working is the state >>> between >>>>>> the task's >>>>>>>> last start_date and the task's first SUCCESS state after that last start_date. >>>>> So >>>>>>>> for example for the task which is placed in FAILED state we >> still >>>> have >>>>>> to >>>>>>>> check an SLA in that strategy. We can call it task_sla. >>>>>>>> >>>>>>>> I am not sure if I understand it, but if I do, then this is the >>>> "super >>>>>>>> costly" SLA processing that we should likely avoid. I would love >>> to >>>>> hear, >>>>>>>> however, what some specific use cases are that we could show >> here; >>>>> maybe >>>>>>>> there are other ways we can achieve similar things. >>>>>>>> >>>>>>>>> 3. Sometimes we need to manage time for the task in the >> RUNNING >>>>> state. >>>>>>>> We >>>>>>>> can call it time_limit_sla. >>>>>>>> >>>>>>>> This can be IMHO implemented on the task level. We currently >> have >>>>>> timeout >>>>>>>> implemented this way - whenever we start the task, we can have a >>>>> signal >>>>>>>> handler registered with "real" time that will cancel >>> the >>>>>> task. 
>>>>>>>> But I can imagine a similar approach with a signal that propagates the >>>>>> information that the task exceeded the time it has been allocated >> but >>>>> would >>>>>> not >>>>>> stop it, just propagate the information (in the form of the current >> way >>> we >>>>> do >>>>>> callbacks for example, or maybe (even better) only run it in the >>>>> context >>>>>> of >>>>>> the task to signal "soft timeout" per task: >>>>>> >>>>>>> signal.signal(signal.SIGALRM, self.handle_timeout) >>>>>>> signal.setitimer(signal.ITIMER_REAL, self.seconds) >>>>>> >>>>>> This has an advantage that it is fully distributed - i.e. we do >>> not >>>>> need >>>>>> anything to monitor 1000s of tasks running to decide if SLA has >>> been >>>>>> breached. It's the task itself that will get the "soft" timeout >>> and >>>>>> propagate it (and then whoever receives the callback can decide >>> what >>>>> to >>>>>> do >>>>>> next - and this "callback" can happen in either the task context >>> or >>>> it >>>>>> could be done in a DagFileProcessor context as we do currently - >>>>> though >>>>>> the >>>>>> in-task processing seems much more distributed and scalable in >>>> nature. >>>>>> There is one watch-out here: this is not **guaranteed** to >>> work, >>>>>> there >>>>>> are cases, which we have already seen, where the SIGALRM is not going to >> be >>>>>> handled >>>>>> locally, when the task uses a long-running C-level function that >> is >>>> not >>>>>> written to react to signals generated in Python >> (think a >>>>>> low-level long-running Pandas C-method call that does not check >>>> signals >>>>>> in >>>>>> a >>>>>> long-running loop). 
That, however, could probably be handled by one >>>> more >>>>>> process fork - a dedicated child process that would >> monitor >>>>>> the running >>>>>> task from a separate process - and we could actually improve >> both >>>>>> timeout >>>>>> and SLA handling by introducing such an extra forked process to >>> handle >>>>>> timeout/task level time_limit_sla, so IMHO this is an >> opportunity >>> to >>>>>> improve things. >>>>>> >>>>>> I would love to hear what others think about it :)? I think our >>> SLA >>>>> for >>>>>> fixing SLA is about to run out. >>>>>> >>>>>> >>>>>> J. >>>>>> >>>>>> >>>>>> On Thu, Jun 15, 2023 at 4:05 PM Sung Yun <sy...@cornell.edu> >>> wrote: >>>>>> >>>>>>> Hello! >>>>>>>>> >>>>>>>>> Thank you very much for the feedback on the proposal. I’ve >> been >>>>> hoping >>>>>>>> to >>>>>>>>> get some more traction on this proposal, so it’s great to hear >>>> from >>>>>>>> another >>>>>>>>> user of the feature. >>>>>>>>> >>>>>>>>> I understand that there’s a lot of support for keeping a >> native >>>> task >>>>>>>> level >>>>>>>>> SLA feature, and I definitely agree with that sentiment. Our >>>>>>>> organization >>>>>>>>> very much relies on Airflow to evaluate ‘task_sla’ in order to >>>> keep >>>>>>>> track >>>>>>>>> of which tasks in each dag failed to succeed by an expected >>> time. >>>>>>>>> >>>>>>>>> In the AIP I put together on the Confluence page, and in the >>>> Google >>>>>>>> docs, >>>>>>>>> I have identified why the existing implementation of the task >>>> level >>>>>> SLA >>>>>>>>> feature can be problematic and is often misleading for Airflow >>>>> users. >>>>>>>> The >>>>>>>>> feature is also quite costly for the Airflow scheduler and >>>>> dag_processor. >>>>>>>>> >>>>>>>>> In that sense, the discussion is not about whether or not >> these >>>> SLA >>>>>>>>> features are important to the users, but much more technical. 
>>> Can >>>> a >>>>>>>>> task-level feature be supported in a first-class way as a core >>>>> feature >>>>>>>> of >>>>>>>>> Airflow, or should it be implemented by the users, for example >>> as >>>>>>>>> independent tasks by leveraging Deferrable Operators? >>>>>>>>> >>>>>>>>> My current thought is that only Dag level SLAs can be >> supported >>>> in a >>>>>>>>> non-disruptive way by the scheduler, and that task level SLAs >>>> should >>>>>> be >>>>>>>>> handled outside of core Airflow infrastructure code. If you >>>> strongly >>>>>>>>> believe otherwise, I think it would be helpful if you could >>>> propose >>>>> an >>>>>>>>> alternative technical solution that solves many of the >> existing >>>>>>>> problems in >>>>>>>>> the task-level SLA feature. >>>>>>>>> >>>>>>>>> Sent from my iPhone >>>>>>>>> >>>>>>>>>> On Jun 13, 2023, at 1:10 PM, Iaroslav Poskriakov < >>>>>>>>> yaroslavposkrya...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>> Mechanism of SLA >>>>>>>>>> >>>>>>>>>> Hi, I read the previous conversation regarding SLA and I >> think >>>>>>>> removing >>>>>>>>> the >>>>>>>>>> opportunity to set sla for the task level will be a big >>> mistake. >>>>>>>>>> So, the proposed implementation of the task level SLA will >> not >>>> be >>>>>>>> working >>>>>>>>>> correctly. >>>>>>>>>> >>>>>>>>>> That's why I guess we have to think about the mechanism of >>> using >>>>>> SLA. >>>>>>>>>> >>>>>>>>>> I guess we should check three different cases in general. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> 1. It doesn't matter for us how long we are spending time on >>>> some >>>>>>>>> specific >>>>>>>>>> task. It's important to have an understanding of the lag >>> between >>>>>>>>>> the dag's execution_date and the success state of the task. We can >>>> call >>>>> it >>>>>>>>>> dag_sla. It's similar to the current implementation of >>>>> manage_slas. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> 2. It's important to have an understanding of, and a way of managing, how >>> long >>>>>> some >>>>>>>>>> specific task is working. 
In my opinion working is the state >>>>> between >>>>>>>> task >>>>>>>>>> last start_date and task first (after last start_date) >> SUCCESS >>>>>> state. >>>>>>>> So >>>>>>>>>> for example for the task which is placed in FAILED state we >>>> still >>>>>>>> have to >>>>>>>>>> check an SLA in that strategy. We can call it task_sla. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> 3. Sometimes we need to manage time for the task in the >>> RUNNING >>>>>>>> state. We >>>>>>>>>> can call it time_limit_sla. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Those three types of SLA will cover all possible cases. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> So we will have three different strategies for SLA. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I guess we can use for dag_sla that idea - >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>> >>>>> >>>> >>> >> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> For task_sla and time_limit_sla I prefer to stay with using >>>>>>>> SchedulerJob >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Github: Yaro1 >>>>>>>>> >>>>>>>>> >>>>> --------------------------------------------------------------------- >>>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org >>>>>>>>> For additional commands, e-mail: dev-h...@airflow.apache.org >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>>> >>>> -- >>>> Sung Yun >>>> Cornell Tech '20 >>>> Master of Engineering in Computer Science >>>> >>> >> >> >> -- >> Sung Yun >> Cornell Tech '20 >> Master of Engineering in Computer Science >>