Thank you for the clarification Jarek :)

I’ve updated the AIP on the Confluence page with your suggestion - please let 
me know what you folks think!

In summary, I think it will serve as a great way to retain the ability to 
measure a soft timeout within a task. The obvious pros of this approach are 
its reliability and scalability. The downside is that making it work as 
expected with Deferrable Operators will likely prove difficult.

https://cwiki.apache.org/confluence/plugins/servlet/mobile?contentId=247828059#content/view/247828059

Sent from my iPhone

> On Jul 4, 2023, at 3:51 PM, Jarek Potiuk <ja...@potiuk.com> wrote:
> 
> 
>> 
>> Which forking strategy are we exactly proposing?
> 
> The important part is that you have a separate process that will run a
> separate Python interpreter, so that if the task runs "C" code without a
> loop, the "timer" thread will be able to stop it regardless (for timeout),
> and one that can run an "in-parallel" SLA. So likely it is
> 
> local task
> | - timeout fork (kills both "children" if fired)
>    | - sla timer (runs in parallel to task)
>    | - task code
> 
> 
> Then when the SLA timer fires, it will just notify - but let the task_code run.
> When timeout fires it will kill both child processes.
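> For illustration, here is a minimal sketch of that process tree (POSIX-only;
> all names and timings are assumptions, not actual Airflow code):
> 
>     import os
>     import signal
>     import sys
>     import time
> 
>     TIMEOUT, SLA = 10, 5       # hard timeout and soft SLA, in seconds
> 
>     task_pid = os.fork()
>     if task_pid == 0:          # child: the task code itself
>         time.sleep(8)          # stand-in for the real task
>         sys.exit(0)
> 
>     sla_pid = os.fork()
>     if sla_pid == 0:           # child: the SLA timer, runs in parallel
>         time.sleep(SLA)
>         print("SLA breached - notify, but let the task run")
>         sys.exit(0)
> 
>     def handle_timeout(signum, frame):
>         # Hard timeout fired: kill both children.
>         for pid in (task_pid, sla_pid):
>             os.kill(pid, signal.SIGKILL)
>         sys.exit(1)
> 
>     signal.signal(signal.SIGALRM, handle_timeout)
>     signal.setitimer(signal.ITIMER_REAL, TIMEOUT)
>     os.waitpid(task_pid, 0)                  # wait for the task
>     signal.setitimer(signal.ITIMER_REAL, 0)  # task done: cancel timeout
>     os.kill(sla_pid, signal.SIGKILL)         # and stop the SLA timer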
> 
> J.
> 
> 
> 
> 
> 
>> On Wed, Jun 21, 2023 at 9:22 PM Sung Yun <sy...@cornell.edu> wrote:
>> 
>> Hi Jarek, I've been mulling over the implementation of (3) task:
>> time_limit_sla, and I have some follow-up questions about the
>> implementation.
>> 
>> Which forking strategy are we exactly proposing? Currently, we invoke
>> task.execute_callable within the taskinstance, which we can effectively
>> think of as the parent process for the sake of this discussion.
>> 
>> Are we proposing:
>> Structure 1
>> parent: task.execute_callable
>> └ child 1: sla timer
>> └ child 2: execution_timeout timer
>> 
>> Or:
>> Structure 2
>> parent: looping process that parses signals from child Processes
>> └ child 1: sla timer
>> └ child 2: execution_timeout timer
>> └ child 3: task.execute_callable
>> 
>> And also, are we proposing that the callbacks be executed in the child
>> processes (when the timers complete) or in the parent process?
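>> 
>> For concreteness, Structure 2 could look roughly like this (a sketch;
>> the message names and timings are assumptions, not a proposed API):
>> 
>>     import multiprocessing as mp
>>     import time
>> 
>>     def sla_timer(q, seconds):
>>         time.sleep(seconds)
>>         q.put("sla_missed")          # notify parent; task keeps running
>> 
>>     def execution_timeout_timer(q, seconds):
>>         time.sleep(seconds)
>>         q.put("execution_timeout")   # parent should kill the task
>> 
>>     def task(q):
>>         time.sleep(8)                # stand-in for task.execute_callable
>>         q.put("task_done")
>> 
>>     q = mp.Queue()
>>     children = {
>>         "sla": mp.Process(target=sla_timer, args=(q, 2)),
>>         "timeout": mp.Process(target=execution_timeout_timer, args=(q, 5)),
>>         "task": mp.Process(target=task, args=(q,)),
>>     }
>>     for p in children.values():
>>         p.start()
>> 
>>     # Parent loop: parse signals from the child processes (Structure 2).
>>     while True:
>>         msg = q.get()
>>         if msg == "sla_missed":
>>             print("SLA callback would run here")   # task keeps going
>>         elif msg in ("execution_timeout", "task_done"):
>>             children["task"].terminate()
>>             break
>> 
>>     for p in children.values():
>>         p.terminate()
>>         p.join()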
>> 
>> Pierre: great questions...
>> 
>>> How hard would it be to spawn them as normal workloads on the worker
>>> when a task with an SLA configured runs?
>>> Maybe on a dedicated queue / worker ?
>> 
>> My current thought is that having a well-abstracted subclass implementation
>> of Deferrable Operator may make the most sense for now. I worry that having
>> a configuration-driven way of creating sla monitoring tasks, where they are
>> created behind the scenes, would create confusion in the user base.
>> Especially so, if there is no dedicated worker pool that will completely
>> isolate the monitoring tasks from the resource pool of normal tasks. So I'm
>> curious to hear what options we would have in setting up a dedicated worker
>> pool to complement this idea.
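>> 
>> A rough sketch of what such a subclass could look like (class and
>> parameter names are hypothetical; only BaseOperator, DateTimeTrigger and
>> defer() are real Airflow APIs):
>> 
>>     from datetime import timedelta
>> 
>>     from airflow.models.baseoperator import BaseOperator
>>     from airflow.triggers.temporal import DateTimeTrigger
>> 
>>     class SlaMonitorOperator(BaseOperator):
>>         """Hypothetical: defers until the SLA deadline, freeing the
>>         worker slot, then runs the user's callback when resumed."""
>> 
>>         def __init__(self, *, sla_delta: timedelta, sla_callback, **kwargs):
>>             super().__init__(**kwargs)
>>             self.sla_delta = sla_delta
>>             self.sla_callback = sla_callback
>> 
>>         def execute(self, context):
>>             # The anchor point is an assumption - it could equally be
>>             # the data interval end.
>>             deadline = context["dag_run"].start_date + self.sla_delta
>>             self.defer(
>>                 trigger=DateTimeTrigger(moment=deadline),
>>                 method_name="execute_complete",
>>             )
>> 
>>         def execute_complete(self, context, event=None):
>>             # Resumed by the Triggerer after the deadline has passed.
>>             self.sla_callback(context)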
>> 
>> Sung
>> 
>> On Tue, Jun 20, 2023 at 2:08 PM Pierre Jeambrun <pierrejb...@gmail.com>
>> wrote:
>> 
>>> This task_sla is more and more making me think of a ‘task’ of its own. It
>>> would need to run in parallel, be non-blocking, not overlap with the
>>> others, etc…
>>> 
>>> How hard would it be to spawn them as normal workloads on the worker
>>> when a task with an SLA configured runs?
>>> Maybe on a dedicated queue / worker ?
>>> 
>>>> On Tue 20 Jun 2023 at 16:47, Sung Yun <sy...@cornell.edu> wrote:
>>> 
>>>> Thank you all for your continued engagement and input! It looks like
>>>> Iaroslav's layout of 3 different labels of SLAs is helping us group the
>>>> implementation into different categories, so I will organize my own
>>>> responses in those logical groupings as well.
>>>> 
>>>> 1. dag_sla
>>>> 2. task_sla
>>>> 3. task: time_limit_sla
>>>> 
>>>> 1. dag_sla
>>>> I am going to lean on Jarek's support in driving us to agree that
>>>> dag_sla seems like the only one that can stay within the scheduler
>>>> without incurring an excessive burden on the core infrastructure.
>>>> 
>>>>> So, I totally agree about dag level slas. It's very important to have it
>>>>> and according to Sung Yun's proposal it should be implemented not on the
>>>>> scheduler job level.
>>>> 
>>>> In response to this, I want to clarify that I am specifically
>>>> highlighting that dag_sla is the only one that can be supported by the
>>>> scheduler job. Dag_sla isn't a feature that exists right now, and my
>>>> submission proposes exactly this!
>>>> 
>>>> 2. task_sla
>>>> I think Utkarsh's response really helped highlight another compounding
>>>> issue with SLAs in Airflow, which is that users have varying definitions
>>>> of SLAs, and of what they want to do when an SLA is breached.
>>>> On a high level, task_sla relies on a relationship between the dag_run
>>>> and a specific task within that specific dag_run: it is the time between
>>>> a dag_run's scheduled start time and the actual start or end time of an
>>>> individual task within that run.
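>>>> In code terms, the check is roughly the following (a sketch; the exact
>>>> anchor attribute for the "scheduled start time" is an assumption):
>>>> 
>>>>     # dag_run / task_instance are ORM rows, sla is a timedelta
>>>>     anchor = dag_run.data_interval_end        # assumed anchor point
>>>>     elapsed = task_instance.end_date - anchor
>>>>     sla_missed = elapsed > sla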
>>>> Hence, it is impossible for it to be computed in a distributed way that
>>>> addresses all of the issues highlighted in the AIP; it needs to be
>>>> managed by a central process that has access to the single source of
>>>> truth.
>>>> As Utkarsh suggests, I think this is perhaps doable - and probably much
>>>> safer - as a separate process.
>>>> My only concern is that we would be introducing a separate Airflow
>>>> process that is strictly optional, but one that requires quite a large
>>>> amount of investment in designing the right abstractions to meet user
>>>> satisfaction and reliability guarantees.
>>>> It will also require us to review the indexing model of the database's
>>>> dag/dag_run/task tables to make sure that continuous queries will not
>>>> overload it.
>>>> This isn't simple, because we will have to select tasks in any state
>>>> (FAILED, SUCCESS or RUNNING) that have not yet had their SLA evaluated,
>>>> from any dagRun (FAILED, SUCCESS or RUNNING), in order to make sure we
>>>> don't miss any tasks - because in this paradigm, the concept of SLA
>>>> triggering is decoupled from a dagrun or task execution.
>>>> A query that selects tasks in ANY state from dag_runs in ANY state is
>>>> bound to be incredibly expensive - and I discuss this challenge in the
>>>> Confluence AIP and the Google Doc.
>>>> This will possibly be even more difficult to achieve, because we should
>>>> have the capacity to support multiple such processes now that we support
>>>> High Availability in Airflow.
>>>> So although setting up a separate process decouples the SLA evaluation
>>>> from the scheduler, we need to acknowledge that we may be introducing a
>>>> heavy dependency on the metadata database.
>>>> 
>>>> My suggestion to leverage the existing Triggerer process and design
>>>> monitoring Deferrable Operators that execute SLA callbacks has the
>>>> benefit of reducing the load on the database while achieving similar
>>>> goals, because it registers the SLA monitoring operator as a TASK in the
>>>> dag_run it is associated with, and prevents the dag_run from completing
>>>> if the SLA has not yet been computed. This means that our query will be
>>>> strictly limited to just the dagRuns in the RUNNING state - a HUGE
>>>> difference from having to query dag_runs in all states from a separate
>>>> process, because we are merely attaching a few additional tasks to be
>>>> executed in existing dag_runs.
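>>>> To make the difference in query shape concrete (a sketch in SQLAlchemy;
>>>> the sla_evaluated column is hypothetical, and `session` is assumed to be
>>>> an open ORM session):
>>>> 
>>>>     from airflow.models import DagRun, TaskInstance
>>>>     from airflow.utils.state import DagRunState
>>>> 
>>>>     # Separate-process approach: scan task instances across dag_runs in
>>>>     # ANY state that have not had their SLA evaluated yet (expensive).
>>>>     unevaluated = (
>>>>         session.query(TaskInstance)
>>>>         .filter(TaskInstance.sla_evaluated.is_(False))  # hypothetical
>>>>         .all()
>>>>     )
>>>> 
>>>>     # Triggerer approach: the monitor is itself a task, so only RUNNING
>>>>     # dag_runs ever need to be looked at (much cheaper to index).
>>>>     running = (
>>>>         session.query(DagRun)
>>>>         .filter(DagRun.state == DagRunState.RUNNING)
>>>>         .all()
>>>>     )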
>>>> 
>>>> In summary: I'm open to this idea, I just have not been able to think of
>>>> a way to manage it without overloading the scheduler or the database.
>>>> 
>>>> 3. task: time_limit_sla
>>>> Jarek: That sounds like a great idea that we could group into this AIP -
>>>> I will make some time to add some code snippets to the AIP to make this
>>>> idea a bit clearer to everyone reading it in preparation for the vote.
>>>> 
>>>> 
>>>> Sung
>>>> 
>>>> On Sun, Jun 18, 2023 at 9:38 PM utkarsh sharma <utkarshar...@gmail.com>
>>>> wrote:
>>>> 
>>>>>> 
>>>>>> This can be IMHO implemented on the task level. We currently have
>>>>>> timeout implemented this way - whenever we start the task, we can have
>>>>>> a signal handler registered with "real" time that will cancel the
>>>>>> task. But I can imagine a similar approach with a signal that
>>>>>> propagates the information that the task exceeded the time it has been
>>>>>> allocated but does not stop it - just propagates the information (in
>>>>>> the form of the current way we do callbacks, for example), or maybe
>>>>>> (even better) only runs in the context of the task to signal a "soft
>>>>>> timeout" per task:
>>>>>> 
>>>>>>>            signal.signal(signal.SIGALRM, self.handle_timeout)
>>>>>>>            signal.setitimer(signal.ITIMER_REAL, self.seconds)
>>>>>> 
>>>>>> This has an advantage that it is fully distributed - i.e. we do not
>>>>>> need anything to monitor 1000s of running tasks to decide if an SLA
>>>>>> has been breached. It's the task itself that will get the "soft"
>>>>>> timeout and propagate it (and then whoever receives the callback can
>>>>>> decide what to do next - and this "callback" can happen either in the
>>>>>> task context or in a DagFileProcessor context as we do currently -
>>>>>> though the in-task processing seems much more distributed and scalable
>>>>>> in nature).
>>>>>> There is one watch-out here: this is not **guaranteed** to work. There
>>>>>> are cases, which we have already seen, where the SIGALRM is not going
>>>>>> to be handled locally - when the task uses a long-running C-level
>>>>>> function that is not written to react to signals generated in Python
>>>>>> (think a long-running Pandas C-method call that does not check signals
>>>>>> in a long-running loop). That however could probably be handled by one
>>>>>> more process fork - having a dedicated child process that would
>>>>>> monitor running tasks from a separate process - and we could actually
>>>>>> improve both timeout and SLA handling by introducing such an extra
>>>>>> forked process to handle timeout / task-level time_limit_sla, so IMHO
>>>>>> this is an opportunity to improve things.
>>>>>> 
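>>>>> For reference, a minimal runnable sketch of the "soft timeout" signal
>>>>> approach described above (the class and callback names are assumed,
>>>>> not Airflow API):
>>>>> 
>>>>>     import signal
>>>>>     import time
>>>>> 
>>>>>     class SoftTimeout:
>>>>>         """Notify on an exceeded time budget without killing the task."""
>>>>> 
>>>>>         def __init__(self, seconds, on_breach):
>>>>>             self.seconds = seconds
>>>>>             self.on_breach = on_breach
>>>>> 
>>>>>         def handle_timeout(self, signum, frame):
>>>>>             self.on_breach()   # notify only - do NOT raise
>>>>> 
>>>>>         def start(self):
>>>>>             signal.signal(signal.SIGALRM, self.handle_timeout)
>>>>>             signal.setitimer(signal.ITIMER_REAL, self.seconds)
>>>>> 
>>>>>         def cancel(self):
>>>>>             signal.setitimer(signal.ITIMER_REAL, 0)
>>>>> 
>>>>>     monitor = SoftTimeout(2.0, lambda: print("soft SLA breached"))
>>>>>     monitor.start()
>>>>>     time.sleep(3)              # stand-in for the task code
>>>>>     monitor.cancel()
>>>>> 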
>>>>> 
>>>>> 
>>>>> Building on what Jarek mentioned: if we can enable the scheduler to
>>>>> emit events for DAGs with an SLA configured in the following cases
>>>>> 1. DAG starts executing
>>>>> 2. Task starts executing (for every task)
>>>>> 3. Task stops executing (for every task)
>>>>> 4. DAG stops executing
>>>>> 
>>>>> and have a separate process (per dag run) that keeps monitoring such
>>>>> events and executes a callback in the following circumstances:
>>>>> 1. DAG-level SLA miss
>>>>>    - When the entire DAG didn't finish in a specific time
>>>>> 2. Task-level SLA miss
>>>>>    - Counting time from the start of the DAG to the end of a task.
>>>>>    - Start of a task to end of a task.
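>>>>> 
>>>>> A rough sketch of one pass of such a monitoring loop (the event kinds
>>>>> and structures are assumptions, nothing here is existing Airflow API):
>>>>> 
>>>>>     from dataclasses import dataclass
>>>>>     from datetime import datetime, timedelta
>>>>> 
>>>>>     @dataclass
>>>>>     class Event:            # hypothetical event record
>>>>>         kind: str           # "dag_start", "task_start", "task_end", "dag_end"
>>>>>         task_id: str        # empty for DAG-level events
>>>>>         at: datetime
>>>>> 
>>>>>     def check_slas(events, dag_sla: timedelta, task_slas: dict):
>>>>>         """One pass of the periodic loop: fire callbacks on misses."""
>>>>>         dag_start = next(e.at for e in events if e.kind == "dag_start")
>>>>>         now = datetime.utcnow()
>>>>> 
>>>>>         # 1. DAG-level SLA miss: the whole DAG didn't finish in time.
>>>>>         if not any(e.kind == "dag_end" for e in events):
>>>>>             if now - dag_start > dag_sla:
>>>>>                 print("DAG-level SLA missed")
>>>>> 
>>>>>         # 2. Task-level SLA miss: dag start -> task end variant.
>>>>>         for task_id, sla in task_slas.items():
>>>>>             done = any(e.kind == "task_end" and e.task_id == task_id
>>>>>                        for e in events)
>>>>>             if not done and now - dag_start > sla:
>>>>>                 print(f"task-level SLA missed for {task_id}")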
>>>>> I think the above approach should address the issues listed in AIP-57
>>>>> <https://cwiki.apache.org/confluence/display/AIRFLOW/%5BWIP%5D+AIP-57+SLA+as+a+DAG-Level+Feature>
>>>>> 1. Since we have a separate process, we no longer have to wait for the
>>>>> tasks/DAG to be in the SUCCESS/SKIPPED state or any other terminal
>>>>> state. In this process, we can have a loop executing periodically at
>>>>> intervals of e.g. 1 sec to monitor SLA misses by watching the events
>>>>> data and the task_instance table.
>>>>> 2. For manually/Dataset-triggered dags, we no longer have a dependency
>>>>> on a fixed schedule; everything we need to evaluate an SLA miss is
>>>>> already present in the events for that specific DAG.
>>>>> 3. This approach also enables us to run callbacks on the task level.
>>>>> 4. We can remove the calls to sla_miss_callbacks made every time
>>>>> *dag_run.update_state* is called.
>>>>> 
>>>>> A couple of things I'm not sure about are:
>>>>> 1. Where to execute the callbacks. Executing a callback in the
>>>>> monitoring process itself has a downside: if the callback takes a long
>>>>> time to execute, it will probably delay other SLA callbacks.
>>>>> 2. The execution context of the callback - we have to maintain the same
>>>>> context in which the callback was defined.
>>>>> 
>>>>> Would love to know other people's thoughts on this :)
>>>>> 
>>>>> Thanks,
>>>>> Utkarsh Sharma
>>>>> 
>>>>> 
>>>>> On Sun, Jun 18, 2023 at 4:08 PM Iaroslav Poskriakov <
>>>>> yaroslavposkrya...@gmail.com> wrote:
>>>>> 
>>>>>> I want to say that Airflow is a very popular project and the ways of
>>>>>> calculating SLAs differ, because of different business cases. And if
>>>>>> it's possible, we should support most of them out of the box.
>>>>>> 
>>>>>> On Sun, Jun 18, 2023 at 1:30 PM, Iaroslav Poskriakov <
>>>>>> yaroslavposkrya...@gmail.com>:
>>>>>> 
>>>>>>> So, I totally agree about dag level slas. It's very important to
>>>>>>> have it and according to Sung Yun's proposal it should be implemented
>>>>>>> not on the scheduler job level.
>>>>>>> 
>>>>>>> Regarding the second way of determining SLA: <task state STARTED>
>>>>>>> --> ..<doesn't matter what happened>.. --> <task state SUCCESS>.
>>>>>>> It's very helpful when we want to achieve not a technical SLA but a
>>>>>>> business SLA for the team which is using that DAG. Between those two
>>>>>>> states anything could happen, and in the end we may want to
>>>>>>> understand the high-level SLA for the task, because it doesn't matter
>>>>>>> for the business, I guess, that the path of task states was something
>>>>>>> like: STARTED -> RUNNING -> FAILED -> RUNNING -> FAILED -> RUNNING ->
>>>>>>> SUCCESS. When something like that happens, it can be helpful to be
>>>>>>> able to automatically recognize that the task crossed its expected
>>>>>>> time limit.
>>>>>>> 
>>>>>>> I agree that for the scheduler it can be too heavy. Also, for that
>>>>>>> purpose we need to have some process running in parallel with the
>>>>>>> task. It can be one more job, for example, running on the same
>>>>>>> machine as the Scheduler, or not on the same.
>>>>>>> 
>>>>>>> 
>>>>>>> About the third part of my proposal - time for the task in the
>>>>>>> RUNNING state: I agree with you, Jarek. We can implement it on the
>>>>>>> task level. For me it seems good.
>>>>>>> 
>>>>>>> Yaro1
>>>>>>> 
>>>>>>> On Sun, Jun 18, 2023 at 8:12 AM, Jarek Potiuk <ja...@potiuk.com>:
>>>>>>> 
>>>>>>>> I am also for DAG-level SLA only (but maybe there are some twists).
>>>>>>>> 
>>>>>>>> And I hope (since Sung Yun has not given up on that) that maybe this
>>>>>>>> is the right time for others here to chime in, and maybe it will let
>>>>>>>> the vote go on? I think it would be great to get the SLA feature
>>>>>>>> sorted out so that we have a chance to stop answering "yeah, we know
>>>>>>>> SLA is broken, it always has been". It would be nice to say "yeah,
>>>>>>>> the old deprecated SLA is broken, but we have this new mechanism(s)
>>>>>>>> that replaces it". The one proposed by Sung has a good chance of
>>>>>>>> being such a replacement.
>>>>>>>> 
>>>>>>>> I think having a task-level SLA managed by the Airflow framework
>>>>>>>> might indeed be too costly and does not fit well in the current
>>>>>>>> architecture. Attempting to have the scheduler monitor how long a
>>>>>>>> given task runs is simply huge overkill. Generally speaking, the
>>>>>>>> scheduler (as surprising as it might be for anyone) does not monitor
>>>>>>>> executing tasks (at least in principle). It merely submits the tasks
>>>>>>>> to the executor and lets the executor handle all kinds of monitoring
>>>>>>>> of what is being executed when; then, depending on the type of
>>>>>>>> executor, there are various conditions for when and how a task is
>>>>>>>> executed, and various ways you can define different kinds of task
>>>>>>>> SLAs. Or at least this is how I think about the distributed nature
>>>>>>>> of Airflow on a "logical" level. Once a task is queued for
>>>>>>>> execution, the scheduler takes its hands off and turns its attention
>>>>>>>> to tasks that are not yet scheduled and should be, or tasks that are
>>>>>>>> scheduled but not yet queued.
>>>>>>>> 
>>>>>>>> But maybe some of the SLA "task" expectations can be implemented in
>>>>>>>> a limited version serving very limited cases on a task level?
>>>>>>>> 
>>>>>>>> Referring to what Yaro1 wrote:
>>>>>>>> 
>>>>>>>>> 1. It doesn't matter for us how long we are spending time on some
>>>>>>>>> specific task. It's important to have an understanding of the lag
>>>>>>>>> between the execution_date of the dag and the success state of the
>>>>>>>>> task. We can call it dag_sla. It's similar to the current
>>>>>>>>> implementation of manage_slas.
>>>>>>>> 
>>>>>>>> This is basically what Sung proposes, I believe.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> 2. It's important to have an understanding of, and to manage, how
>>>>>>>>> long some specific task is working. In my opinion "working" is the
>>>>>>>>> state between the task's last start_date and the task's first
>>>>>>>>> (after the last start_date) SUCCESS state. So, for example, for a
>>>>>>>>> task placed in the FAILED state we still have to check the SLA in
>>>>>>>>> that strategy. We can call it task_sla.
>>>>>>>> 
>>>>>>>> I am not sure if I understand it, but if I do, then this is the
>>>>>>>> "super costly" SLA processing that we should likely avoid. I would
>>>>>>>> love to hear, however, what some specific use cases are that we
>>>>>>>> could show here; maybe there are other ways we can achieve similar
>>>>>>>> things.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> 3. Sometimes we need to manage the time a task spends in the
>>>>>>>>> RUNNING state. We can call it time_limit_sla.
>>>>>>>> 
>>>>>>>> This can be IMHO implemented on the task level. We currently have
>>>>>>>> timeout implemented this way - whenever we start the task, we can
>>>>>>>> have a signal handler registered with "real" time that will cancel
>>>>>>>> the task. But I can imagine a similar approach with a signal that
>>>>>>>> propagates the information that the task exceeded the time it has
>>>>>>>> been allocated but does not stop it - just propagates the
>>>>>>>> information (in the form of the current way we do callbacks, for
>>>>>>>> example), or maybe (even better) only runs in the context of the
>>>>>>>> task to signal a "soft timeout" per task:
>>>>>>>> 
>>>>>>>>>            signal.signal(signal.SIGALRM, self.handle_timeout)
>>>>>>>>>            signal.setitimer(signal.ITIMER_REAL, self.seconds)
>>>>>>>> 
>>>>>>>> This has an advantage that it is fully distributed - i.e. we do not
>>>>>>>> need anything to monitor 1000s of running tasks to decide if an SLA
>>>>>>>> has been breached. It's the task itself that will get the "soft"
>>>>>>>> timeout and propagate it (and then whoever receives the callback can
>>>>>>>> decide what to do next - and this "callback" can happen either in
>>>>>>>> the task context or in a DagFileProcessor context as we do currently
>>>>>>>> - though the in-task processing seems much more distributed and
>>>>>>>> scalable in nature).
>>>>>>>> There is one watch-out here: this is not **guaranteed** to work.
>>>>>>>> There are cases, which we have already seen, where the SIGALRM is
>>>>>>>> not going to be handled locally - when the task uses a long-running
>>>>>>>> C-level function that is not written to react to signals generated
>>>>>>>> in Python (think a long-running Pandas C-method call that does not
>>>>>>>> check signals in a long-running loop). That however could probably
>>>>>>>> be handled by one more process fork - having a dedicated child
>>>>>>>> process that would monitor running tasks from a separate process -
>>>>>>>> and we could actually improve both timeout and SLA handling by
>>>>>>>> introducing such an extra forked process to handle timeout /
>>>>>>>> task-level time_limit_sla, so IMHO this is an opportunity to improve
>>>>>>>> things.
>>>>>>>> 
>>>>>>>> I would love to hear what others think about it :) I think our SLA
>>>>>>>> for fixing SLA is about to run out.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> J.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, Jun 15, 2023 at 4:05 PM Sung Yun <sy...@cornell.edu> wrote:
>>>>>>>> 
>>>>>>>>> Hello!
>>>>>>>>> 
>>>>>>>>> Thank you very much for the feedback on the proposal. I’ve been
>>>>>>>>> hoping to get some more traction on this proposal, so it’s great to
>>>>>>>>> hear from another user of the feature.
>>>>>>>>> 
>>>>>>>>> I understand that there’s a lot of support for keeping a native
>>>>>>>>> task-level SLA feature, and I definitely agree with that sentiment.
>>>>>>>>> Our organization very much relies on Airflow to evaluate ‘task_sla’
>>>>>>>>> in order to keep track of which tasks in each dag failed to succeed
>>>>>>>>> by an expected time.
>>>>>>>>> 
>>>>>>>>> In the AIP I put together on the Confluence page, and in the Google
>>>>>>>>> docs, I have identified why the existing implementation of the
>>>>>>>>> task-level SLA feature can be problematic and is often misleading
>>>>>>>>> for Airflow users. The feature is also quite costly for the Airflow
>>>>>>>>> scheduler and dag_processor.
>>>>>>>>> 
>>>>>>>>> In that sense, the discussion is not about whether or not these SLA
>>>>>>>>> features are important to the users, but much more technical: can a
>>>>>>>>> task-level feature be supported in a first-class way as a core
>>>>>>>>> feature of Airflow, or should it be implemented by the users, for
>>>>>>>>> example as independent tasks leveraging Deferrable Operators?
>>>>>>>>> 
>>>>>>>>> My current thought is that only Dag-level SLAs can be supported in
>>>>>>>>> a non-disruptive way by the scheduler, and that task-level SLAs
>>>>>>>>> should be handled outside of core Airflow infrastructure code. If
>>>>>>>>> you strongly believe otherwise, I think it would be helpful if you
>>>>>>>>> could propose an alternative technical solution that solves many of
>>>>>>>>> the existing problems in the task-level SLA feature.
>>>>>>>>> 
>>>>>>>>> Sent from my iPhone
>>>>>>>>> 
>>>>>>>>>> On Jun 13, 2023, at 1:10 PM, Ярослав Поскряков <
>>>>>>>>>> yaroslavposkrya...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Mechanism of SLA
>>>>>>>>>> 
>>>>>>>>>> Hi, I read the previous conversation regarding SLA and I think
>>>>>>>>>> removing the opportunity to set an SLA at the task level would be
>>>>>>>>>> a big mistake.
>>>>>>>>>> So, the proposed implementation of the task-level SLA will not
>>>>>>>>>> work correctly.
>>>>>>>>>> 
>>>>>>>>>> That's why I guess we have to think about the mechanism of using
>>>>>>>>>> SLA.
>>>>>>>>>> 
>>>>>>>>>> I guess we should check three different cases in general.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 1. It doesn't matter for us how long we are spending time on some
>>>>>>>>>> specific task. It's important to have an understanding of the lag
>>>>>>>>>> between the execution_date of the dag and the success state of the
>>>>>>>>>> task. We can call it dag_sla. It's similar to the current
>>>>>>>>>> implementation of manage_slas.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 2. It's important to have an understanding of, and to manage, how
>>>>>>>>>> long some specific task is working. In my opinion "working" is the
>>>>>>>>>> state between the task's last start_date and the task's first
>>>>>>>>>> (after the last start_date) SUCCESS state. So, for example, for a
>>>>>>>>>> task placed in the FAILED state we still have to check the SLA in
>>>>>>>>>> that strategy. We can call it task_sla.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 3. Sometimes we need to manage the time a task spends in the
>>>>>>>>>> RUNNING state. We can call it time_limit_sla.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Those three types of SLA will cover all possible cases.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> So we will have three different strategies for SLA.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I guess we can use that idea for dag_sla -
>>>>>>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> For task_sla and time_limit_sla I prefer to stay with using
>>>>>>>>>> SchedulerJob.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Github: Yaro1
>>>>>>>>> 
>>>>>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
>>>>>>>>> For additional commands, e-mail: dev-h...@airflow.apache.org
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Sung Yun
>>>> Cornell Tech '20
>>>> Master of Engineering in Computer Science
>>>> 
>>> 
>> 
>> 
>> --
>> Sung Yun
>> Cornell Tech '20
>> Master of Engineering in Computer Science
>> 
