Re: [DISCUSS] Mechanism of SLA

Sung Yun Wed, 20 Sep 2023 12:47:52 -0700

Hi Damian, it's so great to hear from you - it's been a minute! I
personally hadn't heard of OLA, but that's very helpful context. So in a
sense I think OLAs are less-official SLAs that are used loosely in an
interdepartmental setting? I think the official terminology Google's SRE
team uses to distinguish a contractual vs non-contractual agreement is
actually SLA (Service-Level Agreement) vs SLO (Service-Level Objective), so
I find it interesting that there's yet another term for something very
similar!

On that note, I think nomenclature is underrated but incredibly important
in engineering, because if we choose a name poorly, we leave users to
imagine for themselves what the feature is 'supposed to do'. In an
open community like this, that can be a recipe for the feature to degrade
quickly over time, because its responsibility ends up poorly
defined. And I think that may be what has happened with this feature
- there are vastly different ideas of what this 'SLA' feature is supposed
to be doing at this point in time. So like you said, whatever feature we
end up building a consensus on this time around, it may be valuable to
give it a name that speaks for itself.

Maybe the next steps we need to take here are:

1. Decide with intentionality what user requirements we are solving for in
the refactored feature (and which ones we are not) - I think Daniel has
already begun to take a proper stab at this by asking the room of users
what they actually need for a delay-detecting feature to be
usable. I will also try to gather more data points beyond what I have
scraped so far online.
2. Evaluate in which component we would need to build the feature. In this
discussion, we should keep in mind the following topics: how much of the
scheduler / dag file processor's capacity we think is fine to take up for
evaluating the delays and their callbacks, and the function signature /
interface of the callback functions.
3. And lastly, decide on what the proper name should be for the
reliability feature that we ended up with.
Iterate over 1-3 again if the result is not satisfactory.

And yes Daniel, I noticed that about the Keynote Panel as well - and I am
very excited to get some more traction on this discussion. Also, sorry
about the confusion over start date vs expected start date - I thought I had
made it clear in an earlier response that I am okay with either measure, so I
was a bit confused about what points you were referring to when you said
'this is just a timeout':

Here's my original response: "[2] In the accepted version of the AIP,
the DAG SLA is defined from the start_date of the DAG. I had initially
considered for it to be evaluated from the scheduled start time
(data_interval_end), but was stuck on how we could avoid false alarms when
dag runs are cleared, which was discussed in the Google Doc that was
circulated
<https://docs.google.com/document/d/1drNaYmAy6GqC4WGGn4MNt6VqbOwVNm7jPfmr5Pc52AU/edit#heading=h.z45mqf5nt94>
before the AIP was drafted. I've submitted a small change this week to introduce
the dag_run clear_number attribute
<https://github.com/apache/airflow/pull/34126> which would actually
mitigate this issue. So if we feel that using the data_interval_end over
the start_date for a SCHEDULED DagRunType is more appropriate, we can make
that change on the open DAG SLA PR
<https://github.com/apache/airflow/pull/33532>."

TLDR: There was an edge case that hadn't been thought through, which made
delay monitoring based on the expected start time a poor option for scheduled dag
runs. Now that the clear_number change is in, I'm fine with going with either
definition, whichever we think fulfills our requirements better.
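
To make the two definitions concrete, here is a minimal sketch in plain
Python (not Airflow's API). The function names, the one-hour allowance and
the example timestamps are all hypothetical and chosen only for
illustration; the only terms taken from the discussion are
data_interval_end and start_date.

```python
from datetime import datetime, timedelta


def deadline_from_expected_start(data_interval_end: datetime, allowed: timedelta) -> datetime:
    """Deadline measured from when the run *should* have started."""
    return data_interval_end + allowed


def deadline_from_actual_start(start_date: datetime, allowed: timedelta) -> datetime:
    """Deadline measured from when the run *actually* started."""
    return start_date + allowed


def is_late(now: datetime, deadline: datetime) -> bool:
    """A run that is still unfinished at `now` is late once the deadline has passed."""
    return now > deadline


# Hypothetical run: scheduled for midnight, but it only started 45 minutes later.
data_interval_end = datetime(2023, 9, 20, 0, 0)
start_date = datetime(2023, 9, 20, 0, 45)
allowed = timedelta(hours=1)
now = datetime(2023, 9, 20, 1, 30)

# Measured from the expected start, the deadline was 01:00, so the run is late.
late_vs_schedule = is_late(now, deadline_from_expected_start(data_interval_end, allowed))

# Measured from the actual start, the deadline is 01:45, so the run is not late yet.
late_vs_start = is_late(now, deadline_from_actual_start(start_date, allowed))

print(late_vs_schedule, late_vs_start)
```

With the same allowance, the two measures can disagree whenever a run
starts later than scheduled, which is why the choice between them matters
for what users expect the feature to do.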

Sung

On Wed, Sep 20, 2023 at 1:16 PM Damian Shaw <ds...@striketechnologies.com>
wrote:

> I'm not sure where most of the conversation is going on, but I would like
> to add a little experience previously working in large enterprises where
> SLAs were a part of day to day life.
>
> I would agree that both use cases are common:
>         1) Some alert or action for when a DAG or task *should* have
> completed but hasn't
>         2) Some alert or action for when a DAG or task exceeded some time
> from when it started
>
> In my experience the organizations I have worked with tended to refer to 1
> as an SLA (Service Level Agreement) and 2 as an OLA (Operational Level
> Agreement), the idea that a service as a whole should stick to agreed
> times, but also there should be operational agreements that only start once
> prerequisites have been completed. I am not sure if this is common in
> general or unique terminology within the organizations I have worked with.
>
> From my experience I would therefore agree that to many it would be
> unintuitive if the behavior is 2 but it is called SLA.
>
> Damian
>
> -----Original Message-----
> From: Daniel Standish <daniel.stand...@astronomer.io.INVALID>
> Sent: Wednesday, September 20, 2023 11:29 AM
> To: dev@airflow.apache.org
> Subject: Re: [DISCUSS] Mechanism of SLA
>
> I don't think of it as really a question about accurate record keeping but
> more a question of what an SLA is, i.e. when do you want the warning, or
> what do you want the warning based on.  I think that the idea has been that
> it really means, "if task not done by X time each day then warn".  And the
> way this was defined is dag schedule + timedelta.  And, it does seem that
> this is sort of a desired feature.  Indeed it just came up again in one of
> the keynotes. But, it will be nice to talk about it tomorrow and see what
> others think.
>
> Thanks for the blog post.  Reading it was productive for me.  I hadn't
> really considered the fact that the existing way that SLA works could be
> counterintuitive.  I can see how it could be.  You set it as a timedelta
> param on a task, and then this timedelta is added to the dag "should start"
> date, instead of task duration.  Anyway, again, look forward to chatting
> about it.

-- 
Sung Yun
Cornell Tech '20
Master of Engineering in Computer Science
