On Wed, Jun 26, 2019 at 9:45 PM Gerard Toonstra <gtoons...@gmail.com> wrote:

>
> That's not my experience of how SLAs work at the moment. I've observed
> them to currently work as follows:
>
> 1. An SLA is configured as the "time delta" after some dag execution
> schedule.
> 2. The SLA is configured at the task level (see the sketch below), so any
> tasks still running or yet to run after the "time delta" are aggregated
> into one "SLA email".
> 3. The email is sent only once, at the time the SLA is missed in the "dag
> run".
> 4. The email is sent by the scheduler, not some worker.
>
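> As a minimal sketch of what that task-level configuration looks like (the
> DAG and task names here are made up, against the 1.10-era API):
>
>     from datetime import datetime, timedelta
>
>     from airflow import DAG
>     from airflow.operators.dummy_operator import DummyOperator
>
>     dag = DAG(
>         dag_id="example_sla_dag",
>         start_date=datetime(2019, 6, 1),
>         schedule_interval="@daily",
>     )
>
>     # The SLA is a timedelta set per task, measured against the dag run's
>     # schedule; tasks not done in time end up aggregated in one SLA email.
>     load = DummyOperator(
>         task_id="load",
>         sla=timedelta(minutes=5),
>         dag=dag,
>     )
>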
> What I did notice:
>
> * If the scheduler cannot contact an email server, it will delay the
> scheduler loop.
>

A slight correction here: it is not the main scheduler loop but the
DAG-processing threads/loops. We had 5 DAGs with configured SLAs and the
mail server was not reachable. This delayed tasks on the scheduler by
minutes, because each time it processed these DAGs it would attempt to send
the not-yet-sent emails.


> * As the emails do not get sent, it tries again the next time a DAG
> configured with an SLA gets parsed, again impacting the scheduler loop.
> * If the SLA emails fail for a while and later succeed, you get one huge
> email with everything combined.
>
> What we decided is not to rely on Airflow SLAs, but to enforce and detect
> SLAs externally, based on success/fail metadata that we receive from
> Airflow.
>
> The rationale is:
> * we want better insight into when workflows (DAGs) are completed anyway,
> so we wanted DAG completion data available outside the Airflow DB,
> * we want to avoid any negative impact on the main scheduler loop due to
> mailing system availability.
>
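> One way to get that success/fail metadata out of Airflow is with task
> callbacks. This is only a rough sketch; emit_event() and its payload are
> hypothetical, not necessarily how our setup looks:
>
>     import json
>     from datetime import datetime
>
>     def emit_event(payload):
>         # Placeholder for whatever external sink is in use (queue, DB, metrics).
>         print(json.dumps(payload))
>
>     def record_completion(context):
>         # Runs on the worker, so a slow or unreachable mail server never
>         # blocks the scheduler's DAG-processing loops.
>         emit_event({
>             "dag_id": context["dag"].dag_id,
>             "task_id": context["task"].task_id,
>             "execution_date": context["execution_date"].isoformat(),
>             "state": context["task_instance"].state,
>             "recorded_at": datetime.utcnow().isoformat(),
>         })
>
>     # Attached per task (or via default_args):
>     #   on_success_callback=record_completion,
>     #   on_failure_callback=record_completion,
>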
>
> On Wed, Jun 26, 2019 at 9:18 PM Andrew Stahlman <astahl...@lyft.com.invalid>
> wrote:
>
>> Hi all,
>>
>> I'm looking to get some clarity on the intended behavior for
>> SLAs. This has come up several times in the past, but as far as I can
>> tell there hasn't been a definitive answer. As pointed out in
>> https://issues.apache.org/jira/browse/AIRFLOW-249 (open for several
>> years now):
>>
>>     the SLA logic is only being fired after following_schedule + sla
>>     has elapsed, in other words one has to wait for the next TI before
>>     having a chance of getting any email. Also the email reports
>>     dag.following_schedule time (I guess because it is close of
>>     TI.start_date), but unfortunately that doesn't match what the task
>>     instances shows nor the log filename
>>
>> Example: Consider a TI from a @daily DAG with execution date of Monday
>> at 00:00. It will start executing soon after Tuesday 00:00. If I set
>> the SLA to 5 minutes, I would expect an SlaMiss to be created at
>> Tuesday 00:05, but it's actually not created until *Wednesday* 00:05.
>>
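>> In concrete datetimes (this just illustrates the arithmetic, not the
>> scheduler code itself):
>>
>>     from datetime import datetime, timedelta
>>
>>     execution_date = datetime(2019, 6, 24)    # Monday 00:00, @daily run
>>     interval = timedelta(days=1)
>>     sla = timedelta(minutes=5)
>>
>>     expected = execution_date + interval + sla        # Tuesday 00:05
>>     observed = execution_date + 2 * interval + sla    # Wednesday 00:05
>>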
>> I find this behavior very surprising, and it seems I'm not the only
>> one (see [1], [2]). Can someone confirm whether this is really the
>> desired behavior?
>>
>> I think removing a single line [3] from the manage_slas implementation
>> would bring the behavior in line with what I expected - namely, that
>> an SlaMiss will be created based on:
>>
>>     execution_date + schedule_interval + sla
>>
>> ...as opposed to the current behavior of:
>>
>>     execution_date + (2 * schedule_interval) + sla
>>
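>> To make the effect of that one line concrete, here is a tiny standalone
>> model of the check (not the real manage_slas code; following_schedule()
>> below is a stand-in for dag.following_schedule() on a @daily schedule):
>>
>>     from datetime import datetime, timedelta
>>
>>     def following_schedule(dt):
>>         # Stand-in for dag.following_schedule() on a @daily schedule.
>>         return dt + timedelta(days=1)
>>
>>     now = datetime(2019, 6, 26, 0, 10)    # Wednesday 00:10
>>     dttm = datetime(2019, 6, 24)          # TI execution_date, Monday 00:00
>>     sla = timedelta(minutes=5)
>>
>>     dttm = following_schedule(dttm)       # the single line to remove
>>     while dttm < now:
>>         if following_schedule(dttm) + sla < now:
>>             print("SlaMiss recorded for", dttm)
>>         dttm = following_schedule(dttm)
>>
>> With that marked line in place, nothing is reported before Wednesday
>> 00:05; delete it and the first miss already shows up just after Tuesday
>> 00:05.
>>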
>> I'd be happy to open a PR for that if we reach consensus on the
>> desired behavior.
>>
>> Thanks,
>> Andrew
>>
>> [1] https://stackoverflow.com/questions/44071519/how-to-set-a-sla-in-airflow?rq=1
>> [2] https://issues.apache.org/jira/browse/AIRFLOW-2781
>> [3] https://github.com/apache/incubator-airflow/blob/6afb12f0e5c18e8634daa0119d6e5797aa770b80/airflow/jobs/scheduler_job.py#L425
>>
>
