The issue Andrew raised here, that large SLA volumes can totally gunk up
Airflow as a whole, is the main reason I do not make a habit of
recommending the use of SLAs. Fixing this would at least make the feature
serviceable and safe, which would be a big step up.
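
For anyone who hasn't read the PR yet, the core of the two-queue idea could be sketched roughly like this (a hypothetical illustration only, with made-up names, not the actual manager.py code):

```python
# Toy sketch of the two-queue fix: priority callbacks (e.g. SLA) are
# drained first, but a periodic rescan of dag files still happens even
# when the priority queue never empties. Names are illustrative only.
from collections import deque

class TwoQueueManager:
    def __init__(self, rescan_every=5):
        self.priority_queue = deque()   # e.g. SLA callbacks
        self.standard_queue = deque()   # ordinary dag-file work
        self.rescan_every = rescan_every
        self._ticks = 0
        self.processed = []
        self.rescans = 0

    def submit(self, item, priority=False):
        (self.priority_queue if priority else self.standard_queue).append(item)

    def step(self):
        # Rescan the "file system" on a fixed cadence, regardless of how
        # full the priority queue is -- the crux of the fix: scanning is
        # tracked independently of the queue contents.
        self._ticks += 1
        if self._ticks % self.rescan_every == 0:
            self.rescans += 1
        # Priority work is always drained before standard work.
        if self.priority_queue:
            self.processed.append(self.priority_queue.popleft())
        elif self.standard_queue:
            self.processed.append(self.standard_queue.popleft())
```

Even in a run where priority callbacks dominate every step, the rescan counter still advances on schedule, which is the behaviour the PR restores for dag file scanning.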

Collin


On Sun, Aug 21, 2022 at 8:07 AM Jarek Potiuk <[email protected]> wrote:

> I think generally that SLA has been broken for quite some time and while
> we are continuously saying "we need to fix it" there is no effort to do so.
> Not sure if this is because it is unimportant. But also maybe the feature
> actually is useful in a number of cases and not broken too much. And maybe
> we can fix it in some incremental steps rather than totally rewriting it.
>
> In this context - this change is supposed to fix one of the problems with
> SLA: if there are too many of those callbacks (which can happen if you have
> some failures) it can actually get to the point that it blocks the dag
> file processor from doing the "real" stuff.
> Adding a priority queue is a good idea, I think, and it's a good first
> step. There could be other ideas on top of it (de-duplicating the callbacks
> is, I think, already implemented), maybe eventually simply dropping some of
> the callbacks. But I think this looks promising to start with.
>
> WDYT everyone? Is there anyone else planning to do any serious stuff on
> SLA callbacks? Should we take a close look at that one? Mateusz Henc, I
> think this might also have a positive impact on finalizing AIP-43?
>
> J.
>
>
> On Thu, Aug 4, 2022 at 10:09 PM Andrew Gibbs <
> [email protected]> wrote:
>
>> Hi everyone,
>>
>> First time post to the dev list ... please be gentle!
>>
>> I raised a PR to fix SLA alerts (
>> https://github.com/apache/airflow/pull/25489) but it's not trivial.
>> Jarek Potiuk asked that I flag it up here where it might get more
>> attention, and I'm happy to oblige.
>>
>>
>> *A brief summary of the problem*
>> From the end user's perspective, adding SLAs means that the system stops
>> processing changes to the dag files.
>>
>> *A brief summary of the cause*
>> Adding an SLA to a task in a dag means SLA callbacks get raised and passed
>> back to the DagProcessorManager. The callbacks can get created in
>> relatively large volumes, enough that the manager's queue never empties.
>> This in turn causes the manager to stop checking the file system for
>> changes.
>>
>> *A brief summary of the fix*
>> The changes are confined to the manager.py file. What I have done is:
>> 1. Split the manager's queue into two (a standard and a priority queue).
>> 2. Tracked processing of the dags from disk independently of the queue, so
>> that we'll rescan even if the queue is not empty.
>> 3. Added a config flag that causes the manager to scan stale dags on
>> disk, even if there are a lot of priority callbacks.
>>
>> This means that SLA callbacks are processed and alerts are raised in a
>> timely fashion, but we continue to periodically scan the file system for
>> files and process any changes.
>>
>> *Other notes*
>> 1. First and foremost, if you're interested, please do have a look at the
>> PR. I have done my best to document it thoroughly. There are new tests too!
>> 2. The goal here is simply to make it so that adding SLAs doesn't kill
>> the rest of the system. I haven't changed how they're defined, how the
>> system raises them, or anything else. It's purely a fix to the queue(s)
>> inside the manager. It's as low touch as I could make it.
>> 3. I do have a *much* simpler fix (a one-line change), which works, but
>> isn't perfect, particularly under certain config settings. This change is
>> more complicated, but I think it solves the problem "properly".
>>
>> That's it. Thanks for reading!
>>
>> A
>>
>> --

Collin McNulty
Lead Airflow Engineer

Email: [email protected]
Time zone: US Central (CST UTC-6 / CDT UTC-5)


<https://www.astronomer.io/>
