The issue Andrew raised here, that large SLA volumes can totally gunk up Airflow as a whole, is the main reason I do not make a habit of recommending the use of SLAs. Fixing this would at least make the feature serviceable and safe, which would be a big step up.
Collin

On Sun, Aug 21, 2022 at 8:07 AM Jarek Potiuk <[email protected]> wrote:

> I think generally that SLA has been broken for quite some time, and while
> we keep saying "we need to fix it" there is no effort to do so. Not sure
> if this is because it is unimportant. But maybe the feature actually is
> useful in a number of cases and not broken too much, and maybe we can fix
> it in incremental steps rather than totally rewriting it.
>
> In this context - this change is supposed to fix one of the problems with
> SLA: if there are too many callbacks (which can be generated if you have
> some failures), it might get to the point that they block the dag file
> processor from doing the "real" stuff.
>
> Adding a priority queue is a good idea I think, and it's a good first
> step. There could be other ideas in place (de-duplicating the callbacks
> is, I think, already implemented), maybe simply dropping some of the
> callbacks eventually. But I think this looks promising to start with.
>
> WDYT everyone? Is there anyone else planning to do any serious stuff on
> SLA callbacks? Should we take a close look at that one? Mateusz Henc, I
> think this might also have some positive impact on finalizing AIP-43?
>
> J.
>
> On Thu, Aug 4, 2022 at 10:09 PM Andrew Gibbs
> <[email protected]> wrote:
>
>> Hi everyone,
>>
>> First time post to the dev list ... please be gentle!
>>
>> I raised a PR to fix SLA alerts
>> (https://github.com/apache/airflow/pull/25489), but it's not trivial.
>> Jarek Potiuk asked that I flag it up here where it might get more
>> attention, and I'm happy to oblige.
>>
>> *A brief summary of the problem*
>> To the end-user, adding SLAs means that your system stops processing
>> changes to the dag files.
>>
>> *A brief summary of the cause*
>> Adding an SLA to a task in a dag means SLA callbacks get raised and
>> passed back to the DagProcessorManager.
>> The callbacks can get created in relatively large volumes, enough that
>> the manager's queue never empties. This in turn causes the manager to
>> stop checking the file system for changes.
>>
>> *A brief summary of the fix*
>> The changes are confined to the manager.py file. What I have done is:
>> 1. Split the manager's queue into two (a standard and a priority queue).
>> 2. Track processing of the dags from disk independently of the queue,
>> so that we'll rescan even if the queue is not empty.
>> 3. Added a config flag that causes the manager to scan stale dags on
>> disk, even if there are a lot of priority callbacks.
>>
>> This means that SLA callbacks are processed and alerts are raised in a
>> timely fashion, but we continue to periodically scan the file system
>> and process any changes.
>>
>> *Other notes*
>> 1. First and foremost, if you're interested, please do have a look at
>> the PR. I have done my best to document it thoroughly. There are new
>> tests too!
>> 2. The goal here is simply to make it so that adding SLAs doesn't kill
>> the rest of the system. I haven't changed how SLAs are defined, how the
>> system raises them, or anything else. It's purely a fix to the queue(s)
>> inside the manager. It's as low-touch as I could make it.
>> 3. I do have a *much* simpler fix (a one-line change), which works but
>> isn't perfect, particularly under certain config settings. This change
>> is more complicated, but I think it solves the problem "properly".
>>
>> That's it. Thanks for reading!
>>
>> A

--
Collin McNulty
Lead Airflow Engineer
Email: [email protected]
Time zone: US Central (CST UTC-6 / CDT UTC-5)
<https://www.astronomer.io/>
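For anyone skimming the thread, the two-queue idea Andrew describes can be sketched roughly as below. This is a minimal illustration of the scheduling logic only, not the actual PR code — the class name, method names, and the `scan_interval` parameter are all invented for the example. The key point is that a file-system rescan is triggered on a deadline, so a backlog of priority (SLA) callbacks can never starve dag-file processing.

```python
import time
from collections import deque

class TwoQueueManagerSketch:
    """Illustrative only: prefers callbacks over file work, but always
    rescans the dag folder once per scan_interval, even with a backlog."""

    def __init__(self, scan_interval=5.0):
        self.file_queue = deque()      # normal dag-file processing requests
        self.callback_queue = deque()  # priority items, e.g. SLA callbacks
        self.scan_interval = scan_interval
        self._last_scan = float("-inf")

    def enqueue_callback(self, cb):
        self.callback_queue.append(cb)

    def run_once(self, now=None):
        """One loop iteration: rescan if the deadline passed, otherwise
        drain priority callbacks before ordinary file work."""
        now = time.monotonic() if now is None else now
        if now - self._last_scan >= self.scan_interval:
            self._last_scan = now
            return ("scan", None)  # rescan disk even if callbacks are pending
        if self.callback_queue:
            return ("callback", self.callback_queue.popleft())
        if self.file_queue:
            return ("file", self.file_queue.popleft())
        return ("idle", None)
```

With a single queue, a steady stream of callbacks would keep the loop busy forever; here the deadline check comes first, which is the essence of fix points 2 and 3 above.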
