Hi folks, I've created a branch off of v1-10-test; the diff can be found here: https://github.com/apache/incubator-airflow/compare/v1-10-test...Eronarn:sla_improvements
As a recap, this work is expected to do the following: - split the "sla" parameter into three independent SLAs: expected duration, expected start, and expected finish - move the SLA miss callback to be a task-level attribute rather than DAG-level (removing a lot of the "batching" functionality) - convert the SLA miss email to the default SLA miss callback - add a "type" to SLA misses, which will be part of the primary key, and can be checked against in the callback to respond appropriately to the type of SLA that was missed. - don't send SLA misses for skipped tasks, or for backfill jobs Before I polish up the remaining TODO functions and write a migration and tests, I'd appreciate feedback from the maintainers as to whether this seems to be on the right track, design-wise. (Note that it's definitely not going to pass tests right now; I am having significant problems getting Airflow's test suite running locally so I'm not even attempting at the moment.) Thanks, -James M. On Wed, May 9, 2018 at 12:43 PM, James Meickle <jmeic...@quantopian.com> wrote: > Hi all, > > Since the response so far has been positive or neutral, I intend to submit > one or more PRs targeting 2.0 (I think that some parts will be separable > from a larger SLA refactor). I intend to address at least the following > JIRA issues: > > https://issues.apache.org/jira/browse/AIRFLOW-2236 > https://issues.apache.org/jira/browse/AIRFLOW-1472 > https://issues.apache.org/jira/browse/AIRFLOW-1360 > https://issues.apache.org/jira/browse/AIRFLOW-557 > https://issues.apache.org/jira/browse/AIRFLOW-133 > > Regards, > > -James M. > > > > On Thu, May 3, 2018 at 12:13 PM, Maxime Beauchemin < > maximebeauche...@gmail.com> wrote: > >> About de-coupling the SLA management process, I've had conversations in >> the >> direction of renaming the scheduler to "supervisor" to reflect the fact >> that it's not just scheduling processes, it does a lot more tasks than >> just >> that, SLA management being one of them. >> >> I still think the default should be to require a single supervisor that >> would do all the "supervision" work though. I'm generally against >> requiring >> more types of nodes on the cluster. But perhaps the supervisor could have >> switches to be started in modes where it would only do a subset of its >> tasks, so that people can run multiple specialized supervisor nodes if >> they >> want to. >> >> For the record, I was thinking that renaming the scheduler to supervisor >> would likely happen as we re-write it to enable multiple concurrent >> supervisor processes. It turns out that parallelizing the scheduler hasn't >> been as critical as I thought it would be originally, especially with the >> current multi-process scheduler. Sounds like the community is getting a >> lot >> of mileage out of this current multi-process scheduler. >> >> Max >> >> On Thu, May 3, 2018 at 7:31 AM, Jiening Wen <jiening...@optiver.com> >> wrote: >> >> > I would love to see this proposal gets implemented in airflow. >> > In our case duration based SLA makes much more sense and I ended up >> adding >> > a decorator to the execute method in our custom operators. >> > >> > Best regards, >> > Jiening >> > >> > -----Original Message----- >> > From: James Meickle [mailto:jmeic...@quantopian.com] >> > Sent: Wednesday 02 May 2018 9:00 PM >> > To: dev@airflow.incubator.apache.org >> > Subject: Improving Airflow SLAs [External] >> > >> > At Quantopian we use Airflow to produce artifacts based on the previous >> > day's stock market data. These artifacts are required for us to trade on >> > today's stock market. Therefore, I've been investing time in improving >> > Airflow notifications (such as writing PagerDuty and Slack >> integrations). >> > My attention has turned to Airflow's SLA system, which has some >> drawbacks >> > for our use case: >> > >> > 1) Airflow SLAs are not skip-aware, so a task that has an SLA but is >> > skipped for this execution date will still trigger emails/callbacks. >> This >> > is a huge problem for us because we run almost no tasks on weekends >> (since >> > the stock market isn't open). >> > >> > 2) Defining SLAs can be awkward because they are relative to the >> execution >> > date instead of the task start time. There's no way to alert if a task >> runs >> > for "more than an hour", for any non-trivial DAG. Instead you can only >> > express "more than an hour from execution date". The financial data we >> use >> > varies in when it arrives, and how long it takes to process (data volume >> > changes frequently); we also have tight timelines that make retries >> > difficult, so we want to alert an operator while leaving the task >> running, >> > rather than failing and then alerting. >> > >> > 3) SLA miss emails don't have a subject line containing the instance URL >> > (important for us because we run the same DAGs in both >> staging/production) >> > or the execution date they apply to. When opened, they can get hard to >> read >> > for even a moderately sized DAG because they include a flat list of task >> > instances that are unsorted (neither alpha nor topo). They are also >> lacking >> > any links back to the Airflow instance. >> > >> > 4) SLA emails are not callbacks, and can't be turned off (other than >> either >> > removing the SLA or removing the email attribute on the task instance). >> The >> > way that SLA miss callbacks are defined is not intuitive, as in >> contrast to >> > all other callbacks, they are DAG-level rather than task-level. Also, >> the >> > call signature is poorly defined: for instance, two of the arguments are >> > just strings produced from the other two arguments. >> > >> > I have some thoughts about ways to fix these issues: >> > >> > 1) I just consider this one a bug. If a task instance is skipped, that >> was >> > intentional, and it should not trigger any alerts. >> > >> > 2) I think that the `sla=` parameter should be split into something like >> > this: >> > >> > `expected_start`: Timedelta after execution date, representing when this >> > task must have started by. >> > `expected_finish`: Timedelta after execution date, representing when >> this >> > task must have finished by. >> > `expected_duration`: Timedelta after task start, representing how long >> it >> > is expected to run including all retries. >> > >> > This would give better operator control over SLAs, particularly for >> tasks >> > deeper in larger DAGs where exact ordering may be hard to predict. >> > >> > 3) The emails should be improved to be more operator-friendly, and take >> > into account that someone may get a callback for a DAG they don't know >> very >> > well, or be paged by this notification. >> > >> > 4.1) All Airflow callbacks should support a list, rather than requiring >> a >> > single function. (I've written a wrapper that does this, but it would be >> > better for Airflow to just handle this itself.) >> > >> > 4.2) SLA miss callbacks should be task callbacks that receive context, >> like >> > all the other callbacks. Having a DAG figure out which tasks have missed >> > SLAs collectively is fine, but getting SLA failures in a batched >> callback >> > doesn't really make much sense. Per-task callbacks can be fired >> > individually within a batch of failures detected at the same time. >> > >> > 4.3) SLA emails should be the default SLA miss callback function, rather >> > than being hardcoded. >> > >> > Also, overall, the SLA miss logic is very complicated. It's stuffed into >> > one overloaded function that is responsible for checking for SLA misses, >> > creating database objects for them, filtering tasks, selecting emails, >> > rendering, and sending. Refactoring it would be a good maintainability >> win. >> > >> > I am already implementing some of the above in a private branch, but >> I'd be >> > curious to hear community feedback as to which of these suggestions >> might >> > be desirable upstream. I could have this ready for Airflow 2.0 if there >> is >> > interest beyond my own use case. >> > >> > >