Hi all,

I’m working with Dennis on Deadline Alerts (AIP-86). I'd like to discuss 
implementation approaches for executing callbacks when Deadline Alerts are 
triggered. As you may know, the old SLA feature has been removed, and we're 
planning to introduce Deadline Alerts as a replacement in 3.1. When a deadline 
is missed, we need a mechanism to execute callbacks (which could be 
notifications or other actions).

I’ve identified two main approaches:

Option 1: Scheduler-based
In this approach, the scheduler would check at a regular interval whether the 
earliest deadline has passed and, if so, queue the callback to run in an 
executor (local or remote). The executor would be specified when creating the 
deadline alert; if none is specified, the default executor would be used.
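
To make option 1 concrete, here is a minimal sketch of the check-and-queue 
step. It is illustrative only: Deadline, check_deadlines, and the dict of 
per-executor queues are hypothetical names I made up for this email, not 
existing Airflow APIs, and a real implementation would read deadlines from the 
metadata database rather than an in-memory heap.

```python
import heapq
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Dict, List, Optional


@dataclass(order=True)
class Deadline:
    """Hypothetical stand-in for a deadline alert record (not an Airflow class)."""

    due: datetime
    callback: Callable[[], None] = field(compare=False)
    # Executor chosen when the alert was created; None means "use the default".
    executor: Optional[str] = field(default=None, compare=False)


def check_deadlines(
    pending: List[Deadline],
    now: datetime,
    queues: Dict[str, List[Callable[[], None]]],
    default_executor: str = "default",
) -> int:
    """Queue the callback of every missed deadline; return how many were queued.

    `pending` is a heap ordered by due time, so only the earliest deadline
    needs to be inspected on each scheduler pass.
    """
    queued = 0
    while pending and pending[0].due <= now:
        deadline = heapq.heappop(pending)
        target = deadline.executor or default_executor
        # The callback is only queued here; the executor runs it later.
        queues.setdefault(target, []).append(deadline.callback)
        queued += 1
    return queued
```

The scheduler would call something like check_deadlines once per loop 
iteration. The callback then still has to wait for its executor to pick it up.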

Option 2: New DeadlineProcessor process
In this approach, a new process, similar to the triggerer or dag-processor and 
completely independent of the scheduler, would check for deadlines at a regular 
interval and run the callbacks itself, without queueing them in an executor.
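
As a rough sketch of option 2, the loop below checks for missed deadlines and 
runs their callbacks in-process. All names here (process_once, 
run_deadline_processor, the list of (due, callback) pairs) are hypothetical 
and only meant to show the shape of the idea, not a proposed API.

```python
import time
from datetime import datetime, timezone
from typing import Callable, List, Optional, Tuple

# Hypothetical in-memory representation: (due time, callback to run).
PendingDeadline = Tuple[datetime, Callable[[], None]]


def process_once(pending: List[PendingDeadline], now: datetime) -> int:
    """Run the callbacks of all missed deadlines directly; return how many ran."""
    missed = [d for d in pending if d[0] <= now]
    # Keep only the deadlines that have not passed yet.
    pending[:] = [d for d in pending if d[0] > now]
    for _, callback in sorted(missed, key=lambda d: d[0]):
        try:
            callback()
        except Exception as exc:
            # One failing callback should not take down the whole process.
            print(f"deadline callback failed: {exc!r}")
    return len(missed)


def run_deadline_processor(
    pending: List[PendingDeadline],
    interval_seconds: float = 5.0,
    max_loops: Optional[int] = None,
) -> None:
    """Poll at a fixed interval, independently of the scheduler."""
    loops = 0
    while max_loops is None or loops < max_loops:
        process_once(pending, datetime.now(timezone.utc))
        loops += 1
        time.sleep(interval_seconds)
```

Because the callback runs inside the processor itself, there is no executor 
queue between detecting the missed deadline and running the callback.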

Multi-team considerations: For multi-team support, planned for later this year, 
option 2 would be relatively simple to implement. With option 1, however, the 
callbacks would have to run on a remote executor, since there would be no local 
executor.

I recommend going with option 2 because:

  *   It would be more robust and resilient: callbacks could still run in the 
presence of certain kinds of issues, such as the scheduler being bogged down
  *   It would also run the callbacks almost instantly instead of having to 
wait for an executor (especially if there’s a long queue of tasks or a 
cold-start delay)
     *   This could be mitigated by a priority system in which deadline 
callbacks are prioritized over regular tasks, but as far as I currently 
understand Airflow’s architecture, that is a non-trivial problem
  *   It would avoid a potential slight increase in workload for the scheduler
     *   The additional scheduler workload under option 1 would be checking, 
at a regular interval, whether the earliest deadline has passed

However, option 2 would introduce another process for admins to deploy and 
manage, and would likely require more effort to implement and therefore take 
longer to complete.

So, I’d like to hear your thoughts on these approaches, anything I may have 
missed, and whether you agree or disagree with this direction. Thank you for 
your input!


Best,

Ramit Kataria
SDE at AWS
