I was able to chat with a couple folks about this. Small sample, but the sentiment was, "this is just a timeout". In other words, if we're going to call this SLA, we really ought to evaluate against the "this thing should have run by" time and not the actual start time. And, ideally, we should also have a way to enforce "this should have run by X time daily" (for example) even when it's a dataset-triggered or API-triggered dag with *no* schedule.
Like I said, it's a small number of folks I've talked to, so I don't have overwhelming confidence about this assessment. But I do think it's more likely than not that this would be the prevailing assessment were somehow able to get better data on this.