1fanwang opened a new pull request, #66807:
URL: https://github.com/apache/airflow/pull/66807

   ### Problem
   
   `dagrun.first_task_scheduling_delay` measures `data_interval_end → 
first_start_date`, which conflates two distinct latencies: scheduler latency to 
enqueue the first task, and executor latency to pick the task up. When a Dag 
run's first task starts late, that single timer can't tell ops which phase is 
slow.
   
   The executor-pickup portion (`queued_at → first_start_date`) has no metric 
today.
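
As a rough illustration of the overlap, the arithmetic below uses made-up timestamps for a single Dag run. The values and variable names are assumptions chosen for the example, not taken from the PR or from Airflow's code.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical timestamps for one Dag run (illustrative values only).
data_interval_end = datetime(2024, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
queued_at = data_interval_end + timedelta(seconds=5)
first_start_date = queued_at + timedelta(seconds=40)

# What the existing metric reports: one number covering both phases.
scheduling_delay = first_start_date - data_interval_end

# The portion the text above attributes to executor pickup.
pickup_delay = first_start_date - queued_at

# Whatever is left once the pickup portion is subtracted.
remainder = scheduling_delay - pickup_delay

print(scheduling_delay.total_seconds())  # 45.0
print(pickup_delay.total_seconds())      # 40.0
print(remainder.total_seconds())         # 5.0
```

With only the 45-second figure, an operator cannot tell that 40 of those seconds fall after `queued_at`.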
   
   ### Fix
   
   Add `dagrun.first_task_start_delay`, computed as `first_start_date - 
queued_at` on Dag run completion, tagged by `dag_id` and `run_type` to match 
the existing tag shape on `first_task_scheduling_delay`. It is emitted next to 
the existing scheduling-delay metric, only when `queued_at` is set and the 
delta is positive. The existing metric is unchanged.
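
The emission rule described above can be sketched roughly as follows. This is a minimal standalone approximation, not the code in the PR: the helper name `maybe_emit_first_task_start_delay`, its parameters, and the mocked `stats` object are assumptions for illustration, and the real change lives inside the Dag run's state-update path.

```python
from datetime import datetime, timedelta, timezone
from unittest import mock


def maybe_emit_first_task_start_delay(
    stats, *, dag_id, run_type, queued_at, first_start_date
):
    """Emit the pickup delay if it can be computed; return True when emitted.

    Hypothetical helper: `stats` is any object exposing
    timing(name, delta, tags=...), standing in for the real stats client.
    """
    # No queued_at (or no started task) means there is nothing to measure.
    if queued_at is None or first_start_date is None:
        return False

    delay = first_start_date - queued_at
    # Only positive deltas are reported, per the description above.
    if delay <= timedelta(0):
        return False

    stats.timing(
        "dagrun.first_task_start_delay",
        delay,
        tags={"dag_id": dag_id, "run_type": run_type},
    )
    return True


# Usage with a mocked stats client and made-up timestamps.
stats = mock.Mock()
queued = datetime(2024, 1, 1, tzinfo=timezone.utc)

emitted = maybe_emit_first_task_start_delay(
    stats,
    dag_id="example_dag",
    run_type="scheduled",
    queued_at=queued,
    first_start_date=queued + timedelta(seconds=40),
)
skipped = maybe_emit_first_task_start_delay(
    stats,
    dag_id="example_dag",
    run_type="scheduled",
    queued_at=None,
    first_start_date=queued + timedelta(seconds=40),
)
```

Returning a boolean is a convenience of this sketch so the skip path is easy to assert on; the PR description does not say the real code does this.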
   
   ### Tests
   
   `test_emit_first_task_start_delay` constructs a scheduled Dag run with a 
known `queued_at` and a known first-task `start_date`, mocks `stats.timing`, 
calls `update_state`, and asserts that the new metric is emitted with the 
expected delta and tags. A parametrised case with `queued_at = None` confirms 
that the new metric is not emitted when no `queued_at` is recorded.
   
   Closes #66802
   


