ferruzzi commented on issue #68294:
URL: https://github.com/apache/airflow/issues/68294#issuecomment-4725975650
For posterity: to check whether I got them all, I made a list of the
metrics I have included in these 7 PRs, then pointed Claude Opus 4.8 at the
codebase with the following prompt:
```
A metric can be tied to a team if:
- one of the following is available at the emit site: dag_id, bundle_name,
pool, team-scoped executor, team-scoped triggerer, connection-test row,
TaskInstance / RuntimeTaskInstance, or DagRun context
- AND it is NOT a shared-infrastructure or cross-team-aggregate metric
because those cannot be attributed to any single team even when a team identity
is technically nearby.
Metrics which do not have emit sites, metrics which only exist in providers,
and metrics which can not be tied to a team are considered out of scope for
this discussion.
Metrics which are included in the PR list below are considered done and
should not be included in the list of remaining work.
1. Read the list of metrics from the metrics registry at
`shared/observability/src/airflow_shared/observability/metrics/metrics_template.yaml`
2. For each metric in the registry, locate and read its emit site and
determine if it is in scope
3. Generate a report which contains three lists:
- 1) Done: Metrics which are included in the PR list below, or whose
emitted tags include team_name either directly or via an indirect source such
as stats_tags/metric_tags
- 2) Remaining work: Metrics which can be linked to a Team and do not
include the team_name tag.
- 3) Out of Scope: Metrics which are out of scope, sorted by the reason
(legacy, provider, or non-team)
4. Finally, cross-check those lists to ensure that every metric you found
exists on one and only one of those lists.
Related PRs:
- 68108
- 68367
- 68589
- 68593
- 68594
- 68599
- 68601
```
and this was the generated report:
# Multi-Team Metrics: Final Count
Categorization of all 125 metrics in the registry (`metrics_template.yaml`),
based on the 7 related PRs and a trace of every emit site.
## How the 7 PRs map
The PRs explicitly tag 47 metrics. A further 9 are already covered indirectly
because they emit through `DagRun.stats_tags` / `TaskInstance.stats_tags` /
`RuntimeTaskInstance.stats_tags`, all of which were given `team_name` in 68108.
So 56 are Done.
## 1) Done (56)
Directly tagged by a listed PR:
- 68108 (DagRun/TI stats_tags + pools + SDK): `pool.open_slots`,
`pool.queued_slots`, `pool.running_slots`, `pool.deferred_slots`,
`pool.scheduled_slots`, `pool.starving_tasks`, `ti.start`,
`operator_successes`, `operator_failures`, `ti_successes`, `ti_failures`,
`task.duration`
- 68367 (assets): `asset.updates`, `asset.triggered_dagruns`
- 68589 (deadlines): `deadline_alerts.deadline_created`,
`deadline_alerts.deadline_missed`, `deadline_alerts.deadline_not_missed`
- 68593 (executors): `executor.open_slots`, `executor.queued_tasks`,
`executor.running_tasks`
- 68594 (scheduler): `scheduler.tasks.killed_externally`,
`dagrun.schedule_delay`, `dagrun.duration.failed`, `ti.scheduled`, `ti.queued`,
`ti.running`, `ti.deferred`, `task_instances_without_heartbeats_killed`
- 68599 (dag processor): `dag_processing.other_callback_count`,
`dag_processing.last_run.seconds_ago`, `dag_processing.processes`,
`dag_processing.processor_timeouts`, `dag_processing.callback_only_count`,
`dag_processing.last_duration`, `dag.callback_exceptions`
- 68601 (stragglers): `triggerer_heartbeat`, `triggers.succeeded`,
`triggers.failed`, `triggers.running`, `triggerer.capacity_left`,
`task.scheduled_duration`, `task.queued_duration`,
`resumable_job.fresh_submit`, `resumable_job.already_succeeded`,
`resumable_job.terminal_resubmit`, `resumable_job.reconnect_attempt`,
`resumable_job.reconnect_success`
Done indirectly (emit through a stats_tags source that now carries
`team_name`):
- `ti.finish` (task-sdk `run()`, shares the `ti.stats_tags` variable)
- `previously_succeeded` (`taskinstance.py`, `ti.stats_tags`)
- `task_removed_from_dag`, `task_restored_to_dag`, `task_instance_created`
(`dagrun.py`, `self.stats_tags`)
- `dagrun.dependency-check`, `dagrun.first_task_scheduling_delay`,
`dagrun.first_task_start_delay`, `dagrun.duration.success` (`dagrun.py`,
`self.stats_tags`)
## 2) Remaining work (6)
Team-attributable, emit site has a team identity, but no `team_name` tag yet:
- `connection_test.reaped` (scheduler reaper; `ct.team_name` is literally in
scope and logged one line above the `stats.incr`, only `prior_state` is tagged)
- `connection_test.success` (worker connection-test supervisor; per
connection-test row)
- `connection_test.failed` (same)
- `connection_test.hook_duration` (same)
- `scheduler.executor_heartbeat_duration` (loops over `self.executors`; each
`executor.team_name` is available, tagged only by executor class)
- `triggers.blocked_main_thread` (emitted in a team-scoped triggerer's
`TriggerRunner`; `team_name` lives on the runner/supervisor but is not plumbed
to this emit)
Two of these are more borderline than the others:
`scheduler.executor_heartbeat_duration` and `triggers.blocked_main_thread` are
component-level (team-scoped executor / team-scoped triggerer). They satisfy
the signal list, but if those components are considered "shared," they'd drop
to non-team. The four `connection_test.*` rows are clean Remaining items per
the explicit "connection-test row" signal.
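
The scoping rule from the prompt, and the way the two borderline metrics flip depending on whether their component counts as "shared," can be sketched as a small decision function. This is only an illustration: `EmitSite`, `classify`, the field names, and the signal identifiers are hypothetical and do not exist in the Airflow codebase, and the "included in a listed PR" route to Done is folded into a single `tagged_with_team_name` flag.

```python
from dataclasses import dataclass

# Signal names paraphrase the list in the prompt; the identifiers are illustrative.
TEAM_SIGNALS = frozenset({
    "dag_id", "bundle_name", "pool", "team_scoped_executor",
    "team_scoped_triggerer", "connection_test_row", "task_instance", "dag_run",
})


@dataclass(frozen=True)
class EmitSite:
    """Hypothetical summary of what a reviewer learns from reading one emit site."""
    metric: str
    signals: frozenset = frozenset()     # team-identifying values available at the site
    has_emit: bool = True                # False when the registry entry is never emitted
    provider_only: bool = False          # True when the only emit sites live in providers
    shared_or_aggregate: bool = False    # shared infrastructure or cross-team aggregate
    tagged_with_team_name: bool = False  # team_name already emitted, directly or via stats_tags


def classify(site: EmitSite) -> str:
    """Place a metric on exactly one list, checking exclusions before team signals."""
    if not site.has_emit:
        return "out_of_scope:legacy"
    if site.provider_only:
        return "out_of_scope:provider"
    if site.shared_or_aggregate or not (site.signals & TEAM_SIGNALS):
        return "out_of_scope:non_team"
    return "done" if site.tagged_with_team_name else "remaining"


reaped = EmitSite("connection_test.reaped", frozenset({"connection_test_row"}))
heartbeat = EmitSite(
    "scheduler.executor_heartbeat_duration", frozenset({"team_scoped_executor"})
)
heartbeat_if_shared = EmitSite(
    "scheduler.executor_heartbeat_duration",
    frozenset({"team_scoped_executor"}),
    shared_or_aggregate=True,
)

print(classify(reaped))               # remaining
print(classify(heartbeat))            # remaining
print(classify(heartbeat_if_shared))  # out_of_scope:non_team
```

Checking `shared_or_aggregate` before the signal test mirrors the prompt's wording that a shared metric stays unattributable "even when a team identity is technically nearby."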
## 3) Out of Scope (63)
Sorted by reason.
### Legacy / no emit site (6)
- `local_task_job.task_exit` (only in the validator regex; no emit)
- `dag_file_processor_timeouts` (marked DEPRECATED; no emit)
- `dag_processing.manager_stalls` (no emit)
- `dag_file_refresh_error` (no emit)
- `dag_processing.last_num_of_db_queries.{dag_file}` (stored on
`DagFileStat`, never emitted)
- `collect_db_dags` (no emit anywhere)
### Provider-only (28)
- celery: `celery.task_timeout_error`, `celery.execute_command.failure`
- openlineage: `ol.emit.failed`, `ol.event.size`, `ol.emit.attempts`,
`ol.extract`
- edge3 worker: `edge_worker.status`, `edge_worker.connected`,
`edge_worker.maintenance`, `edge_worker.jobs_active`,
`edge_worker.concurrency`, `edge_worker.free_concurrency`,
`edge_worker.num_queues`, `edge_worker.heartbeat_count`,
`edge_worker.ti.start`, `edge_worker.ti.finish`
- edge3 executor: `edge_executor.sync.duration`
- cncf.kubernetes: `kubernetes_executor.pod_creation_status`,
`kubernetes_executor.pod_deletion_status`,
`kubernetes_executor.pod_patching_status`,
`kubernetes_executor.clear_not_launched_queued_tasks.duration`,
`kubernetes_executor.adopt_task_instances.duration`,
`kubernetes_executor.pod_creation`, `kubernetes_executor.pod_deletion`,
`kubernetes_executor.pod_patching`
- amazon: `batch_executor.adopt_task_instances.duration`,
`ecs_executor.adopt_task_instances.duration`,
`lambda_executor.adopt_task_instances.duration`
### Non-team (shared infrastructure or cross-team aggregate) (29)
- Job/component lifecycle: `{job_name}_start`, `{job_name}_end`,
`{job_name}_heartbeat_failure`, `scheduler_heartbeat`, `dag_processor_heartbeat`
- Scheduler aggregates/infra: `scheduler.orphaned_tasks.cleared`,
`scheduler.orphaned_tasks.adopted`, `scheduler.critical_section_busy`,
`scheduler.tasks.starving`, `scheduler.tasks.executable`,
`scheduler.dagruns.running`, `scheduler.critical_section_duration`,
`scheduler.critical_section_query_duration`, `scheduler.scheduler_loop_duration`
- Dag-processing aggregates: `dag_processing.file_path_queue_update_count`,
`dag_processing.import_errors`, `dag_processing.total_parse_time`,
`dag_processing.file_path_queue_size`, `dagbag_size`
- Assets aggregate: `asset.orphaned`
- API-server shared cache: `api_server.dag_bag.cache_hit`,
`api_server.dag_bag.cache_miss`, `api_server.dag_bag.cache_clear`,
`api_server.dag_bag.cache_size`
- Connection-test aggregates: `connection_test.active`,
`connection_test.pending`, `connection_test.dispatch_duration`
- SDK startup infra: `airflow.io.load_filesystems`, `serde.load_serializers`
## 4) Cross-check
- Done 56 + Remaining 6 + Legacy 6 + Provider 28 + Non-team 29 = 125, which
equals the registry count.
- Each metric appears on exactly one list; no metric is on two lists, and
none is unaccounted for.
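
For anyone who wants to repeat this cross-check mechanically, a minimal sketch follows. It verifies the reported totals and shows how a "one and only one list" check could be written. The `cross_check` helper and the five-metric sample are illustrative only; a real run would load the full set of names from `metrics_template.yaml`, whose schema is not assumed here.

```python
from collections import Counter


def cross_check(registry, lists):
    """Return (duplicates, missing, unknown) for a proposed partition of `registry`."""
    counts = Counter(name for names in lists.values() for name in names)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    missing = sorted(set(registry) - set(counts))   # in the registry, on no list
    unknown = sorted(set(counts) - set(registry))   # on a list, not in the registry
    return duplicates, missing, unknown


# Totals as stated in the report.
reported = {"done": 56, "remaining": 6, "legacy": 6, "provider": 28, "non_team": 29}
total = sum(reported.values())
assert reported["done"] == 47 + 9  # directly tagged + indirectly covered
assert total == 125                # registry count

# Tiny sample, one metric per list, just to exercise the helper.
sample_registry = [
    "ti.start", "connection_test.reaped", "collect_db_dags", "ol.extract", "dagbag_size",
]
sample_lists = {
    "done": ["ti.start"],
    "remaining": ["connection_test.reaped"],
    "legacy": ["collect_db_dags"],
    "provider": ["ol.extract"],
    "non_team": ["dagbag_size"],
}
result = cross_check(sample_registry, sample_lists)
assert result == ([], [], [])  # no duplicates, nothing missing, nothing unknown
```

A matching sum alone would not catch a metric listed twice alongside another that is missing, which is why the helper compares names rather than counts.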