syun64 opened a new issue, #29663: URL: https://github.com/apache/airflow/issues/29663
### Description

With recent PRs enabling tag support on StatsD metrics, we have gained a deeper understanding of the problem of publishing high-cardinality metrics. Through this issue, I hope to facilitate a discussion on categorizing the metric cardinality of Airflow-specific events and tags, and on finding a way to disable high-cardinality metrics in time for the 2.6.0 release.

In the world of observability and metrics, cardinality is broadly defined as:

`number of unique metric names * number of unique application tag pairs`

This means that events with an _unbounded_ number of tag pairs (key-value pairs of tags), as well as events with an _unbounded_ number of unique metric names, incur expensive storage requirements on the metrics backend.

Let's take a look at the following metric:

`local_task_job.task_exit.<job_id>.<dag_id>.<task_id>.<return_code>`

Here, we have 4 different variable/tag-like attributes embedded in the metric name, which I think we can categorize into 3 levels of cardinality:

1. High cardinality / unbounded metric
2. Medium cardinality / semi-bounded metric
3. Low cardinality / categorically-bounded metric

### High Cardinality / Unbounded Metric

Example tag: `<job_id>`

This category of metrics is strictly unbounded and incorporates a monotonically increasing attribute like `<job_id>` or `<run_id>`.

To demonstrate just how explosive the growth of these metrics can be, let's take an example. In an Airflow instance with 1,000 daily jobs and a metric retention period of 10 days, adding this tag alone increases the cardinality of a single metric by 10,000 (1,000 jobs per day × 10 days). If we add this tag to a few other metrics, the result could easily be an explosion of metric cardinality.

As a benchmark, [DataDog's Enterprise-level pricing plan includes only 200 custom metrics per host](https://www.datadoghq.com/pricing/), and anything beyond that is charged at a premium. These metrics should be avoided at all costs.
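The growth arithmetic in the `<job_id>` example can be sketched as follows. This is a minimal illustration and not Airflow code: the function name and the `metrics_carrying_tag` parameter are hypothetical, and the sketch assumes that every job produces one new `<job_id>` value that stays in the metrics backend for the whole retention period.

```python
def series_added_by_job_id(
    daily_jobs: int,
    retention_days: int,
    metrics_carrying_tag: int = 1,
) -> int:
    """Estimate the distinct series retained when <job_id> is attached to metrics.

    Assumption: each job yields one new <job_id> value, and each value remains
    in the metrics backend until the retention period expires.
    """
    return daily_jobs * retention_days * metrics_carrying_tag


# The example from the text: 1000 daily jobs, 10-day retention, one metric.
single_metric = series_added_by_job_id(daily_jobs=1000, retention_days=10)
print(single_metric)  # 10000

# Hypothetical extension: the same tag attached to five metrics.
five_metrics = series_added_by_job_id(1000, 10, metrics_carrying_tag=5)
print(five_metrics)  # 50000
```

The product grows linearly in each factor, so lengthening the retention period or attaching the tag to more metrics multiplies the cost.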
### Medium Cardinality / semi-bounded metric

Example tags: `<dag_id>`, `<task_id>`

This category of metrics is semi-bounded. These metrics are not bounded by a predefined set of enum values, but they are bounded by the number of DAGs or tasks within an Airflow infrastructure. This means that although they lead to higher cardinality as the number of DAGs in a cluster grows, cardinality is still temporally bounded, i.e. a given cluster will maintain its level of cardinality over time.

### Low Cardinality / categorically-bounded metric

Example tag: `<return_code>`

This category of metrics is strictly bounded by a set of enum values. `<return_code>` and `<task_state>` are good examples of attributes with low cardinality. Ideally, we would only want to publish metrics with this level of cardinality.

Using the above definition of high cardinality, I've identified the following metrics as examples that meet this criterion:

- https://github.com/apache/airflow/blob/main/airflow/jobs/local_task_job.py#L292
- https://github.com/apache/airflow/blob/main/airflow/dag_processing/processor.py#L444
- https://github.com/apache/airflow/blob/main/airflow/jobs/scheduler_job.py#L691
- https://github.com/apache/airflow/blob/main/airflow/jobs/scheduler_job.py#L1584
- https://github.com/apache/airflow/blob/main/airflow/models/dag.py#L1331
- https://github.com/apache/airflow/blob/main/airflow/models/taskinstance.py#L1258
- https://github.com/apache/airflow/blob/main/airflow/models/taskinstance.py#L1577
- https://github.com/apache/airflow/blob/main/airflow/models/taskinstance.py#L1847

I would like to propose that we provide an option to disable 'unbounded metrics' in the 2.6.0 release. To ensure backward compatibility, we could keep the default behavior of publishing all metrics, but implement a single Boolean flag to disable these high-cardinality metrics.
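One way the proposed Boolean flag could behave is sketched below. This is only an illustration under stated assumptions and not Airflow's actual implementation: `UNBOUNDED_ATTRIBUTES`, `should_publish`, and the `disable_unbounded` parameter are hypothetical names, and the set of unbounded attributes is taken from the `<job_id>` and `<run_id>` examples in this issue.

```python
from typing import Iterable

# Hypothetical: attributes this issue classifies as unbounded.
UNBOUNDED_ATTRIBUTES = frozenset({"job_id", "run_id"})


def should_publish(attributes: Iterable[str], disable_unbounded: bool = False) -> bool:
    """Decide whether a metric carrying the given attributes is published.

    With the flag off (the default), every metric is published, which keeps
    the existing behavior. With the flag on, any metric that embeds an
    unbounded attribute is skipped.
    """
    if not disable_unbounded:
        return True
    return UNBOUNDED_ATTRIBUTES.isdisjoint(attributes)


# Attributes embedded in local_task_job.task_exit.<job_id>.<dag_id>.<task_id>.<return_code>
task_exit_attrs = ["job_id", "dag_id", "task_id", "return_code"]

print(should_publish(task_exit_attrs))                          # True
print(should_publish(task_exit_attrs, disable_unbounded=True))  # False
print(should_publish(["dag_id", "task_id"], disable_unbounded=True))  # True
```

Defaulting the flag to `False` matches the backward-compatibility goal stated above: nothing changes for existing deployments unless they opt in.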
### Use case/motivation

_No response_

### Related issues

- https://github.com/apache/airflow/pull/28961
- https://github.com/apache/airflow/pull/29093

### Are you willing to submit a PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)