syun64 opened a new issue, #29663:
URL: https://github.com/apache/airflow/issues/29663

   ### Description
   
   With recent PRs enabling tag support for StatsD metrics, we have gained a 
deeper understanding of the problems caused by publishing high-cardinality 
metrics. Through this issue, I hope to start a discussion on categorizing the 
cardinality of Airflow-specific metrics and tags, and on finding a way to 
disable high-cardinality metrics in time for the 2.6.0 release.
   
   In the world of Observability & Metrics, cardinality is broadly defined as 
the following:
   
   `number of unique metric names * number of unique application tag pairs`
   
   This means that events with an _unbounded_ number of tag pairs (key-value 
pairs) as well as events with an _unbounded_ number of unique metric names will 
incur expensive storage requirements on the metrics backend.
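
As a rough sketch of the definition above, the product can be computed from the unique names and unique tag pairs that have been seen. The function name and the sample metric names are made up for illustration and are not Airflow APIs; the result is an upper bound, since not every name is necessarily emitted with every tag pair.

```python
def cardinality_upper_bound(metric_names, tag_pairs):
    """Upper bound on distinct series: unique names * unique tag pairs."""
    return len(set(metric_names)) * len(set(tag_pairs))


# Hypothetical sample data, not taken from Airflow.
names = ["example.task_exit", "example.task_start", "example.task_exit"]
pairs = [("state", "success"), ("state", "failed"), ("state", "success")]

bound = cardinality_upper_bound(names, pairs)
print(bound)  # 2 unique names * 2 unique tag pairs = 4
```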
   
   Let's take a look at the following metric:
   
   `local_task_job.task_exit.<job_id>.<dag_id>.<task_id>.<return_code>`
   
   Here, we have 4 different variable/tag-like attributes embedded into the 
metric name that I think we can categorize into 3 levels of cardinality.
   
   1. High cardinality / Unbounded metric
   2. Medium cardinality / semi-bounded metric
   3. Low cardinality / categorically-bounded metric
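
To make the categorization concrete, here is a minimal sketch that pulls the placeholder attributes out of the metric name template and looks up a level for each. The level assigned to each attribute follows the categories described in this issue; the enum, the mapping, and the function names are hypothetical and only for illustration.

```python
import re
from enum import Enum


class Cardinality(Enum):
    LOW = 1      # categorically bounded
    MEDIUM = 2   # semi-bounded
    HIGH = 3     # unbounded


# Assumed mapping, based on the three categories proposed in this issue.
ATTRIBUTE_LEVELS = {
    "job_id": Cardinality.HIGH,
    "run_id": Cardinality.HIGH,
    "dag_id": Cardinality.MEDIUM,
    "task_id": Cardinality.MEDIUM,
    "return_code": Cardinality.LOW,
    "task_state": Cardinality.LOW,
}


def classify_template(template):
    """Map each <placeholder> in a metric name template to its level."""
    attributes = re.findall(r"<(\w+)>", template)
    return {name: ATTRIBUTE_LEVELS[name] for name in attributes}


def worst_level(template):
    """A metric is only as bounded as its least bounded attribute."""
    levels = classify_template(template).values()
    return max(levels, key=lambda level: level.value)


template = "local_task_job.task_exit.<job_id>.<dag_id>.<task_id>.<return_code>"
levels = classify_template(template)
overall = worst_level(template)
```

Here `overall` comes out as `Cardinality.HIGH`, because a single unbounded attribute is enough to make the whole metric unbounded.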
   
   ### High Cardinality / Unbounded Metric
   Example tag: `<job_id>`
   
   Metrics in this category are strictly unbounded: they incorporate a 
monotonically increasing attribute such as `<job_id>` or `<run_id>`. To 
demonstrate how explosive the growth of these metrics can be, take an Airflow 
instance with 1000 daily jobs and a metric retention period of 10 days. Adding 
this tag alone increases the cardinality of a single metric by 10,000. Adding 
the tag to a few other metrics could easily result in an explosion of metric 
cardinality. As a benchmark, [DataDog's Enterprise-level pricing plan only 
includes 200 custom metrics per host](https://www.datadoghq.com/pricing/), and 
anything beyond that is charged at a premium. These metrics should be avoided 
at all costs.
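
The arithmetic behind that example can be written out as a small sketch. The function name is made up, and the figure of 5 metrics carrying the tag is an arbitrary illustration of "a few other metrics", not a count taken from Airflow.

```python
def added_series(new_tag_values_per_day, retention_days, metrics_with_tag=1):
    """Series retained by the backend due to one unbounded tag."""
    return new_tag_values_per_day * retention_days * metrics_with_tag


# 1000 daily jobs, each with a fresh job_id, retained for 10 days.
single_metric = added_series(1000, 10)
print(single_metric)  # 10000

# The same tag attached to 5 metrics (hypothetical count).
several_metrics = added_series(1000, 10, metrics_with_tag=5)
print(several_metrics)  # 50000
```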
   
   ### Medium Cardinality / Semi-bounded Metric
   Example tags: `<dag_id>`, `<task_id>`
   
   Metrics in this category are semi-bounded. They are not bounded by a 
predefined set of enum values, but they are bounded by the number of DAGs or 
tasks within an Airflow deployment. This means that although cardinality grows 
as more DAGs are added to an Airflow cluster, it does not keep growing with 
time alone, i.e. a given cluster will maintain its level of cardinality over 
time.
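
A toy simulation may help show the difference between a semi-bounded and an unbounded attribute. It assumes a fixed set of DAGs that each run once per day with a fresh job id, and it ignores retention; the function and the `dag_<n>` naming are invented for this sketch.

```python
from itertools import count


def distinct_tag_values(num_dags, days):
    """Return (distinct dag_ids, distinct job_ids) seen after each day."""
    job_ids = count(1)  # job ids only ever increase
    seen_dag_ids, seen_job_ids = set(), set()
    history = []
    for _ in range(days):
        for dag_number in range(num_dags):
            seen_dag_ids.add(f"dag_{dag_number}")
            seen_job_ids.add(next(job_ids))
        history.append((len(seen_dag_ids), len(seen_job_ids)))
    return history


history = distinct_tag_values(num_dags=3, days=4)
print(history)  # [(3, 3), (3, 6), (3, 9), (3, 12)]
```

The `dag_id` count stays flat at the number of DAGs, while the `job_id` count grows every day.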
   
   ### Low Cardinality / Categorically-bounded Metric
   Example tag: `<return_code>`
   
   Metrics in this category are strictly bounded by a fixed set of enum 
values. `<return_code>` and `<task_state>` are good examples of attributes with 
low cardinality. Ideally, we would only publish metrics with this level of 
cardinality.
   
   Using the above definition of High Cardinality, I've identified the 
following metrics as examples that meet this criterion.
   
   
   https://github.com/apache/airflow/blob/main/airflow/jobs/local_task_job.py#L292
   https://github.com/apache/airflow/blob/main/airflow/dag_processing/processor.py#L444
   https://github.com/apache/airflow/blob/main/airflow/jobs/scheduler_job.py#L691
   https://github.com/apache/airflow/blob/main/airflow/jobs/scheduler_job.py#L1584
   https://github.com/apache/airflow/blob/main/airflow/models/dag.py#L1331
   https://github.com/apache/airflow/blob/main/airflow/models/taskinstance.py#L1258
   https://github.com/apache/airflow/blob/main/airflow/models/taskinstance.py#L1577
   https://github.com/apache/airflow/blob/main/airflow/models/taskinstance.py#L1847
   
   I would like to propose that we provide an option to disable 'Unbounded 
metrics' in the 2.6.0 release. To ensure backward compatibility, we could keep 
publishing all metrics by default, but implement a single Boolean flag to 
disable these high-cardinality metrics.
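
One possible shape for such a flag, purely as a sketch: a thin gate in front of the emit call that drops any metric carrying an unbounded attribute when the flag is on, and lets everything through by default. The class, the `disable_unbounded` parameter, the attribute set, and the `example.task_finish` metric name are all hypothetical; none of them are existing Airflow settings or APIs.

```python
# Attributes treated as unbounded, following the categories in this issue.
UNBOUNDED_ATTRIBUTES = frozenset({"job_id", "run_id"})


class MetricGate:
    """Hypothetical wrapper that can suppress unbounded metrics."""

    def __init__(self, emit, disable_unbounded=False):
        self._emit = emit  # any callable taking (name, attributes)
        self._disable_unbounded = disable_unbounded

    def incr(self, name, attributes=None):
        """Emit the metric unless it is unbounded and the flag is set."""
        attributes = dict(attributes or {})
        carries_unbounded = not UNBOUNDED_ATTRIBUTES.isdisjoint(attributes)
        if self._disable_unbounded and carries_unbounded:
            return False  # dropped
        self._emit(name, attributes)
        return True


emitted = []
gate = MetricGate(lambda name, attrs: emitted.append(name),
                  disable_unbounded=True)
gate.incr("local_task_job.task_exit", {"job_id": 42, "dag_id": "etl"})
gate.incr("example.task_finish", {"dag_id": "etl", "task_state": "success"})
print(emitted)  # ['example.task_finish']
```

With `disable_unbounded=False` (the default), both calls would be emitted, which keeps today's behavior unchanged.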
   
   ### Use case/motivation
   
   _No response_
   
   ### Related issues
   
   https://github.com/apache/airflow/pull/28961
   https://github.com/apache/airflow/pull/29093
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


