AutomationDev85 commented on issue #41822:
URL: https://github.com/apache/airflow/issues/41822#issuecomment-2642881783

   Hi all, 
   I wanted to switch our Airflow deployment to OTEL and ran into the same 
issue. I debugged it and found:
   
   1) Metrics like "ti.start" and "ti.finish" are exported in the worker 
context with the labels dag_id and task_id. The metrics are only available 
while the task is running. It looks like they stop being exported once the 
task has finished, and they are removed from the OtelCollector after ~5 min. 
Maybe this is because metrics like "ti.start" are exported in the worker 
context and the OtelLogger is gone once the worker task has finished?
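   
   A minimal sketch of the behavior described in 1), as a plain-Python model 
and not collector code. `ExpiringStore` and `expiration_s` are hypothetical 
names, and the 300 s window is an assumption chosen to match the ~5 min 
observed. If the collector exposes metrics through its Prometheus exporter, 
that exporter's `metric_expiration` setting (default 5m) might be the source of 
this timeout, but that is a guess about the setup.

```python
class ExpiringStore:
    """Toy model of a metrics endpoint that drops series it has not seen recently."""

    def __init__(self, expiration_s=300.0):
        # Assumption: ~5 min window, matching the observation above.
        self.expiration_s = expiration_s
        self._series = {}  # labels -> (value, time of last update)

    def update(self, labels, value, now):
        # Called whenever a process exports a data point for this series.
        self._series[labels] = (value, now)

    def scrape(self, now):
        # Drop every series whose last update is older than the window.
        self._series = {
            labels: (value, seen)
            for labels, (value, seen) in self._series.items()
            if now - seen <= self.expiration_s
        }
        return {labels: value for labels, (value, _) in self._series.items()}


store = ExpiringStore()
labels = (("dag_id", "dag1"), ("task_id", "task1"))

# The worker exports ti.start while the task is running.
store.update(labels, 1, now=0.0)
visible_while_fresh = store.scrape(now=60.0)

# The task has finished and its process is gone, so no further updates arrive.
visible_after_expiry = store.scrape(now=400.0)
```

   In this model the series is visible at 60 s and gone at 400 s, because 
nothing keeps exporting it after the task process has exited.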
   
   2) I'm not sure a singleton instance can solve the issue, since Airflow can 
run multiple workers, schedulers, and so on. To me it looks like metrics with 
the same labels overwrite each other when they are exported by two different 
Pods (e.g. 2 workers). So it is not possible to increment a counter like 
"ti.start" with the labels "dag_id" and "task_id" if 2 dag_runs are running in 
parallel on different workers.
   Expecting:
   airflow_ti_start{dag_id="dag1", task_id="task1"} 2
   but getting:
   airflow_ti_start{dag_id="dag1", task_id="task1"} 1
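   
   The overwrite in 2) can be modeled in a few lines. This is only a sketch 
under assumptions: `collect` is a hypothetical stand-in for whatever stores 
the latest value per series, and the `pod` label is a hypothetical per-Pod 
identity (in a real OTel setup a resource attribute such as 
`service.instance.id` might play that role). It shows that two Pods reporting 
the same label set collapse into one series, while adding a distinguishing 
label keeps both series so they can be summed afterwards.

```python
from collections import defaultdict


def collect(exports, identity_keys):
    """Keep the latest cumulative value per series, then sum per (dag_id, task_id).

    A series is identified by the label keys in `identity_keys`. Two exports
    with the same identity overwrite each other (last write wins).
    """
    latest = {}
    for labels, value in exports:
        key = tuple((k, labels[k]) for k in identity_keys)
        latest[key] = value

    totals = defaultdict(int)
    for key, value in latest.items():
        series_labels = dict(key)
        totals[(series_labels["dag_id"], series_labels["task_id"])] += value
    return dict(totals)


# Each worker Pod starts the task once, so each reports a cumulative count of 1.
exports = [
    ({"dag_id": "dag1", "task_id": "task1", "pod": "worker-0"}, 1),
    ({"dag_id": "dag1", "task_id": "task1", "pod": "worker-1"}, 1),
]

# Same identity on both Pods: the second export overwrites the first.
colliding = collect(exports, identity_keys=("dag_id", "task_id"))

# Per-Pod identity: both series survive and can be summed.
distinct = collect(exports, identity_keys=("dag_id", "task_id", "pod"))
```

   Here `colliding` ends up with a count of 1 for dag1/task1 and `distinct` 
with a count of 2, which matches the "getting" and "expecting" values above.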
   
   
   Two main issues to solve:
   1) Any idea how to make the OtelLogger static for one Pod? Even then we 
would still have to solve 2).
   2) How to get trustworthy metrics in multi-Pod deployments.
   
   I am writing here to restart the discussion and to get some feedback from 
you all on how to improve things. I'm not sure about the best way to fix this 
issue. We need to solve both points to get usable metrics via Otel.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

