mjpieters edited a comment on issue #12771:
URL: https://github.com/apache/airflow/issues/12771#issuecomment-753404477


   Interesting little test; but note it generates a trace from the tasks after 
the fact, without support for additional spans created __inside__ each task 
invocation, and the tracing context is not shared over any RPC / REST / 
etc. calls to other services, so those services can't inherit the tracing context.
   
   I integrated Jaeger tracing (opentracing) support into a production Airflow 
setup using Celery. Specific challenges I had to overcome:
   
   - The dagrun span is 'virtual', in that it exists from when the dagrun is 
created until the last task is completed. The scheduler will then update the 
dagrun state in the database. But tasks need a valid parent span to attach 
their own spans to.
   
     I solved this by creating a span in a monkey-patch to 
`DAG.create_dagrun()`, injecting the span info into the dagrun configuration 
together with the start time, then _discarding the span_. Then, in 
`DagRun.set_state()`, when the state changes to a finished state, I create a 
`jaeger_client.span.Span()` object from scratch from the conf-stored 
data, and submit that to Jaeger.
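
   A minimal sketch of that round-trip (the helper names and conf keys are my 
own, not Airflow's; a real implementation would serialize the `SpanContext` 
via the tracer and rebuild a `jaeger_client.span.Span` from it):

```python
import time

# Hypothetical conf keys -- real code would serialize the full jaeger
# SpanContext (e.g. via tracer.inject() with a TEXT_MAP carrier).
def stash_span_in_conf(conf, trace_id, span_id):
    """Record the span identity and start time in the dagrun conf;
    the original span object can then be discarded."""
    conf["trace_id"] = trace_id
    conf["span_id"] = span_id
    conf["span_start"] = time.time()
    return conf

def rebuild_span_args(conf, end_time):
    """When DagRun.set_state() reaches a finished state, recover what
    is needed to construct a Span object from scratch and submit it."""
    return {
        "trace_id": conf["trace_id"],
        "span_id": conf["span_id"],
        "start_time": conf["span_start"],
        "duration": end_time - conf["span_start"],
    }
```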
   
   - Tasks inherit the parent (dagrun) span context from the dagrun config; I 
patched `Task.run_raw_task()` to run the actual code under a 
`tracer.start_active_span()` context manager. This captures timing and any 
exceptions.
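
   The wrapping itself can look roughly like this (a sketch: `tracer` is any 
opentracing-compatible tracer, and `extract_parent` stands in for whatever 
pulls the dagrun span context out of the dagrun conf -- both are assumptions, 
not Airflow APIs):

```python
import functools

def trace_raw_task(original_run_raw_task, tracer, extract_parent):
    """Wrap the raw-task runner so the task body executes inside an
    active span, capturing timing and any exceptions."""
    @functools.wraps(original_run_raw_task)
    def wrapper(task_instance, *args, **kwargs):
        parent = extract_parent(task_instance)  # dagrun span context
        with tracer.start_active_span(
            task_instance.task_id, child_of=parent
        ) as scope:
            try:
                return original_run_raw_task(task_instance, *args, **kwargs)
            except Exception:
                scope.span.set_tag("error", True)  # mark failed tasks
                raise
    return wrapper
```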
   
   - You need an active tracer for traces to be captured and sent on to the 
tracer agent. So I registered code to run in the 
`cli_action_loggers.register_pre_exec_callback()` hook when the `scheduler` or 
`dag_trigger` sub-commands run, which then registers a close callback with 
`cli_action_loggers.register_post_exec_callback()`. Closing a tracer in 
`dag_trigger` takes careful work with the asyncio / tornado loop used by the 
Jaeger client; you'll lose traces if you don't watch out. I found that you had 
to go hunt for the I/O loop attached to the trace reporter object and call 
`tracer.close()` from a callback sent to that loop, as the only reliable 
method of getting the last traces out. I don't know if opentracing needs this 
level of awareness of the implementation details.
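
   In outline, the shutdown dance looked something like this (a sketch; the 
`tracer.reporter.io_loop` attribute is an implementation detail of the 
tornado-based jaeger-client reporter, not public opentracing API):

```python
import threading

def close_tracer_on_reporter_loop(tracer, timeout=5.0):
    """Ask the reporter's own I/O loop to run tracer.close(), then
    block until it has, so the process can't exit with spans still
    sitting in the reporter's buffer."""
    flushed = threading.Event()
    def _close():
        tracer.close()  # flushes any remaining buffered spans
        flushed.set()
    # Hunt down the loop attached to the trace reporter and hand it
    # the close call as a callback -- closing from the wrong thread
    # silently drops the last traces.
    tracer.reporter.io_loop.add_callback(_close)
    return flushed.wait(timeout)
```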
   
   - You generally want to start a trace when the Airflow webserver receives a 
trigger, so instrument the Flask layer too. This is also where the sampling 
decision needs to be made; tracing often only instruments a subset of all jobs, 
but in this project I set the sampling frequency to 100% at this level.
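
   For the 100% sampling rate, jaeger-client's `const` sampler with `param=1` 
does the job; the request hook below is a generic sketch (the `operation` and 
`headers` arguments and the tag prefix are my own naming, and a real hook 
would first try `tracer.extract()` on the incoming headers in case an 
upstream service already started a trace):

```python
# jaeger-client sampler configuration dict for sampling every trace:
# the 'const' sampler with param=1 (passed to jaeger_client.Config).
JAEGER_CONFIG = {
    "sampler": {"type": "const", "param": 1},
}

def start_request_span(tracer, operation, headers):
    """Start the root span when the webserver receives a trigger,
    tagging it with selected request headers (hypothetical prefix)."""
    span = tracer.start_span(operation)
    for key, value in headers.items():
        if key.lower().startswith("x-trigger-"):
            span.set_tag(key, value)
    return span
```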
   
   - I added a custom configuration section to airflow.cfg to allow me to tweak 
Jaeger tracing parameters.
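
   Something along these lines (the section and option names here are 
illustrative, not the exact ones I used):

```ini
[jaeger]
# hypothetical option names -- illustrative only
enabled = True
agent_host = localhost
agent_port = 6831
sampler_type = const
sampler_param = 1
```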
   
   But, with that work in place, we now get traces in near real time, with 
full support for tracing contexts being shared with other services called from 
Airflow tasks. We can trace a job through the frontend, submitting a job to 
Airflow, then follow any calls from tasks to further REST APIs, all as one 
system.
   
   I'd prefer it if the tracing context was not shoehorned into the dagrun 
configuration; I'd have created additional database tables or columns for this 
in the Airflow database model if I had to do this inside the Airflow project 
itself.
   
   Note that I did *not* use the Celery task hooks here to track the task 
spans, because Celery has its own overhead that we wanted to keep separate. The 
`opentracing_instrumentation` package already has Celery hooks you can use, but 
I needed the timings to be closer to the actual task invocation.
   
   Another thing to consider is any per-DAG or per-task tagging you want to add 
to the spans. For this project I needed to track specific data from the 
submitted DAG config so that we could compare task runs using the same input 
configuration.
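
   That tagging reduces to something like this sketch (the conf key names are 
placeholders; pick whatever identifies "same input configuration" in your 
DAGs):

```python
def span_tags_from_conf(conf, keys=("input_dataset", "batch_date")):
    """Pull the dagrun-conf values worth comparing across runs into
    span tags, namespaced so they're easy to filter on in the UI."""
    return {f"dag.conf.{key}": conf[key] for key in keys if key in conf}
```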
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

