kaxil opened a new pull request, #67792: URL: https://github.com/apache/airflow/pull/67792

The `common.ai` provider runs Pydantic AI agents but emits no telemetry about what the agent did: `result.usage()` and the model name are computed and then dropped at DEBUG. This wires Pydantic AI's native OpenTelemetry instrumentation into the OTLP exporter Airflow core already configures under `[traces]`, so agent spans (agent run, model call, tool call, token usage) flow to whatever OpenTelemetry backend the deployment runs (Jaeger, Tempo, Grafana, Phoenix, Langfuse, a bare collector), correlated to the task that produced them.

Scope is intentionally narrow: export only. No new metadata tables, no migration, no native trace store, no bundled UI. Viewing traces is the job of the backend the user already runs.

## How it works

- Instrumentation is attached once, in `PydanticAIHook.create_agent()`, so every LLM operator (`AgentOperator`, `@task.agent` / `@task.llm`, and the SQL / branch / file-analysis / schema-compare operators) and `LLMRetryPolicy` are covered through a single chokepoint.
- It reuses the global `TracerProvider` that core tracing installs; it never configures an exporter or provider of its own. If `[traces] otel_on` is off in the worker process, nothing is emitted and there is zero overhead.
- Parenting is implicit: the worker opens the task span (from `TaskInstance.context_carrier`) before `execute()` runs, so the agent spans nest under it and inherit the task's `trace_id` and `airflow.*` attributes. The operator does not re-extract the carrier.
- Span flushing is already handled by the worker's existing teardown, so no operator changes were needed.

## Design rationale

- **The `agent.instrument` property, not `Agent(instrument=...)`**: the constructor kwarg is deprecated in current pydantic-ai; the property is the stable per-agent surface back to the `pydantic-ai-slim>=1.71` floor.
- **`version=4`**: pins the OpenTelemetry GenAI semantic-convention attribute names (`gen_ai.*`) so they do not drift with the pydantic-ai default.
- **Provider config, not core `[traces]`**: the transport is shared, but the decision to emit GenAI spans and what content to capture is provider-owned.

## Usage

```ini
[traces]
otel_on = True

[common.ai]
otel_export_enabled = True
```

Point the exporter with the standard OpenTelemetry environment variables. For an OTLP/HTTP collector:

```bash
export OTEL_TRACES_EXPORTER="otlp_proto_http"
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://otel-collector:4318/v1/traces"
```

## Content capture and privacy

`capture_content` (default off) controls whether prompt, completion, and tool-call text is attached to spans. With it off, only token counts, model id, latency, tool names, and finish reason are recorded. When on, that content is exported to the tracing backend without redaction: Airflow's secret masking is a logging filter and does not apply to span attributes. Enable it only for debugging in a trusted environment.

## Deferred follow-ups

- `otel_exporter_endpoint` to route GenAI spans to a different collector than infra traces (needs a second span processor on the provider).
- A `redactor` hook to scrub captured content before export.
- Tracing the non-pydantic-ai operators (`LlamaIndexRetrievalOperator`, `DocumentLoaderOperator`), which do not build through `create_agent()`.
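
## Illustrative sketch: how the switches combine

For reviewers who want the gating from the Usage and Content capture sections in one place, here is a minimal standard-library sketch. It is not the code in this PR. The helper name `resolve_genai_tracing` is hypothetical. Placing `capture_content` under `[common.ai]` and treating `otel_export_enabled` as off by default are assumptions. The real provider reads Airflow's configuration rather than parsing INI text directly.

```python
import configparser

# The two switches from the Usage section. `capture_content` is left unset,
# so it takes its documented default (off).
EXAMPLE_CFG = """
[traces]
otel_on = True

[common.ai]
otel_export_enabled = True
"""


def resolve_genai_tracing(cfg_text: str) -> dict:
    """Hypothetical helper: decide whether to emit GenAI spans and attach content."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)

    # Core tracing must be on, otherwise there is no exporter to reuse.
    core_on = parser.getboolean("traces", "otel_on", fallback=False)
    # Provider-owned switch (assumed to default to off).
    provider_on = parser.getboolean("common.ai", "otel_export_enabled", fallback=False)
    # Assumed to live under [common.ai]; documented default is off.
    capture = parser.getboolean("common.ai", "capture_content", fallback=False)

    emit = core_on and provider_on
    # Content can only be attached to spans that are actually emitted.
    return {"emit_spans": emit, "include_content": emit and capture}


print(resolve_genai_tracing(EXAMPLE_CFG))
# {'emit_spans': True, 'include_content': False}
```

With the configuration from the Usage section, spans are emitted but carry no prompt or completion text. Under these assumptions, turning `otel_on` off suppresses everything regardless of the provider settings.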
