kaxil opened a new pull request, #68926: URL: https://github.com/apache/airflow/pull/68926
## Summary `durable=True` on `AgentOperator` / `@task.agent` caches each model response and tool result so a retry replays completed steps instead of re-running expensive LLM calls. Until now that cache was always an ObjectStorage JSON file, which required `[common.ai] durable_cache_path` to be set. The durable cache is per-task-instance, per-run, and cleared on success, which is exactly the scope of the [AIP-103 task state store](https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/models/task_state_store.py). On **Airflow >= 3.3** the cache now lives there: - No `durable_cache_path` configuration needed. - Large step payloads offload automatically when `[workers] state_store_backend` is set. - Cached steps are deleted on success, and reclaimed by the DAG-run cascade if a run fails permanently. On **Airflow < 3.3** the ObjectStorage backend is unchanged and selected automatically. ## Design - Both backends implement a shared `DurableStorageProtocol`, so `CachingModel` and `CachingToolset` depend on the interface, not a concrete backend. - `AgentOperator._build_durable_storage` picks the backend by Airflow version. - Durable keys use a reserved prefix (`__commonai_durable__...`) so they cannot collide with keys user code writes via `context["task_state_store"]` in the same task instance. - Entries are written with `NEVER_EXPIRE` so a retry can replay them regardless of `retry_delay` or the global retention config; `cleanup()` removes only the keys this run touched. ## Gotchas - On a default 3.3+ install (no `state_store_backend`), cached step values are stored in the metadata database. For agents with large responses, configure `[workers] state_store_backend` to offload them to external storage. - A tool passed via `agent_params={"tools": [...]}` is not wrapped by the caching toolset (only `toolsets=` are), so only its model steps replay. This is existing behavior, unchanged here. ## Verification Verified end-to-end on a real Airflow 3.3 standalone: a durable `AgentOperator` whose tool fails on the first attempt replays the cached model step from the task state store on the retry, and a direct round-trip task confirms cache, replay, and cleanup against the live Execution API and metadata DB. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
