joeyutong opened a new issue, #859: URL: https://github.com/apache/flink-agents/issues/859
### Search before asking

- [x] I searched in the [issues](https://github.com/apache/flink-agents/issues) and found nothing similar.

### Description

Action-scoped metrics can be recorded under the wrong action when a cached resource is reused across actions.

Flink Agents injects the current action metric group into a resource when `RunnerContext.get_resource(...)` / `RunnerContext.getResource(...)` returns it. However, resources are cached and shared. If action A gets a chat model resource, then yields or waits across an async/durable boundary, action B can later get the same cached resource and overwrite its metric group with B's action scope. When action A resumes and records token metrics by reading the metric group from the resource field, those metrics may be recorded under B's action scope.

This can affect paths where metrics are recorded after the request returns rather than at the moment the action obtains the resource. For example:

- Python chat token metrics after `durable_execute` / `durable_execute_async`.
- Java chat token metrics after the chat response is returned.
- Shared cached resources used by multiple actions.
- Cross-language wrappers and provider resources, where the wrapper and underlying provider may have separate metric group state.

The expected behavior is that token metrics are recorded under the action scope that initiated the request. The metric group used for delayed metric recording should not depend on mutable state stored on a cached resource after another action may have rebound it.

### How to reproduce

One minimal reproduction shape is:

1. Define two actions that share the same chat model resource.
2. Let action A obtain the chat model resource and start a chat request.
3. Before action A records token metrics, let action B obtain the same cached chat model resource, causing the resource metric group to be rebound to B's action scope.
4. Resume action A and record token metrics from the chat response.
5. Observe that the token counters can be registered under B's action metric scope instead of A's.

A unit-level reproduction can simulate the same condition by:

1. Creating a setup whose connection wraps metric groups with provider dimensions.
2. Binding the setup to action A's metric group.
3. Rebinding the same setup/resource to action B's metric group before recording token metrics.
4. Recording token metrics through the setup.
5. Verifying that the counter follows the latest mutable resource metric group rather than the action A group that initiated the request.

### Version and environment

Observed from the current main-branch code path:

- Python `FlinkRunnerContext.get_resource(...)` binds the current action metric group to the cached resource.
- Java `RunnerContextImpl.getResource(...)` binds the current action metric group to the cached resource.
- Python and Java chat token metrics are recorded after the chat response returns and read the metric group from the chat model resource.

This is independent of a specific deployment mode; it is a resource lifecycle / metric binding issue.

### Are you willing to submit a PR?

- [ ] I'm willing to submit a PR!
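### Illustrative sketch

The rebinding condition described in this issue can be simulated without any Flink Agents code. The sketch below is only an illustration: `FakeMetricGroup`, `FakeChatModel`, `FakeRunnerContext`, `run_scenario`, and the `totalTokens` counter name are all hypothetical stand-ins, not the project's actual classes or metric names. It contrasts reading the metric group from the mutable resource field with using a snapshot taken when the request starts, which is one possible way to meet the expected behavior.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class FakeMetricGroup:
    """Hypothetical stand-in for an action-scoped metric group."""

    scope: str
    counters: Dict[str, int] = field(default_factory=dict)

    def inc(self, name: str, value: int) -> None:
        self.counters[name] = self.counters.get(name, 0) + value


class FakeChatModel:
    """Hypothetical cached resource holding its metric group in a mutable field."""

    def __init__(self) -> None:
        self.metric_group: Optional[FakeMetricGroup] = None


class FakeRunnerContext:
    """Hypothetical context that rebinds the shared resource on every lookup."""

    def __init__(self, action_group: FakeMetricGroup, cache: Dict[str, FakeChatModel]) -> None:
        self.action_group = action_group
        self.cache = cache

    def get_resource(self, name: str) -> FakeChatModel:
        resource = self.cache.setdefault(name, FakeChatModel())
        # Rebinding overwrites whatever scope an earlier action had bound.
        resource.metric_group = self.action_group
        return resource


def run_scenario(use_snapshot: bool) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (counters of action A, counters of action B) after A records 42 tokens."""
    cache: Dict[str, FakeChatModel] = {}
    group_a = FakeMetricGroup("action_a")
    group_b = FakeMetricGroup("action_b")
    ctx_a = FakeRunnerContext(group_a, cache)
    ctx_b = FakeRunnerContext(group_b, cache)

    # Action A obtains the resource and starts its request.
    model = ctx_a.get_resource("chat_model")
    snapshot = model.metric_group  # scope captured at request start

    # Action A yields; action B obtains the same cached resource.
    ctx_b.get_resource("chat_model")

    # Action A resumes and records token metrics.
    target = snapshot if use_snapshot else model.metric_group
    target.inc("totalTokens", 42)
    return group_a.counters, group_b.counters


if __name__ == "__main__":
    print("read from resource field:", run_scenario(use_snapshot=False))
    print("read from snapshot:", run_scenario(use_snapshot=True))
```

With `use_snapshot=False` the counter ends up in action B's group, matching the reported behavior; with `use_snapshot=True` it stays in action A's group. Whether a snapshot, an explicit parameter, or some other mechanism fits the real code paths is left open here.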
