GitHub user joeyutong added a comment to the discussion: Parallel Tool Call Execution
@da-daken Sorry for the late reply. Thanks for raising this and writing up the proposal.

I agree the problem is real: the current `tool_call_action` processes tool calls serially. Even with `tool-call.async=true`, `durableExecuteAsync` gives us inter-action concurrency, but not parallel execution of multiple tool calls within the same `ToolRequestEvent`.

Overall, I support solving this latency problem. Before implementation, I think there are a few semantics we should clarify.

## Recovery state machine

One detail I think needs more clarification is how the proposed batch slots fit into the current fine-grained durable execution state machine.

Today the recovery model is cursor-based: durable calls are matched by `currentCallIndex`, and the cursor advances one call at a time. A pending durable call also occupies the current cursor position until it is finalized.

The proposal mentions individual `PENDING` slots and independent reconcilers for tools in the same batch. That seems to require multiple active batch slots under one logical durable execution step, while the current state machine only advances one cursor position at a time.

So before implementing this, I think we should clarify how batch slots interact with the existing cursor advancement model. Otherwise the recovery semantics may be ambiguous, especially when failover happens after some tool calls have been submitted but before the batch result is fully persisted.

## Failure semantics

Current `tool_call_action` has collect-all behavior: one tool failure is captured in the `ToolResponseEvent`, and other tools can still succeed.

A generic `durableExecuteAllAsync(List<DurableCallable<T>>)` can easily become fail-fast if one future throws. For tool calls, I think we should preserve collect-all semantics and return one result/error per tool call, unless we intentionally want to change the external behavior.

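To make the collect-all point concrete, here is a minimal sketch using only `java.util.concurrent`. It is not the flink-agents API: `CollectAllSketch`, `Outcome`, and `collectAll` are hypothetical names for illustration, and durability, recovery, and reconcilers are left out entirely. The idea is to convert each future's exception into a per-call outcome before joining, so one failing tool cannot abort the batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

public class CollectAllSketch {

    /** Hypothetical per-tool outcome: error is null on success, non-null on failure. */
    public static final class Outcome<T> {
        public final T value;
        public final Throwable error;

        private Outcome(T value, Throwable error) {
            this.value = value;
            this.error = error;
        }

        public boolean isSuccess() {
            return error == null;
        }
    }

    /** Runs all calls on the pool and returns one outcome per call, in input order. */
    public static <T> List<Outcome<T>> collectAll(List<Supplier<T>> calls, ExecutorService pool) {
        List<CompletableFuture<Outcome<T>>> futures = new ArrayList<>();
        for (Supplier<T> call : calls) {
            // handle() sees both the value and the exception, so a failure
            // becomes an ordinary Outcome instead of propagating to join().
            futures.add(CompletableFuture.supplyAsync(call, pool)
                    .handle((value, error) -> new Outcome<T>(value, unwrap(error))));
        }
        List<Outcome<T>> outcomes = new ArrayList<>();
        for (CompletableFuture<Outcome<T>> future : futures) {
            outcomes.add(future.join()); // does not throw: failures were already captured
        }
        return outcomes;
    }

    /** Strips the CompletionException wrapper that CompletableFuture adds around failures. */
    private static Throwable unwrap(Throwable error) {
        if (error instanceof CompletionException && error.getCause() != null) {
            return error.getCause();
        }
        return error;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Supplier<String>> calls = new ArrayList<>();
        calls.add(() -> "weather=sunny");
        calls.add(() -> { throw new IllegalStateException("tool unavailable"); });
        calls.add(() -> "time=12:00");
        try {
            for (Outcome<String> outcome : collectAll(calls, pool)) {
                System.out.println(outcome.isSuccess()
                        ? "ok: " + outcome.value
                        : "failed: " + outcome.error.getMessage());
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

By contrast, joining on `CompletableFuture.allOf(...)` over the raw futures would complete exceptionally as soon as any one of them fails, which is the fail-fast behavior described above.
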
## Side effects and duplicate calls

Parallel execution increases the number of in-flight external tool calls.

When failover happens during a parallel batch, there may be multiple submitted-but-not-yet-persisted external calls at the same time, so the chance of duplicate tool calls after recovery is higher than in the current serial flow. I think the proposal should explicitly document this behavior and how it interacts with reconcilers or external idempotency.

## Concurrency limits

`num-async-threads` is global, but one `ToolRequestEvent` with many tool calls could occupy the whole pool and affect other keys/actions.

Should we add a per-batch or per-tool-call parallelism limit, or at least document that concurrency is only bounded by the global async pool in the first version?

## Trace / event visibility

If we plan to add tool-call-level events later, parallel tool calls should probably align with that direction.

With concurrent execution, it becomes more important to have per-tool start/end/status/latency visibility, otherwise debugging a slow or failed batch will be difficult. This may not need to block the first implementation, but the design should leave a clear path for tool-call-level tracing/events.

GitHub link: https://github.com/apache/flink-agents/discussions/855#discussioncomment-17471103

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]
