GitHub user joeyutong added a comment to the discussion: Parallel Tool Call 
Execution

@da-daken Sorry for the late reply. Thanks for raising this and writing up the 
proposal.

I agree the problem is real: the current `tool_call_action` processes tool 
calls serially. Even with `tool-call.async=true`, `durableExecuteAsync` gives 
us inter-action concurrency, but not parallel execution of multiple tool calls 
within the same `ToolRequestEvent`.

Overall, I support solving this latency problem. Before implementation, I think 
there are a few semantics we should clarify.

## Recovery state machine

One detail I think needs more clarification is how the proposed batch slots fit 
into the current fine-grained durable execution state machine.

Today the recovery model is cursor-based: durable calls are matched by 
`currentCallIndex`, and the cursor advances one call at a time. A pending 
durable call also occupies the current cursor position until it is finalized.

The proposal mentions individual `PENDING` slots and independent reconcilers 
for tools in the same batch. That seems to require multiple active batch slots 
under one logical durable execution step, while the current state machine only 
advances one cursor position at a time.

So before implementing this, I think we should clarify how batch slots interact 
with the existing cursor advancement model. Otherwise the recovery semantics 
may be ambiguous, especially when failover happens after some tool calls have 
been submitted but before the batch result is fully persisted.
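
To make the concern concrete, here is a toy model of the cursor behavior described above. It is not the actual flink-agents state machine: `CursorModelSketch`, `begin`, `finish`, and `cursor` are hypothetical names, and the only assumption it encodes is that a pending call occupies the cursor position until it is finalized.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: one cursor, at most one PENDING slot at a time. */
public class CursorModelSketch {
    enum SlotState { PENDING, DONE }

    static final class Slot {
        final String callId;
        SlotState state = SlotState.PENDING;
        String result;
        Slot(String callId) { this.callId = callId; }
    }

    private final List<Slot> slots = new ArrayList<>();
    private int currentCallIndex = 0;

    /** Starts a durable call at the cursor; rejected while another call is pending. */
    public void begin(String callId) {
        if (currentCallIndex < slots.size()) {
            throw new IllegalStateException(
                "slot " + currentCallIndex + " is still PENDING");
        }
        slots.add(new Slot(callId));
    }

    /** Finalizes the pending call and advances the cursor by exactly one. */
    public void finish(String result) {
        if (currentCallIndex >= slots.size()) {
            throw new IllegalStateException("no pending call at the cursor");
        }
        Slot slot = slots.get(currentCallIndex);
        slot.state = SlotState.DONE;
        slot.result = result;
        currentCallIndex++;
    }

    public int cursor() { return currentCallIndex; }
}
```

In this model a second `begin` before `finish` has nowhere to go, which is the gap the batch-slot proposal would need to define.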

## Failure semantics

Current `tool_call_action` has collect-all behavior: one tool failure is 
captured in the `ToolResponseEvent`, and other tools can still succeed.

A generic `durableExecuteAllAsync(List<DurableCallable<T>>)` can easily become 
fail-fast if one future throws. For tool calls, I think we should preserve 
collect-all semantics and return one result/error per tool call, unless we 
intentionally want to change the external behavior.
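
As a minimal sketch of what collect-all could look like, the example below uses plain `CompletableFuture` rather than the durable execution API. `CollectAllSketch`, `Outcome`, and `executeAllCollecting` are hypothetical names for illustration, not part of the proposal or of flink-agents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.function.Supplier;

public class CollectAllSketch {

    /** Hypothetical per-tool outcome: error is null on success. */
    public static final class Outcome<T> {
        public final T value;
        public final Throwable error;
        Outcome(T value, Throwable error) {
            this.value = value;
            this.error = error;
        }
        public boolean isSuccess() { return error == null; }
    }

    /** Runs all calls in parallel and returns one Outcome per call, in input order. */
    public static <T> List<Outcome<T>> executeAllCollecting(
            List<Supplier<T>> calls, ExecutorService pool) {
        List<CompletableFuture<Outcome<T>>> futures = new ArrayList<>();
        for (Supplier<T> call : calls) {
            // handle() turns a failure into a value, so one failing call
            // cannot fail the batch the way CompletableFuture.allOf would.
            futures.add(CompletableFuture.supplyAsync(call, pool)
                .handle((value, error) -> new Outcome<T>(value, unwrap(error))));
        }
        List<Outcome<T>> outcomes = new ArrayList<>();
        for (CompletableFuture<Outcome<T>> future : futures) {
            outcomes.add(future.join());
        }
        return outcomes;
    }

    /** Strips the CompletionException wrapper added by supplyAsync. */
    private static Throwable unwrap(Throwable error) {
        if (error instanceof CompletionException && error.getCause() != null) {
            return error.getCause();
        }
        return error;
    }
}
```

The point of the sketch is the shape of the return type: a list of per-call outcomes rather than a single future that fails as soon as any call fails.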

## Side effects and duplicate calls

Parallel execution increases the number of in-flight external tool calls. When 
failover happens during a parallel batch, there may be multiple 
submitted-but-not-yet-persisted external calls at the same time, so the chance 
of duplicate tool calls after recovery is higher than in the current serial 
flow.

I think the proposal should explicitly document this behavior and how it 
interacts with reconcilers or external idempotency.
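
One way to picture the external idempotency option is a deterministic key per tool call, so that a call replayed after recovery carries the same key as the original. The sketch below is only an illustration under that assumption: `IdempotencySketch`, `idempotencyKey`, and `FakeExternalTool` are hypothetical, and whether a real tool honors such a key depends entirely on the external service.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class IdempotencySketch {

    /** Derives a stable key from ids that are assumed to survive recovery. */
    public static String idempotencyKey(String requestEventId, String toolCallId) {
        String name = requestEventId + "/" + toolCallId;
        return UUID.nameUUIDFromBytes(name.getBytes(StandardCharsets.UTF_8)).toString();
    }

    /** In-memory stand-in for an external tool that deduplicates by key. */
    public static final class FakeExternalTool {
        private final Map<String, String> resultsByKey = new ConcurrentHashMap<>();
        public final AtomicInteger sideEffects = new AtomicInteger();

        public String invoke(String key, String args) {
            // The side effect runs only the first time a key is seen;
            // a replay with the same key gets the stored result back.
            return resultsByKey.computeIfAbsent(key, k -> {
                sideEffects.incrementAndGet();
                return "result-for-" + args;
            });
        }
    }
}
```

Without something like this on the tool side, a replayed call in a parallel batch is indistinguishable from a new one, which is why the behavior is worth documenting.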

## Concurrency limits

`num-async-threads` is a global limit, so a single `ToolRequestEvent` with many
tool calls could occupy the whole pool and affect other keys/actions.

Should we add a per-batch or per-tool-call parallelism limit, or at least 
document that concurrency is only bounded by the global async pool in the first 
version?
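
For reference, a per-batch limit could be as simple as a semaphore in front of the shared pool. This is a rough sketch using plain `java.util.concurrent`, not the durable execution API; `BatchLimitSketch`, `runWithLimit`, and `maxPerBatch` are hypothetical names, and blocking the submitting thread as done here may not be acceptable in the real operator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public class BatchLimitSketch {

    /**
     * Runs the calls on a shared pool while keeping at most maxPerBatch of
     * them in flight. Results are returned in input order; in this sketch a
     * failing call makes join() throw, so failure handling is out of scope.
     */
    public static <T> List<T> runWithLimit(
            List<Supplier<T>> calls, ExecutorService sharedPool, int maxPerBatch) {
        Semaphore permits = new Semaphore(maxPerBatch);
        List<CompletableFuture<T>> futures = new ArrayList<>();
        for (Supplier<T> call : calls) {
            // Wait for a free permit before handing the next call to the pool.
            permits.acquireUninterruptibly();
            try {
                futures.add(CompletableFuture.supplyAsync(call, sharedPool)
                    .whenComplete((value, error) -> permits.release()));
            } catch (RuntimeException e) {
                // Submission was rejected, so the call never ran: return the permit.
                permits.release();
                throw e;
            }
        }
        List<T> results = new ArrayList<>();
        for (CompletableFuture<T> future : futures) {
            results.add(future.join());
        }
        return results;
    }
}
```

With `maxPerBatch` smaller than the pool size, one large batch leaves threads available for other keys/actions, at the cost of higher latency for that batch.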

## Trace / event visibility

If we plan to add tool-call-level events later, parallel tool calls should
probably align with that direction. With concurrent execution, it becomes more
important to have per-tool start/end/status/latency visibility; otherwise,
debugging a slow or failed batch will be difficult.

This may not need to block the first implementation, but the design should 
leave a clear path for tool-call-level tracing/events.

GitHub link: 
https://github.com/apache/flink-agents/discussions/855#discussioncomment-17471103
