GitHub user pltbkd added a comment to the discussion: [Discussion] Supervisor + 
Sub-Agent Orchestration: Concrete Multi-Agent Use Cases for Event-Driven Agents

Hi, @weiqingy. Thanks a lot for raising this discussion. I'm currently 
investigating the multi-agent framework, including its integration with the RPC 
Operator proposed in FLIP-577.

The supervisor+subagent pattern is a very typical application scenario. To 
facilitate a more productive discussion, I suggest we separate two concerns: 
(1) the definition and execution of the subagent itself, and (2) the 
orchestration and execution of the overall workflow.

1. Workflow Orchestration and Execution

First, I strongly agree with your idea of a "callable resource". I believe the 
introduction of callable resources will bring significant and beneficial 
changes to the orchestration of subagents—and indeed to the overall 
orchestration approach and usability of flink-agent.

Historically, flink-agent has adopted an event-driven execution model, 
including coordination among actions within an agent. User job orchestration 
has been built around this paradigm. This feels natural in a pure pipeline: 
each participant completes their task and hands it off to the next, without 
worrying about who picks it up next. However, in a subagent architecture, the 
main agent needs to perform further processing based on the subagent's 
execution results, and the current design becomes less user-friendly.

Actually, LLM actions face a similar issue. flink-agent requires users to split 
LLM input preparation and output handling into separate steps, manually 
subscribing to and processing events. To implement a logical operation A that 
calls a model, users must implement two Actions: Action-A (to produce the chat 
request) and Action-A' (to handle the chat response). This is not only hard to 
work with, but also changes how users expect the system to behave. For example, 
in the diagram below: (1) represents the user's logical intent, (2) is how the 
user expects the execution to flow, (3) is how flink-agent actually executes 
it, and (4) describes how users may perceive the execution model: as if all 
user Actions serve the LLM rather than orchestrating their own business 
logic. I guess this is why you see flink-agent as LLM-centric.
<img width="479" height="356" alt="image" src="https://github.com/user-attachments/assets/d09b8ce1-73c7-43c1-acce-05fae7d4805b" />
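
To make the two-Action split concrete, here is a minimal Python sketch of execution style (3). It is a toy under stated assumptions: `ToyRuntime`, the event classes, and `fake_llm` are hypothetical stand-ins written for this illustration, not the flink-agent API.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class InputEvent:
    text: str


@dataclass
class ChatRequest:
    prompt: str


@dataclass
class ChatResponse:
    reply: str


@dataclass
class OutputEvent:
    result: str


class ToyRuntime:
    """Toy stand-in for an event-driven runtime (hypothetical, not flink-agent)."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = deque()
        self.trace = []

    def on(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def send(self, event):
        self.queue.append(event)

    def run(self, first_event):
        self.send(first_event)
        outputs = []
        while self.queue:
            event = self.queue.popleft()
            if isinstance(event, OutputEvent):
                outputs.append(event.result)
            for handler in self.handlers[type(event)]:
                self.trace.append(handler.__name__)
                handler(event, self)
        return outputs


def action_a(event, rt):
    # First half of logical operation A: build the chat request.
    rt.send(ChatRequest(prompt=f"Summarize: {event.text}"))


def fake_llm(event, rt):
    # Stand-in for the model call; a real model would be invoked here.
    rt.send(ChatResponse(reply=event.prompt.upper()))


def action_a_prime(event, rt):
    # Second half of logical operation A: handle the chat response.
    rt.send(OutputEvent(result=f"done: {event.reply}"))


def build_runtime():
    rt = ToyRuntime()
    rt.on(InputEvent, action_a)
    rt.on(ChatRequest, fake_llm)
    rt.on(ChatResponse, action_a_prime)
    return rt


if __name__ == "__main__":
    rt = build_runtime()
    print(rt.run(InputEvent("hello")))
    print(rt.trace)
```

The single logical operation A ends up as two handlers with the model call wedged between them, and any state A needs after the response has to be carried through events or shared storage.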

Building on this, I've rethought flink-agent's current APIs and execution 
model, and arrived at conclusions very similar to the "callable resource" 
concept. We should provide users with a new request-response style interaction 
paradigm for orchestration, rather than being limited to event-triggered flows. 
LLM calls and subagents naturally fit the former, while the latter still holds 
value for decoupled orchestration and flexible subscription. The two paradigms 
can complement each other, and the event-driven approach can still be used for 
orchestrating complex subagent workflows.

Users would interact via a new `call` + `await` interface. At the 
implementation level, we can wrap subagents, LLMs, and other callable resources 
as Actions, automatically orchestrating their request/response flows: `call` 
sends the request (with a framework-provided request wrapper, response 
dispatcher, and completion signaling), while `await` waits for the completion 
signal and retrieves the result, similar to `execute_async`. This approach can 
reuse most of our existing capabilities, including supporting parallel subagent 
invocations by executing actions in parallel, which is already validated. 
(Though minor enhancements are still needed; I'll raise a separate discussion 
on that.)
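
As a rough illustration of the `call` + `await` style, here is a minimal sketch built on Python's `asyncio`. Everything in it is an assumption made for illustration: `call`, `CallHandle`, and the two stand-in subagents are hypothetical names, and the real interface may look quite different.

```python
import asyncio


class CallHandle:
    """Hypothetical handle returned by `call`; wraps a completion signal."""

    def __init__(self, task):
        self._task = task

    def __await__(self):
        # Awaiting the handle waits for completion and yields the result.
        return self._task.__await__()


def call(resource, request):
    # Hypothetical `call`: dispatch the request and return immediately.
    # Must be invoked while an event loop is running.
    return CallHandle(asyncio.ensure_future(resource(request)))


async def summarizer(request):
    # Stand-in subagent; a real one might run a model or a workflow.
    await asyncio.sleep(0.01)
    return f"summary({request})"


async def translator(request):
    # Second stand-in subagent, used to show parallel invocation.
    await asyncio.sleep(0.01)
    return f"translation({request})"


async def main_agent(text):
    # Both requests are in flight before either result is awaited.
    h1 = call(summarizer, text)
    h2 = call(translator, text)
    summary = await h1
    translation = await h2
    # The main agent continues with the results in the same function.
    return f"{summary} | {translation}"


if __name__ == "__main__":
    print(asyncio.run(main_agent("doc")))
```

The point of the sketch is that request preparation and result handling sit in one function, while the two subagent calls still overlap in time.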

2. Definition and Execution of the Subagent Itself

Building on the foundation above, how users define and use subagents becomes 
clear: a subagent can be as simple as a single Action, or a complex workflow 
orchestrated via event subscription; at runtime, it can be wrapped as a 
callable resource and directly called from the main agent, while the framework 
internally continues to use event-based subscription and scheduling.

However, a subagent entails more than just executing an Action or Action chain. 
It may also require: isolated context, an independent toolset, specialized 
prompts, dedicated compute resources, and more. I haven't deeply analyzed the 
requirements specific to subagents yet. Please feel free to share your ideas.
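
One possible shape for this, sketched in Python purely as a starting point for discussion: a subagent is described by a small spec (prompt, toolset, Action chain) and wrapped into a plain callable, with a fresh context created per call. `SubagentSpec`, `as_callable`, and all field names are hypothetical and not part of flink-agent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SubagentSpec:
    """Hypothetical description of a subagent; field names are illustrative."""
    name: str
    prompt: str
    actions: List[Callable[[dict, str], str]]
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)


def as_callable(spec: SubagentSpec) -> Callable[[str], str]:
    """Wrap a subagent so the main agent can call it like a function."""

    def invoke(request: str) -> str:
        # Fresh context per call, so state does not leak between calls
        # or into the main agent.
        ctx = {"prompt": spec.prompt, "tools": spec.tools, "history": []}
        value = request
        for action in spec.actions:
            value = action(ctx, value)
            ctx["history"].append(value)
        return value

    return invoke


def normalize(ctx: dict, value: str) -> str:
    return value.strip().lower()


def lookup(ctx: dict, value: str) -> str:
    # Uses only the tools registered for this subagent.
    return ctx["tools"]["search"](value)


def answer(ctx: dict, value: str) -> str:
    return f"{ctx['prompt']} {value}"


def build_researcher():
    spec = SubagentSpec(
        name="researcher",
        prompt="[research]",
        actions=[normalize, lookup, answer],
        tools={"search": lambda q: f"results for '{q}'"},
    )
    return as_callable(spec)


if __name__ == "__main__":
    researcher = build_researcher()
    print(researcher("  Flink RPC  "))
```

A single-Action subagent is just a spec with one entry in `actions`, so the main agent calls both kinds the same way.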

Regarding subagent execution: based on the approach outlined in section 1, we 
can already run subagents within the same TaskManager. However, due to the GIL, 
this model cannot run the Python code of multiple subagents in parallel. This may 
suffice for simple, LLM-centric logic, but for more complex scenarios, we 
likely need to run subagents in isolated processes or dedicated external 
resources—to prevent subagents from affecting the main agent's stability or 
competing for its compute resources.
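
As a small illustration of process isolation, the sketch below runs a stand-in subagent in a separate Python interpreter and exchanges JSON over stdin/stdout. It uses only the standard library; `call_isolated` and the inline subagent source are hypothetical, and this is not how the RPC Operator or flink-agent would necessarily do it.

```python
import json
import subprocess
import sys

# Source of a stand-in subagent that runs in its own Python process.
# It reads a JSON request on stdin and writes a JSON response on stdout.
SUBAGENT_SOURCE = """
import json, sys
request = json.load(sys.stdin)
if request.get("crash"):
    raise RuntimeError("subagent failed")
print(json.dumps({"result": request["text"][::-1]}))
"""


def call_isolated(request, timeout=30.0):
    """Run the subagent in a separate interpreter and return its reply."""
    proc = subprocess.run(
        [sys.executable, "-c", SUBAGENT_SOURCE],
        input=json.dumps(request),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode != 0:
        # The failure stays inside the child; the caller decides what to do.
        return {"error": f"subagent exited with code {proc.returncode}"}
    return json.loads(proc.stdout)


if __name__ == "__main__":
    print(call_isolated({"text": "abc"}))
    print(call_isolated({"text": "abc", "crash": True}))
```

Each child process has its own interpreter and its own GIL, and a crash in the subagent surfaces to the main agent as an error value rather than taking the main agent down with it.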

Currently, the RPC Operator planned in FLIP-577 appears to be a promising 
option. As Flink's new infrastructure for AI workloads, it enables unified 
lifecycle and resource management at the job level, while supporting flexible, 
independent scaling, fault tolerance, and targeted communication optimizations. 
We can keep an eye on it.

GitHub link: 
https://github.com/apache/flink-agents/discussions/660#discussioncomment-16902363
