GitHub user pltbkd added a comment to the discussion: [Discussion] Supervisor + Sub-Agent Orchestration: Concrete Multi-Agent Use Cases for Event-Driven Agents
Hi, @weiqingy. Thanks a lot for raising this discussion. I'm currently investigating the multi-agent framework, including its integration with the RPC Operator proposed in FLIP-577. The supervisor+subagent pattern is a very typical application scenario. To facilitate a more productive discussion, I suggest we separate two concerns: (1) the definition and execution of the subagent itself, and (2) the orchestration and execution of the overall workflow.

1. Workflow Orchestration and Execution

First, I strongly agree with your idea of a "callable resource". I believe introducing callable resources will bring significant and beneficial changes to the orchestration of subagents, and indeed to the overall orchestration approach and usability of flink-agent.

Historically, flink-agent has adopted an event-driven execution model, including for coordination among actions within an agent. User job orchestration has been built around this paradigm. This feels natural in a pure pipeline: each participant completes its task and hands it off, without worrying about who picks it up next. However, in a subagent architecture, the main agent needs to perform further processing based on the subagent's execution results, and the current design becomes less user-friendly.

Actually, LLM actions face a similar issue. flink-agent requires users to split LLM input preparation and output handling into separate steps, manually subscribing to and processing events. To implement a logical operation A that calls a model, users must implement two Actions: Action-A (to produce the chat request) and Action-A' (to handle the chat response). This is not only hard to work with, but also departs from how users expect the system to behave.
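To make the Action-A / Action-A' split concrete, here is a minimal, self-contained Python sketch. It is only an illustration under stated assumptions: the toy dispatcher, the event names, and `fake_llm` are all hypothetical and are not flink-agent APIs. It contrasts the single logical operation a user has in mind with the same logic split across two event-subscribed actions.

```python
from collections import deque


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; not a flink-agent API."""
    return f"summary of: {prompt}"


# (a) What the user has in mind: one logical operation A.
def operation_a(text: str) -> str:
    prompt = f"Summarize: {text}"   # prepare the model input
    reply = fake_llm(prompt)        # call the model
    return reply.upper()            # handle the model output


# (b) The same logic in an event-driven style, split into two user actions.
def action_a(event, emit):
    """Action-A: turn the input into a chat request event."""
    emit(("chat_request", f"Summarize: {event[1]}"))


def llm_action(event, emit):
    """Framework-side step (assumed): answer the chat request."""
    emit(("chat_response", fake_llm(event[1])))


def action_a_prime(event, emit):
    """Action-A': pick up the chat response and finish the work."""
    emit(("output", event[1].upper()))


# Hypothetical subscription table: event type -> action that handles it.
SUBSCRIPTIONS = {
    "input": action_a,
    "chat_request": llm_action,
    "chat_response": action_a_prime,
}


def run(text: str) -> str:
    """Toy dispatcher: deliver each event to the action subscribed to its type."""
    queue = deque([("input", text)])
    while queue:
        event = queue.popleft()
        if event[0] == "output":
            return event[1]
        SUBSCRIPTIONS[event[0]](event, queue.append)
    raise RuntimeError("no output event produced")
```

Both paths produce the same result, but in (b) the user's logic is cut in two around the model call, and the link between the halves exists only through the subscription table.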
For example, in the diagram below: (1) represents the user's logical intent, (2) is how the user expects the execution to flow, (3) is how flink-agent actually executes it, and (4) describes how users may feel about the execution model: as if all user Actions serve the LLM rather than orchestrating their own business logic. I guess this is why you see flink-agent as LLM-centric.

<img width="479" height="356" alt="image" src="https://github.com/user-attachments/assets/d09b8ce1-73c7-43c1-acce-05fae7d4805b" />

Building on this, I've rethought flink-agent's current APIs and execution model, and arrived at conclusions very similar to the "callable resource" concept. We should provide users with a new request-response style interaction paradigm for orchestration, rather than limiting them to event-triggered flows. LLM calls and subagents naturally fit the request-response style, while the event-driven style still holds value for decoupled orchestration and flexible subscription. The two paradigms can complement each other, and the event-driven approach can still be used for orchestrating complex subagent workflows.

Users would interact via a new `call` + `await` interface. At the implementation level, we can wrap subagents, LLMs, and other callable resources as Actions and automatically orchestrate their request/response flows: `call` sends the request (with a framework-provided request wrapper, response dispatcher, and completion signaling), while `await` waits for the completion signal and retrieves the result, similar to `execute_async`. This approach can reuse most of our existing capabilities, including support for parallel subagent invocations by executing actions in parallel, which is already validated. (Minor enhancements are still needed, though; I'll raise a separate discussion on that.)

2. 
Definition and Execution of the Subagent Itself

Building on the foundation above, how users define and use subagents becomes clear: a subagent can be as simple as a single Action, or a complex workflow orchestrated via event subscription. At runtime, it can be wrapped as a callable resource and called directly from the main agent, while the framework internally continues to use event-based subscription and scheduling.

However, a subagent entails more than just executing an Action or Action chain. It may also require isolated context, an independent toolset, specialized prompts, dedicated compute resources, and more. I haven't deeply analyzed the requirements specific to subagents yet. Please feel free to share your ideas.

Regarding subagent execution: based on the approach outlined in section 1, we can already run subagents within the same TaskManager. However, due to the GIL, this model cannot run multiple subagents in parallel. This may suffice for simple, LLM-centric logic, but for more complex scenarios we likely need to run subagents in isolated processes or on dedicated external resources, to prevent subagents from affecting the main agent's stability or competing for its compute resources.

Currently, the RPC Operator planned in FLIP-577 appears to be a promising option. As Flink's new infrastructure for AI workloads, it enables unified lifecycle and resource management at the job level, while supporting flexible, independent scaling, fault tolerance, and targeted communication optimizations. We can keep an eye on it.

GitHub link: https://github.com/apache/flink-agents/discussions/660#discussioncomment-16902363
