Re: [D] [Discussion] Supervisor + Sub-Agent Orchestration: Concrete Multi-Agent Use Cases for Event-Driven Agents [flink-agents]

via GitHub Tue, 12 May 2026 04:46:49 -0700


GitHub user xintongsong added a comment to the discussion: [Discussion] 
Supervisor + Sub-Agent Orchestration: Concrete Multi-Agent Use Cases for 
Event-Driven Agents


Hi @weiqingy,

Thanks for putting this thread together — it does a good job laying out 
concrete use cases and the open questions to discuss. On the big picture: I 
think sub-agent support is worth pursuing.

My "Not likely" comment in #516 reflects where I stood back in February. Since then 
we've had a lot of conversations with team members and some users we're talking 
to, and my view on sub-agents has evolved. As I mentioned in the email 
@wenjin272 linked, if Flink Agents can support sub-agents, I see three benefits:

- Better compatibility with the broader agent skill ecosystem
- Better context isolation
- Independently scalable shared resource pools

So putting this on the 0.3 / 0.4 roadmap makes sense.

---

Two quick questions before going further:

1. How are the Tier 0 / 1 / 2 priorities organized? Tier 2 and Tier 0 both have 
short notes ("likely starts as recipes", "free wins"), but Tier 1 doesn't — 
curious about your original intent.
2. By "SubAgent", do you mean another standalone Flink Agents job? I read yes 
from "cross-job RPC over Kafka" and "delegate to another flink-agent and await 
reply", but want to confirm since it affects the where-it-runs discussion below.

---

Now to share how we've been thinking about sub-agents.

Before getting into the supervisor + sub-agent pattern itself, I think it's 
worth pinning down where sub-agents actually run. There are roughly three 
possibilities (the first one splits into two sub-cases):

**1. Supervisor and sub-agents in the same Flink Agents job**

- **1.a In the same operator**: Today's Flink Agents can already support this: 
you write the supervisor and each sub-agent as separate actions, each with its 
own prompt and toolset, and use events to invoke a sub-agent and return 
results. It isn't user-friendly yet, though: users have to do a lot of the 
wiring themselves. This approach also doesn't provide the "independently 
scalable shared resource pool" benefit.
- **1.b In different operators**: Sub-agent resource pools can be scaled 
independently, and a sub-agent can be shared across multiple supervisors in the 
same job. To support this, we need a way to do request & response loops between 
two Flink operators. The RpcOperator from FLIP-577 looks like a nice fit here.
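
As a rough illustration of shape 1.a, the sketch below wires a supervisor and one sub-agent together as separate actions that communicate only through events. It is plain Python, not the Flink Agents API: `Event`, the `action` decorator, the event type names, and the `researcher` sub-agent are all hypothetical, and the LLM calls are replaced by canned strings.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Event:
    type: str      # routing key, e.g. "input" or "subagent_request"
    payload: dict


# Hypothetical registry mapping an event type to the action that handles it.
ACTIONS = {}


def action(event_type):
    """Register a function as the handler for one event type."""
    def register(fn):
        ACTIONS[event_type] = fn
        return fn
    return register


@action("input")
def supervisor_plan(event):
    # The supervisor decides to delegate; a real one would call an LLM
    # with its own prompt and toolset here.
    return [Event("subagent_request", {"task": event.payload["question"]})]


@action("subagent_request")
def researcher(event):
    # The sub-agent has its own prompt and toolset; here the answer is canned.
    return [Event("subagent_response", {"result": f"notes on {event.payload['task']}"})]


@action("subagent_response")
def supervisor_finish(event):
    # The supervisor turns the sub-agent result into the final output.
    return [Event("output", {"answer": f"final: {event.payload['result']}"})]


def run(question):
    """Dispatch events until an output event appears."""
    pending = deque([Event("input", {"question": question})])
    while pending:
        event = pending.popleft()
        if event.type == "output":
            return event.payload["answer"]
        pending.extend(ACTIONS[event.type](event))
    return None
```

Even in this toy version, the request and response event types and the routing between them are hand-written for each sub-agent, which is the kind of wiring a built-in could take over.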

**2. Supervisor and sub-agents in different Flink Agents jobs**

I'm not fully sold on the use case for this shape yet. Multiple operators and 
agents inside a single Flink job share a lifecycle and deployment story, so if 
a sub-agent is required for the supervisor to run, putting them in the same job 
seems more natural.

My guess is you're proposing Kafka between Flink Agents jobs to handle the case 
where a sub-agent isn't available when the supervisor calls it? But what's the 
advantage over just colocating them in the same job?

That said, I can think of a legitimate scenario for multi-agent collaboration 
across jobs: each agent owns a dedicated responsibility along with the data / 
state it needs, and processes tasks coming from various requesters. Think of a 
company with separate procurement, sales, warehousing, logistics, and 
after-sales departments, where orders flow between departments without always 
going through a supervisor. This looks more like a system of independent 
services, each with its own mailbox / request queue: upstream drops tasks into 
Kafka, the current service picks them up, processes them, and forwards results 
to the downstream mailbox. Kafka fits naturally here, but this is a different 
architecture from supervisor-subagent.

The other way around — connecting supervisor and sub-agents via Kafka — means 
each supervisor / sub-agent pair needs two queues (input and output). That 
feels complex and not very natural.
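
To make the queue count concrete, here is a small sketch that enumerates the topics needed if every supervisor / sub-agent pair gets its own request and reply queue. The topic naming scheme is made up for illustration, and the assumption that every supervisor talks to every sub-agent is the worst case, not something the proposal states.

```python
def queues_for_pairs(supervisors, sub_agents):
    """Return one (request, reply) topic pair per supervisor / sub-agent pair."""
    return {
        (s, a): (f"{s}-to-{a}.requests", f"{a}-to-{s}.replies")
        for s in supervisors
        for a in sub_agents
    }


def queue_count(supervisors, sub_agents):
    # Two queues (input and output) for each pair.
    return 2 * len(supervisors) * len(sub_agents)
```

With 2 supervisors and 3 sub-agents, this already means 12 queues, compared with 3 mailboxes if each sub-agent were an independent service with one request queue.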

So on your Tier 1 item 2 (async cross-job RPC pattern) and open question 4 
(cross-job RPC over Kafka design), I'd suggest first clarifying the use case 
for cross-job, then discussing the concrete design.

**3. Supervisor in a Flink Agents job, sub-agents served by an external RPC 
framework**

This is really just an async remote call inside a custom action. The server 
side could be an agent or any other RPC / HTTP service; it makes no difference 
to the supervisor.
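
A minimal sketch of shape 3, assuming plain `asyncio` rather than any Flink Agents action API: `call_remote_sub_agent` is a hypothetical stand-in for an RPC or HTTP client call, and the timeout value is arbitrary.

```python
import asyncio


async def call_remote_sub_agent(task: str) -> str:
    # Stand-in for an RPC / HTTP request to an external service.
    await asyncio.sleep(0.01)
    return f"remote result for {task}"


async def supervisor_action(tasks, timeout_s=1.0):
    """Fan out one remote call per task and await all replies, in order."""
    calls = [
        asyncio.wait_for(call_remote_sub_agent(task), timeout=timeout_s)
        for task in tasks
    ]
    return await asyncio.gather(*calls)


def run_supervisor(tasks):
    return asyncio.run(supervisor_action(tasks))
```

The caller only sees an awaitable that yields a result or times out, so whether the remote side is an agent or an ordinary service does not change the supervisor code.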

---

On our team's end, the main focus is 1.b (same job, different operators). This 
depends on RpcOperator, so it's unlikely to land in 0.3.

In parallel, for cases where the sub-agent doesn't carry a heavy workload, I think 
there's a nice opportunity for the community to make 1.a (same job, same 
operator) more user-friendly: a built-in supervisor + sub-agent implementation 
along the lines of ReActAgent — possibly even by extending ReActAgent directly 
— to cut down on what users have to wire by hand. This looks doable within 0.3. 
Once RpcOperator is ready, the same built-in can be extended to offer a choice 
between "sub-agent in-operator" and "sub-agent in an independent resource pool".

This built-in implementation incidentally also covers two of your Tier 2 items:

- **Judge / critic step** (item 3): can be a built-in step inside this 
implementation
- **Richer loop termination** (item 4): quality threshold, token / wall-clock / 
round budgets can all be exposed as config
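
One way the judge step and the termination budgets could fit together is sketched below. This is an assumption about how such a built-in might look, not its actual design: `LoopBudget`, its field names and defaults, and the `generate` / `judge` callables are all hypothetical.

```python
import time
from dataclasses import dataclass


@dataclass
class LoopBudget:
    quality_threshold: float = 0.8   # stop once the judge score reaches this
    max_rounds: int = 5              # round budget
    max_tokens: int = 4000           # token budget across all rounds
    max_seconds: float = 30.0        # wall-clock budget


def run_loop(generate, judge, budget):
    """Iterate until the judge is satisfied or a budget is exhausted.

    generate(round_no) -> (draft, tokens_used); judge(draft) -> score in [0, 1].
    Returns the best draft seen and the reason the loop stopped.
    """
    started = time.monotonic()
    tokens = 0
    best_draft, best_score = None, -1.0
    for round_no in range(1, budget.max_rounds + 1):
        draft, used = generate(round_no)
        tokens += used
        score = judge(draft)  # the judge / critic step
        if score > best_score:
            best_draft, best_score = draft, score
        if score >= budget.quality_threshold:
            return best_draft, "quality_threshold"
        if tokens >= budget.max_tokens:
            return best_draft, "token_budget"
        if time.monotonic() - started >= budget.max_seconds:
            return best_draft, "wall_clock_budget"
    return best_draft, "round_budget"
```

Returning the stop reason alongside the draft lets callers distinguish a result that passed the judge from one that merely ran out of budget.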

So compared to "start as recipe / example" in Tier 0 / Tier 2, going one step 
further and providing a built-in implementation feels better to me — friendlier 
for users, and the community can iterate on a single shared implementation.

---

To wrap up, a few points on specific items in the proposal:

**On "flink-agents leans LLM-centric"** (Motivation / Tier 0 item 5): I'd push 
back a bit here. Looking at the single-agent orchestration design today, Flink 
Agents is really workflow orchestration with action and event as the basic 
units — calling an LLM, calling a tool, searching a vector store are all just 
different action types, peers to each other. Once multi-agent support lands, 
it'll extend to orchestration between agents. So I don't see the current design 
itself as LLM-centric. That said, if docs or examples are giving that 
impression, that's a separate matter. Happy to look at specific descriptions 
you find misleading and figure out how to improve them.

**On sub-agent-specific primitives** (open question 3): Agreed they're needed, 
but doing it well takes careful design, and 0.3 looks tight. Punting to the 
next release cycle feels safer. In the meantime, we can introduce the sub-agent 
concept on top of the ReActAgent built-in. Once the Flink Agents API formally 
introduces sub-agent primitives, we just update the built-in, and users won't 
notice the change.

**On a unified callable resource type** (Tier 1 item 1): I'd hold off on this 
for now, no rush to abstract. Tool, REST service, and sub-agent are already 
familiar standalone concepts to both users and models. A unified abstraction 
looks cleaner conceptually, but it doesn't really add capability, and it seems 
to do little to lower the learning curve. If we later hear users actually 
complaining about switching between the three, we can revisit then.

GitHub link: 
https://github.com/apache/flink-agents/discussions/660#discussioncomment-16891962

