GitHub user xintongsong added a comment to the discussion: [Discussion]
Supervisor + Sub-Agent Orchestration: Concrete Multi-Agent Use Cases for
Event-Driven Agents
Hi @weiqingy,
Thanks for putting this thread together — it does a good job laying out
concrete use cases and the open questions to discuss. On the big picture: I
think sub-agent support is worth pursuing.
That "Not likely" line of mine in #516 is actually a February take. Since then
we've had a lot of conversations with team members and some users we're talking
to, and my view on sub-agents has evolved. As I mentioned in the email
@wenjin272 linked, if Flink Agents can support sub-agents, I see three benefits:
- Better compatibility with the broader agent skill ecosystem
- Better context isolation
- Independently scalable shared resource pools
So putting this on the 0.3 / 0.4 roadmap makes sense.
---
Two quick questions before going further:
1. How are the Tier 0 / 1 / 2 priorities organized? Tier 2 and Tier 0 both have
short notes ("likely starts as recipes", "free wins"), but Tier 1 doesn't —
curious about your original intent.
2. By "SubAgent", do you mean another standalone Flink Agents job? I read yes
from "cross-job RPC over Kafka" and "delegate to another flink-agent and await
reply", but want to confirm since it affects the where-it-runs discussion below.
---
Now to share how we've been thinking about sub-agents.
Before getting into the supervisor + sub-agent pattern itself, I think it's
worth pinning down where sub-agents actually run. There are roughly three
possibilities (the first one splits into two sub-cases):
**1. Supervisor and sub-agents in the same Flink Agents job**
- **1.a In the same operator**: Today's Flink Agents can already support this:
you write the supervisor and each sub-agent as separate actions, each with its
own prompt and toolset, and use events to invoke a sub-agent and return
results. The support today isn't friendly though; users have to wire a lot of
things themselves. And this approach doesn't get you the "independently
scalable shared resource pool" benefit.
- **1.b In different operators**: Sub-agent resource pools can be scaled
independently, and a sub-agent can be shared across multiple supervisors in the
same job. To support this, we need a way to do request & response loops between
two Flink operators. The RpcOperator from FLIP-577 looks like a nice fit here.
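
To make 1.a concrete, here is a minimal plain-Python sketch of the event wiring users do by hand today. It does not use the Flink Agents API: the event classes, the `researcher` sub-agent, and the `run` loop are all hypothetical stand-ins for actions and events inside a single operator.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class SubAgentRequest:
    """Event the supervisor emits to invoke a sub-agent (hypothetical)."""
    agent: str
    task: str


@dataclass
class SubAgentResponse:
    """Event a sub-agent emits to hand its result back (hypothetical)."""
    agent: str
    result: str


def supervisor_on_input(task):
    # A real supervisor would ask an LLM which sub-agent to use.
    return [SubAgentRequest(agent="researcher", task=task)]


def researcher_action(event):
    # Stand-in for a sub-agent with its own prompt and toolset.
    return [SubAgentResponse(agent=event.agent, result=f"notes on {event.task}")]


def run(task):
    """Drain a single in-process event queue, as one operator would."""
    queue = deque(supervisor_on_input(task))
    outputs = []
    while queue:
        event = queue.popleft()
        if isinstance(event, SubAgentRequest):
            queue.extend(researcher_action(event))
        elif isinstance(event, SubAgentResponse):
            outputs.append(f"supervisor received: {event.result}")
    return outputs
```

Everything shares one queue and one process here, which is why this shape cannot scale the sub-agent independently of the supervisor.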
**2. Supervisor and sub-agents in different Flink Agents jobs**
I'm not fully sold on the use case for this shape yet. Multiple operators and
agents inside a single Flink job share a lifecycle and deployment story, so if
a sub-agent is required for the supervisor to run, putting them in the same job
seems more natural.
My guess is you're proposing Kafka between Flink Agents jobs to handle the case
where a sub-agent isn't available when the supervisor calls it? But what's the
advantage over just colocating them in the same job?
That said, I can think of a legitimate scenario for multi-agent collaboration
across jobs: each agent owns a dedicated responsibility along with the data /
state it needs, and processes tasks coming from various requesters. Think of a
company with separate procurement, sales, warehousing, logistics, and
after-sales departments, where orders flow between departments without always
going through a supervisor. This looks more like a system of independent
services, each with its own mailbox / request queue: upstream drops tasks into
Kafka, the current service picks them up, processes them, and forwards results
to the downstream mailbox. Kafka fits naturally here, but this is a different
architecture from supervisor-subagent.
The other way around — connecting supervisor and sub-agents via Kafka — means
each supervisor / sub-agent pair needs two queues (input and output). That
feels complex and not very natural.
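
As a rough illustration of the mailbox-style system described above, the sketch below uses in-memory `queue.Queue` objects as stand-ins for Kafka topics. The service names and the `make_service` helper are assumptions for illustration only. Note that each service needs just one inbox, and results flow forward rather than back to a caller.

```python
import queue


def make_service(name, inbox, outbox):
    """Return a step function that moves one task from inbox to outbox."""
    def step():
        order = inbox.get_nowait()
        # Record that this service handled the order, then forward it.
        order = dict(order, history=order["history"] + [name])
        outbox.put(order)
    return step


# One mailbox per service plus a final sink; names are illustrative.
sales_box, warehouse_box, logistics_box, done = (queue.Queue() for _ in range(4))

sales = make_service("sales", sales_box, warehouse_box)
warehouse = make_service("warehouse", warehouse_box, logistics_box)
logistics = make_service("logistics", logistics_box, done)

sales_box.put({"id": 1, "history": []})
for step in (sales, warehouse, logistics):
    step()

final_order = done.get_nowait()
```

A supervisor / sub-agent pair built on the same primitive would instead need a request queue and a reply queue per pair, which is the extra complexity mentioned above.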
So on your Tier 1 item 2 (async cross-job RPC pattern) and open question 4
(cross-job RPC over Kafka design), I'd suggest first clarifying the use case
for cross-job, then discussing the concrete design.
**3. Supervisor in a Flink Agents job, sub-agents served by an external RPC
framework**
This is really just an async remote call inside a custom action. The server
side could be an agent or any other RPC / HTTP service; it doesn't matter which.
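
A minimal sketch of that shape, assuming plain `asyncio` rather than any Flink Agents API: `call_remote_sub_agent` is a hypothetical stand-in for an RPC or HTTP client call, and `supervisor_action` is a hypothetical custom action that fans out requests and awaits the replies.

```python
import asyncio


async def call_remote_sub_agent(task):
    """Stand-in for an RPC / HTTP client call; the sleep simulates latency."""
    await asyncio.sleep(0.01)
    return f"remote result for {task}"


async def supervisor_action(tasks, timeout_s=1.0):
    """Fan out remote calls concurrently and wait for all replies."""
    calls = [asyncio.wait_for(call_remote_sub_agent(t), timeout_s) for t in tasks]
    # gather returns results in the same order as the submitted calls.
    return await asyncio.gather(*calls)


results = asyncio.run(supervisor_action(["quote", "inventory"]))
```

From the supervisor's side, nothing here depends on whether the remote end is an agent or an ordinary service.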
---
On our team's end, the main focus is 1.b (same job, different operators). This
depends on RpcOperator, so it's unlikely to land in 0.3.
In parallel, for cases where the sub-agent doesn't have a heavy workload, I think
there's a nice opportunity for the community to make 1.a (same job, same
operator) more user-friendly: a built-in supervisor + sub-agent implementation
along the lines of ReActAgent — possibly even by extending ReActAgent directly
— to cut down on what users have to wire by hand. This looks doable within 0.3.
Once RpcOperator is ready, the same built-in can be extended to offer a choice
between "sub-agent in-operator" and "sub-agent in an independent resource pool".
This built-in implementation incidentally also covers two of your Tier 2 items:
- **Judge / critic step** (item 3): can be a built-in step inside this
implementation
- **Richer loop termination** (item 4): quality threshold, token / wall-clock /
round budgets can all be exposed as config
So compared to "start as recipe / example" in Tier 0 / Tier 2, going one step
further and providing a built-in implementation feels better to me — friendlier
for users, and the community can iterate on a single shared implementation.
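
To give a sense of what "exposed as config" could look like for loop termination, here is a hedged sketch. The `LoopBudget` fields, their defaults, and the `stop_reason` helper are all hypothetical; none of them exist in Flink Agents today.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoopBudget:
    """Hypothetical termination settings for a supervisor loop."""
    quality_threshold: float = 0.8
    max_rounds: int = 5
    max_tokens: int = 20_000
    max_wall_clock_s: float = 60.0


def stop_reason(budget, *, score, rounds, tokens, elapsed_s) -> Optional[str]:
    """Return why the loop should stop, or None to keep iterating."""
    if score >= budget.quality_threshold:
        return "quality_threshold"
    if rounds >= budget.max_rounds:
        return "max_rounds"
    if tokens >= budget.max_tokens:
        return "max_tokens"
    if elapsed_s >= budget.max_wall_clock_s:
        return "max_wall_clock"
    return None
```

The `score` input would come from the judge / critic step, so the two Tier 2 items naturally meet in one place.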
---
To wrap up, a few points on specific items in the proposal:
**On "flink-agents leans LLM-centric"** (Motivation / Tier 0 item 5): I'd push
back a bit here. Looking at the single-agent orchestration design today, Flink
Agents is really workflow orchestration with action and event as the basic
units — calling an LLM, calling a tool, searching a vector store are all just
different action types, peers to each other. Once multi-agent support lands,
it'll extend to orchestration between agents. So I don't see the current design
itself as LLM-centric. That said, if docs or examples are giving that
impression, that's a separate matter. Happy to look at specific descriptions
you find misleading and figure out how to improve them.
**On sub-agent-specific primitives** (open question 3): Agreed they're needed,
but doing it well takes careful design, and 0.3 looks tight. Punting to the
next release cycle feels safer. In the meantime, introducing the sub-agent
concept on top of the ReActAgent built-in is the lower-risk move: once the Flink
Agents API formally introduces sub-agent primitives, we just update the
built-in, and users won't notice.
**On a unified callable resource type** (Tier 1 item 1): I'd hold off on this
for now; there's no rush to abstract. Tool, REST service, and sub-agent are
already familiar standalone concepts to both users and models. A unified
abstraction looks cleaner conceptually, but it doesn't really add capability,
and it seems to do little to lower the learning curve. If we later hear users
actually complaining about switching between the three, we can revisit then.
GitHub link:
https://github.com/apache/flink-agents/discussions/660#discussioncomment-16891962