GitHub user weiqingy created a discussion: [Discussion] Supervisor + Sub-Agent 
Orchestration: Concrete Multi-Agent Use Cases for Event-Driven Agents

This thread proposes a concrete use case and primitive set for multi-agent 
orchestration — directly responding to @xintongsong's question in 
[#516](https://github.com/apache/flink-agents/discussions/516):

> *"How important is it to support multi-agent system for event-driven agents? 
> What are the concrete use cases? And how do we build a multi-agent system on 
> Flink's streaming engine?"*

It builds on the gap analysis @thefalc opened in 
[#84](https://github.com/apache/flink-agents/discussions/84) (which identified 
"agent self-reflection / output evaluation" and "orchestrator-worker / 
hierarchical patterns" as core multi-agent gaps but didn't propose a specific 
primitive set) and the foundations being laid by 
[#429](https://github.com/apache/flink-agents/discussions/429) (async 
execution) and [#598](https://github.com/apache/flink-agents/discussions/598) 
(durable execution reconcile).

### Motivation: capability orchestration, not LLM chaining

Internal production discussions at my company landed on a framing worth 
surfacing:

> **"Agentic" should not mean "calling LLMs in a loop." It should mean 
> autonomously orchestrating heterogeneous capabilities — LLMs, internal 
> services, REST APIs, ML models, *other agents* — based on reasoning. Any 
> text-in / text-out service should be a peer to an LLM in the orchestration 
> graph.**

Today flink-agents leans LLM-centric in naming, examples, and resource model 
(`CHAT_MODEL` is first-class; everything else is "a tool"). This framing limits 
perceived applicability — users see "Flink + LLM calls" rather than "Flink for 
autonomous capability orchestration." That's a positioning weakness; 
capability-agnostic framing is what enterprise teams want.

### Concrete use case: supervisor + sub-agent with iterative refinement

A pattern repeatedly requested by enterprise teams — and well-established in 
the Python ecosystem (LangGraph `create_supervisor`, CrewAI hierarchical mode, 
AutoGen GroupChat).

**Example: real-time customer support ticket triage on a Kafka stream**

```
Kafka ticket stream
        │
        ▼
┌────────────────┐
│   Supervisor   │  reads ticket, decides which sub-agent to delegate
│     Agent      │
└────────┬───────┘
         │
   ┌─────┴──────┬──────────────┬───────────────┐
   ▼            ▼              ▼               ▼
Classifier   Knowledge      Ticket-API     External Agent
 Sub-Agent   Search          (REST,         (LangChain
 (LLM)       Sub-Agent       no LLM)        wrapped as
              (LLM+RAG)                     REST endpoint)
   │            │              │               │
   └────────────┴──────────────┴───────────────┘
                       │
                       ▼
                Supervisor evaluates:
                "Is this answer complete & accurate?"
                       │
                ┌──────┴───────┐
                │ NO           │ YES
                ▼              ▼
        Refine prompt    Emit response
        with feedback    to output topic
        → loop back
        (bounded by max rounds /
         quality threshold / cost budget)
```

Key properties:

- **Heterogeneous sub-agents** — LLM, RAG, REST service, external agent — look 
the same to the supervisor.
- **Iterative refinement** — supervisor judges sub-agent output and re-prompts 
with feedback context until satisfied.
- **Streaming-native** — runs continuously over a Kafka stream, not per-request.
- **Durable** — multi-round loop survives task manager crashes via Flink keyed 
state + checkpoints.

### Why this pattern fits flink-agents (vs LangGraph / CrewAI)

LangGraph already does supervisor + refinement very well — *per request, in one 
Python process, with DB-backed checkpointing*. flink-agents' opportunity is the 
**same pattern over continuous streaming inputs with distributed exactly-once 
durability**. A multi-round refinement loop processing millions of events per 
day, surviving node failures, is something neither LangGraph nor CrewAI can 
offer. That answers @xintongsong's question on "how is it different on a 
streaming engine."

### Primitives to discuss

I checked all open PRs/issues — nothing addresses these. ResourceType has 
`CHAT_MODEL`, `TOOL`, `MCP_SERVER`, `SKILLS` but no abstraction for "another 
agent" or "remote text service as peer to LLM." ReActAgent is single-agent 
only. No correlation/reply primitives for cross-job calls.

The list below is a starting proposal — **the goal of this thread is to decide 
which of these are must-have to unblock the use case, which can wait, and which 
are out of scope.**

**Tier 1**

1. **Unified callable resource type** — one abstraction subsuming Tool + REST 
service + sub-agent (sync HTTP, async Kafka, MCP). Supervisor shouldn't care 
about wire protocol. Directly addresses the capability-agnostic framing.

2. **Async cross-job RPC pattern** — correlation IDs, reply topics, timeouts, 
retries, in-flight state in Flink keyed state. This is the 
**streaming-durability differentiator** — Flink already gives us durable state 
+ exactly-once; we should expose it as a clean "delegate to another flink-agent 
and await reply" primitive instead of forcing every team to reinvent the 
plumbing.

**Tier 2** (likely starts as recipes, promoted to primitives after validation)

3. **Judge / critic step** — document the pattern first with example code; 
formalize only after multiple users converge on the shape. Avoids 
over-abstraction.

4. **Richer loop termination** — quality threshold, budget (tokens, wall-clock, 
rounds), not just `AGENT_MAX_ITERATIONS`.

**Tier 0** (free wins, can do now)

5. **Reframe docs/examples** — lead with "agents orchestrate capabilities," not 
"agents call LLMs." Add a multi-agent reference example using *existing* 
primitives (Tool + ReActAgent + Kafka request/reply) to demonstrate the pattern 
works today, even if rough.

### Open questions for the community

1. Is this use case compelling enough to move multi-agent from **"Not likely"** 
to **"If possible"** in the [0.3 roadmap 
(#516)](https://github.com/apache/flink-agents/discussions/516) (or 0.4)? The 
supervisor pattern has broad industry validation; the streaming-durable variant 
is uniquely ours.
2. Should the supervisor pattern be built first as a **recipe/example** using 
existing primitives, validated, and only then formalized into framework 
primitives? I'd argue yes — avoids LangChain's "too many abstractions too fast" 
mistake.
3. Should we add new resource types (`SubAgent`, `RemoteCapability`), or extend 
existing `TOOL` / `MCP_SERVER`? Extending is less disruptive; new types are 
cleaner.
4. Cross-job RPC over Kafka — correlation ID + reply topic is the obvious 
shape, but timer-based timeouts, retries, and dead-lettering deserve a 
dedicated design proposal.

Linking @xintongsong @thefalc @yanand0909 since you've engaged on adjacent 
topics in [#516](https://github.com/apache/flink-agents/discussions/516) / 
[#84](https://github.com/apache/flink-agents/discussions/84).


GitHub link: https://github.com/apache/flink-agents/discussions/660

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

Reply via email to