Hi Robert,

Thanks for the careful read — these are exactly the right questions to push
on. Let me address them in order, and then come back to your overall
framing point at the end.

   1. RpcOperator vs. AsyncOperator

These solve different problems. AsyncOperator is a programming model for
non-blocking I/O within a Task — the async UDF still lives in the same
operator chain and shares the Task's resource envelope and lifecycle with
its upstream/downstream CPU operators.

RpcOperator is a deployment primitive: the GPU compute is split out of the
data-plane topology and runs as an independently scheduled, scaled, and
recovered service. The two are complementary — you would still use async
semantics to call an RpcOperator from a CPU operator.
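
To make the complementarity concrete, here is a rough sketch of the CPU side
using the existing AsyncFunction API. GpuClient and its wiring are
placeholders I'm inventing purely for illustration; the actual RpcOperator
interface will be defined in the sub-FLIP.

import java.util.Collections;
import java.util.concurrent.CompletableFuture;

import org.apache.flink.streaming.api.functions.async.AsyncFunction;
import org.apache.flink.streaming.api.functions.async.ResultFuture;

public class RemoteInference implements AsyncFunction<String, String> {

    /** Hypothetical non-blocking client for the external GPU service. */
    public interface GpuClient extends java.io.Serializable {
        CompletableFuture<String> infer(String uri);
    }

    private final GpuClient client;

    public RemoteInference(GpuClient client) {
        this.client = client;
    }

    @Override
    public void asyncInvoke(String uri, ResultFuture<String> out) {
        // Non-blocking: the Task thread is free while inference runs remotely;
        // the callback completes the record when the service responds.
        client.infer(uri).whenComplete((result, err) -> {
            if (err != null) {
                out.completeExceptionally(err);
            } else {
                out.complete(Collections.singletonList(result));
            }
        });
    }
}

// Wiring, with a timeout sized for long-running inference:
//   AsyncDataStream.unorderedWait(
//       uris, new RemoteInference(client), 10, TimeUnit.MINUTES, 100);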

The motivation is specific to AI workloads:

Old: Source -> StreamingOperator(GPU+CPU) -> Sink
New: Source -> StreamingOperator(CPU) --rpc--> RpcOperator(GPU) -> Sink

Three workload characteristics drive this split:

   - Failover cost is high. A single inference can take minutes (e.g., a long
   reasoning trace on a math problem can exceed 10 minutes). Combined with the
   data inflation typical of multimodal pipelines — inputs are often just URIs
   to images or videos — replaying after a failure is extremely expensive.
   (Buffer debloating helps but doesn't fully solve it in many complex
   situations.)
   - Non-disruptive elastic scaling is a hard requirement. GPU is expensive
   and scarce; users cannot tolerate idle GPUs, and when GPU capacity becomes
   available it must be absorbed immediately, without disrupting the stream.
   CPU parallelism may need to adjust along with GPU scaling. Achieving this
   is only possible if GPU compute is decoupled from the CPU data path —
   otherwise their lifecycles are forced to move in lockstep.
   - A large class of AI topologies has no KeyBy. Without KeyBy there is no
   state migration, which is precisely what makes independent CPU/GPU scaling
   tractable in the first place.


   2. Non-Disruptive Scaling and Exactly-Once

Good question — let me be concrete. The mechanism is handover via Pipeline
Region, not in-place reconfiguration (a code sketch of both sequences follows
the list):

   - Scale-out: a new Pipeline Region is created and initialized first; the
   switchover happens on a checkpoint boundary, with the old Region draining
   its in-flight data before being released. Exactly-once is preserved because
   the cutover is anchored on a globally consistent checkpoint, just as it is
   during normal recovery.

   - Scale-in: the target instances are deregistered from the load balancer
   first, in-flight requests are drained, and resources are released after the
   next successful checkpoint.
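
Here is that sequence in sketch form. Every type and method name below is a
hypothetical stand-in I'm using for illustration; only the ordering of the
steps reflects the mechanism being proposed.

import java.util.Set;

final class HandoverSketch {

    interface PipelineRegion {
        void awaitRunning();          // fully deployed and initialized
        void drainInFlightData();
        void release();
    }

    interface LoadBalancer {
        void deregister(Set<String> instances);
        void awaitDrained(Set<String> instances);
    }

    interface CheckpointCoordinator {
        long triggerAndAwait();       // globally consistent checkpoint
        void awaitNextSuccessful();
    }

    private final LoadBalancer loadBalancer;
    private final CheckpointCoordinator checkpoints;

    HandoverSketch(LoadBalancer lb, CheckpointCoordinator cp) {
        this.loadBalancer = lb;
        this.checkpoints = cp;
    }

    /** Scale-out: new Region first, cutover on a checkpoint boundary, then drain. */
    void scaleOut(PipelineRegion oldRegion, PipelineRegion newRegion) {
        newRegion.awaitRunning();                 // initialize before any cutover
        long checkpointId = checkpoints.triggerAndAwait();
        switchInputTo(newRegion, checkpointId);   // anchored on the checkpoint
        oldRegion.drainInFlightData();            // old Region finishes its work
        oldRegion.release();
    }

    /** Scale-in: deregister, drain, release only after the next checkpoint. */
    void scaleIn(PipelineRegion region, Set<String> targets) {
        loadBalancer.deregister(targets);         // stop routing new requests
        loadBalancer.awaitDrained(targets);       // let in-flight requests finish
        checkpoints.awaitNextSuccessful();        // hold resources until then
        region.release();
    }

    private void switchInputTo(PipelineRegion region, long checkpointId) {
        // Routing change applied together with the checkpoint, so the cutover
        // point is globally consistent, just as in normal recovery.
    }
}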

This matters specifically for AI workloads because GPU is expensive and
scarce — idle capacity can't be tolerated, and newly available capacity has
to be absorbed immediately without disrupting the stream. The traditional
stop-and-restart path is unacceptable when a single in-flight inference can
take minutes.

   3. On "engine improvements vs. AI-native"

This is a fair framing question and I want to address it directly. You're
right that, viewed in isolation, "make checkpoints faster" and "scale
without restart" look like generic engine improvements. The reason they are
grouped under this umbrella is that AI workload characteristics push these
mechanisms past their current breaking points, in ways that BI workloads
did not:

   - Per-record processing cost is orders of magnitude higher than in BI
   (seconds to minutes per inference vs. microseconds per row), so the cost of
   any failover or restart is qualitatively different.

   - Spot-based GPU deployment makes resource churn frequent rather than
   exceptional.

   - Asynchronous, long-tailed inference operators expose mailbox and
   unaligned checkpoint (UC) limitations that simply weren't hit under
   traditional row-oriented workloads.

So I'd frame these less as "AI features" and more as "runtime mechanisms
whose existing implementations are adequate for BI but become bottlenecks
under AI workloads." Happy to make this motivation more explicit in the
umbrella text if it would help.

   4. Connector API for Multimodal Source/Sink

Leonard has already replied on this thread with the rationale, so I won't
repeat it here. The detailed design will be developed in the sub-FLIP.

Thanks again — happy to keep digging on any of these.

Best,
Guowei
