Hi Robert,

Thanks for the careful read; these are exactly the right questions to push on. Let me address them in order, and then come back to your overall framing point at the end.
1. RpcOperator vs. AsyncOperator

These solve different problems. AsyncOperator is a programming model for non-blocking I/O within a Task: the async UDF still lives in the same operator chain and shares the Task's resource envelope and lifecycle with its upstream/downstream CPU operators. RpcOperator is a deployment primitive: the GPU compute is split out of the data-plane topology and runs as an independently scheduled, scaled, and recovered service. The two are complementary; you would still use async semantics to call an RpcOperator from a CPU operator (a small sketch of that calling pattern follows after point 2 below).

The motivation is specific to AI workloads:

Old: Source -> StreamingOperator(GPU+CPU) -> Sink
New: Source -> StreamingOperator(CPU) --rpc--> RpcOperator(GPU) -> Sink

Three workload characteristics drive this split:

- Failover cost is high. A single inference can take minutes (e.g., a long reasoning trace on a math problem can exceed 10 minutes). Combined with the data inflation typical of multimodal pipelines (inputs are often just URIs to images or videos, so the full payload is materialized inside the pipeline), replaying after a failure is extremely expensive. Buffer debloating helps but does not fully solve this in many complex situations.
- Non-disruptive elastic scaling is a hard requirement. GPUs are expensive and scarce; users cannot tolerate idle GPUs, and when GPU capacity becomes available it must be absorbed immediately, again without disrupting the stream. CPU parallelism may need to adjust along with GPU scaling. Achieving this is only possible if GPU compute is decoupled from the CPU data path; otherwise their lifecycles are forced to move in lockstep.
- A large class of AI topologies has no KeyBy. Without KeyBy there is no state migration to deal with, which is precisely what makes independent CPU/GPU scaling tractable in the first place.

2. Non-Disruptive Scaling and Exactly-Once

Good question; let me be concrete. The mechanism is handover via Pipeline Region, not in-place reconfiguration:

- Scale-out: a new Pipeline Region is created and initialized first; the switchover happens on a checkpoint boundary, with the old Region draining its in-flight data before being released. Exactly-once is preserved because the cutover is anchored on a globally consistent checkpoint, just as it is during normal recovery.
- Scale-in: the target instances are deregistered from the load balancer first, in-flight requests are drained, and resources are released after the next successful checkpoint.

This matters specifically for AI workloads because GPUs are expensive and scarce: idle capacity can't be tolerated, and newly available capacity has to be absorbed immediately without disrupting the stream. The traditional stop-and-restart path is unacceptable when a single in-flight inference can take minutes.
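Coming back to point 1, here is a minimal sketch of the CPU-side calling pattern, using Flink's existing AsyncFunction. The RpcCall handle is a hypothetical placeholder for whatever client stub the RpcOperator design eventually exposes; it is not the FLIP's actual API, it only illustrates that the CPU operator keeps the async programming model while the GPU compute sits behind an RPC:

import java.io.Serializable;
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import org.apache.flink.streaming.api.functions.async.AsyncFunction;
import org.apache.flink.streaming.api.functions.async.ResultFuture;

public class GpuRpcAsyncFunction implements AsyncFunction<String, String> {

    // Hypothetical, serializable handle that issues the RPC to the separately
    // deployed GPU service; it stands in for the not-yet-designed RpcOperator client stub.
    public interface RpcCall extends Function<String, CompletableFuture<String>>, Serializable {}

    private final RpcCall rpcCall;

    public GpuRpcAsyncFunction(RpcCall rpcCall) {
        this.rpcCall = rpcCall;
    }

    @Override
    public void asyncInvoke(String inputUri, ResultFuture<String> resultFuture) {
        // The CPU operator never blocks on the long-running inference; it only registers
        // a callback that emits the result (or the error) once the RPC completes.
        rpcCall.apply(inputUri).whenComplete((result, error) -> {
            if (error != null) {
                resultFuture.completeExceptionally(error);
            } else {
                resultFuture.complete(Collections.singletonList(result));
            }
        });
    }
}

It would be wired into the CPU pipeline with AsyncDataStream.unorderedWait(...), with a generous timeout and capacity because a single inference can take minutes.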
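And for point 2, a minimal sketch of the scale-in ordering described above (deregister, drain, wait for the next successful checkpoint, release). The RpcInstance and CheckpointSignal interfaces are hypothetical placeholders for the runtime machinery, not proposed APIs; the point is only the ordering of the steps:

import java.util.List;
import java.util.concurrent.CompletableFuture;

public final class ScaleInSequence {

    // Hypothetical handle to one GPU RpcOperator instance scheduled for removal.
    public interface RpcInstance {
        void deregisterFromLoadBalancer();          // stop receiving new requests
        CompletableFuture<Void> drainInFlight();    // finish requests already accepted
        void release();                             // return the GPU resources
    }

    // Hypothetical hook that completes when the next checkpoint succeeds.
    public interface CheckpointSignal {
        CompletableFuture<Long> nextSuccessfulCheckpoint();
    }

    public static CompletableFuture<Void> scaleIn(List<RpcInstance> targets, CheckpointSignal checkpoints) {
        // 1. Take the targets out of the load balancer so no new requests reach them.
        targets.forEach(RpcInstance::deregisterFromLoadBalancer);

        // 2. Let every in-flight inference finish; these can take minutes.
        CompletableFuture<Void> drained = CompletableFuture.allOf(
                targets.stream().map(RpcInstance::drainInFlight).toArray(CompletableFuture[]::new));

        // 3. Release the instances only after the next successful checkpoint, so anything
        //    not yet covered by a checkpoint can still be replayed for exactly-once.
        return drained
                .thenCompose(ignored -> checkpoints.nextSuccessfulCheckpoint())
                .thenRun(() -> targets.forEach(RpcInstance::release));
    }
}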
3. On "engine improvements vs. AI-native"

This is a fair framing question and I want to address it directly. You're right that, viewed in isolation, "make checkpoints faster" and "scale without restart" look like generic engine improvements. The reason they are grouped under this umbrella is that AI workload characteristics push these mechanisms past their current breaking points, in ways that BI workloads did not:

- Per-record processing cost is 3-4 orders of magnitude higher than in BI (seconds to minutes per inference vs. microseconds per row), so the cost of any failover or restart is qualitatively different.
- Spot-based GPU deployment makes resource churn frequent rather than exceptional.
- Asynchronous, long-tailed inference operators expose mailbox/UC limitations that simply weren't hit under traditional row-oriented workloads.

So I'd frame these less as "AI features" and more as "runtime mechanisms whose existing implementations are adequate for BI but become bottlenecks under AI workloads." Happy to make this motivation more explicit in the umbrella text if it would help.

4. Connector API for Multimodal Source/Sink

Leonard has already replied on this thread with the rationale, so I won't repeat it here. The detailed design will be developed in the sub-FLIP.

Thanks again; happy to keep digging on any of these.

Best,
Guowei
