Hi everyone,

I’ve been following the rapid maturation of *Spark Connect* in the 4.x release line and looking for areas where remote execution can reach parity with Spark Classic.
While the remote execution model elegantly decouples the client from the JVM, I am concerned about a performance regression in interactive and high-complexity workloads. Specifically, the current implementation of *Eager Analysis* (df.columns, df.schema, etc.) relies on synchronous gRPC round-trips that block the client thread. In environments with high network latency, these blocking calls create a "death by a thousand RPCs" bottleneck, often forcing developers to write suboptimal, Connect-specific code just to avoid metadata requests.

*Proposal:* Introduce a *Client-Side Metadata Skip-Layer (Lazy Prefetching)* within the Spark Connect protocol, built on three pillars (rough sketches of pillars 2 and 3 follow below my signature):

1. *Plan Piggybacking:* Allow the *SparkConnectService* to return the resolved schemas of relations alongside standard plan-execution responses.
2. *Local Schema Cache:* A configurable client-side cache in the *SparkSession* that stores resolved schemas.
3. *Batched Analysis API:* An extension to the *AnalyzePlan* protocol that resolves the schemas of multiple DataFrames in a single call.

Together, these changes would give Spark Connect the same "fluid" interactive experience as Spark Classic by removing the O(N) network-latency overhead of metadata-heavy operations.

I have drafted a full SPIP document for review, which includes the proposed changes to the *SparkConnectService* and *AnalyzePlan* handlers.

*SPIP Doc:* https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?usp=sharing

Before I finalize the JIRA: has there been any recent internal discussion of metadata prefetching or batched analysis requests on the current Spark Connect roadmap?

Regards,
Vaquar Khan
https://www.linkedin.com/in/vaquar-khan-b695577/
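
P.S. For concreteness, here is what the bottleneck looks like from the Python client today, and how the *Batched Analysis API* (pillar 3) could collapse it into a single round-trip. This is only a sketch: analyze_batch does not exist in the current client, and the connection string and paths are invented for illustration.

from pyspark.sql import SparkSession

# Remote session: every eager-analysis call below crosses the network.
spark = SparkSession.builder.remote("sc://gateway:15002").getOrCreate()

# Building the plans is lazy and issues no RPCs.
frames = [spark.read.parquet(f"/data/table_{i}") for i in range(100)]

# Today: 100 sequential, blocking AnalyzePlan round-trips (O(N) latency).
schemas = [df.schema for df in frames]

# With pillar 3 (hypothetical API): one batched AnalyzePlan call (O(1)).
# schemas = spark.client.analyze_batch([df._plan for df in frames])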
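
P.P.S. And a minimal sketch of the *Local Schema Cache* (pillar 2), assuming the client can fingerprint a serialized unresolved plan. The class name, keying scheme, and FIFO eviction are illustrative assumptions rather than the SPIP's final design; with pillar 1 in place, the same cache would also be populated from schemas piggybacked on ExecutePlan responses.

from typing import Optional

from pyspark.sql.types import StructType

class LocalSchemaCache:
    """Illustrative cache: serialized-plan fingerprint -> resolved schema."""

    def __init__(self, max_entries: int = 1024):
        self._entries: dict[bytes, StructType] = {}
        self._max_entries = max_entries

    def get(self, plan_fingerprint: bytes) -> Optional[StructType]:
        # A hit lets df.schema / df.columns skip the AnalyzePlan round-trip.
        return self._entries.get(plan_fingerprint)

    def put(self, plan_fingerprint: bytes, schema: StructType) -> None:
        if len(self._entries) >= self._max_entries:
            # Simple FIFO eviction; a real implementation would likely use LRU.
            self._entries.pop(next(iter(self._entries)))
        self._entries[plan_fingerprint] = schema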
