My 2c — this seems like three mostly unrelated proposals that should be separated out. Caching of schema information in the Spark Connect client seems uncontroversial (as long as the behavior is controllable / gated behind a flag) and, AFAICT, addresses your concerns.
Batch resolution is interesting and I can imagine use cases, but it would require new APIs (AFAICT) and changes to user logic, which doesn’t seem to solve your initial problem statement of performance degradation when migrating from Classic to Connect. Asynchronous resolution is a big can of worms that can fundamentally change the expected behavior of the APIs.

I think you will have more luck if you narrowly scope this proposal to just client-side caching.

On Fri, Jan 23, 2026 at 8:09 PM vaquar khan <[email protected]> wrote:

> Hi Herman,
>
> Sorry for the delay in getting back to you. I’ve finished the comprehensive benchmarking <https://docs.google.com/document/d/1ebX8CtTHN3Yf3AWxg7uttzaylxBLhEv-T94svhZg_uE/edit?tab=t.0> for the "*Death by 1000 RPCs*" bottleneck in Spark Connect and have updated the SPIP draft <https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0> and JIRA SPARK-55163 <https://issues.apache.org/jira/browse/SPARK-55163> ("Asynchronous Metadata Resolution & Lazy Prefetching for Spark Connect") with the findings.
>
> As we’ve discussed, the transition to the gRPC client-server model introduced a significant latency penalty for metadata-heavy workloads. My research into a Client-Side Metadata Skip-Layer, using a deterministic Plan ID strategy, shows that we can bypass these physical network constraints. The performance gains actually ended up exceeding our initial projections.
>
> *Here are the key results from the testing (conducted on Spark 4.0.0-preview):*
>
> - Baseline Latency Confirmed: We measured a consistent 277 ms latency for a single df.columns RPC call. Our analysis shows this is split roughly between Catalyst analysis (~27%) and network RTT/serialization (~23%).
>
> - The Uncached Bottleneck: For a sequence of 50 metadata checks—which is common in complex ETL loops or frameworks like Great Expectations—the uncached architecture resulted in 13.2 seconds of blocking overhead.
>
> - Performance with Caching: With the SPARK-45123 Plan ID caching enabled, that same 50-call sequence finished in just 0.25 seconds.
>
> - Speedup: This is a *51× speedup for 50 operations*, and my projections show this scaling to a *108× speedup for 100 operations*.
>
> - RPC Elimination: By exploiting DataFrame immutability and using Plan ID invalidation for correctness, we effectively eliminated 99% of metadata RPCs in these iterative flows.
>
> This essentially solves the "Shadow Schema" problem, where developers were being forced to manually track columns in local lists just to keep their notebooks responsive.
>
> Updated SPIP Draft: (https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0)
>
> Please take a look when you have a moment. If these results look solid to you, I’d like to move this toward a vote.
>
> Best regards,
>
> Viquar Khan
>
> On Wed, 7 Jan 2026 at 09:38, vaquar khan <[email protected]> wrote:
>
>> Hi Herman,
>>
>> I have enabled the comments and appreciate your feedback.
>>
>> Regards,
>> Vaquar Khan
>>
>> On Wed, 7 Jan 2026 at 07:53, Herman van Hovell via dev <[email protected]> wrote:
>>
>>> Hi Vaquar,
>>>
>>> Can you enable comments on the doc?
>>>
>>> In general I am not against making improvements in this area. However, the devil is very much in the details here.
>>>
>>> Cheers,
>>> Herman
>>>
>>> On Mon, Dec 29, 2025 at 1:15 PM vaquar khan <[email protected]> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I’ve been following the rapid maturation of *Spark Connect* in the 4.x release and have been identifying areas where remote execution can reach parity with Spark Classic.
>>>>
>>>> While the remote execution model elegantly decouples the client from the JVM, I am concerned about a performance regression in interactive and high-complexity workloads.
>>>>
>>>> Specifically, the current implementation of *Eager Analysis* (df.columns, df.schema, etc.) relies on synchronous gRPC round-trips that block the client thread. In environments with high network latency, these blocking calls create a "Death by 1000 RPCs" bottleneck—often forcing developers to write suboptimal, "Connect-specific" code to avoid metadata requests.
>>>>
>>>> *Proposal*:
>>>>
>>>> I propose we introduce a Client-Side Metadata Skip-Layer (Lazy Prefetching) within the Spark Connect protocol. Key pillars include:
>>>>
>>>> 1. *Plan-Piggybacking:* Allowing the *SparkConnectService* to return resolved schemas of relations during standard plan execution.
>>>>
>>>> 2. *Local Schema Cache:* A configurable client-side cache in the *SparkSession* to store resolved schemas.
>>>>
>>>> 3. *Batched Analysis API:* An extension to the *AnalyzePlan* protocol to allow schema resolution for multiple DataFrames in a single batch call.
>>>>
>>>> This shift would ensure that Spark Connect provides the same "fluid" interactive experience as Spark Classic, removing the O(N) network latency overhead for metadata-heavy operations.
>>>>
>>>> I have drafted a full SPIP document ready for review, which includes the proposed changes for the *SparkConnectService* and *AnalyzePlan* handlers.
>>>>
>>>> *SPIP Doc:*
>>>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?usp=sharing
>>>>
>>>> Before I finalize the JIRA, has there been any recent internal discussion regarding metadata prefetching or batching analysis requests in the current Spark Connect roadmap?
>>>>
>>>> Regards,
>>>> Vaquar Khan
>>>> https://www.linkedin.com/in/vaquar-khan-b695577/
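
For readers less familiar with the access pattern discussed above, the following is a minimal PySpark sketch of a metadata-heavy loop together with the "Shadow Schema" workaround mentioned in the thread. The connect URL and column names are placeholders and the loop is illustrative only, not the benchmark harness from the SPIP; the point is simply that each df.columns access on a fresh plan is resolved on the server, whereas the local set lookup is not.

    # Illustrative sketch: a metadata-heavy ETL-style loop under Spark Connect.
    # Each df.columns access on an uncached plan triggers a synchronous schema
    # resolution round trip, so N checks cost roughly N RPCs without caching.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()  # placeholder URL
    df = spark.range(10)

    for i in range(50):
        col_name = f"feature_{i}"                 # hypothetical column names
        if col_name not in df.columns:            # metadata lookup against the server
            df = df.withColumn(col_name, F.lit(0))

    # The "Shadow Schema" workaround: track columns in a local collection so the
    # loop never has to ask the server for the schema again.
    known_cols = set(df.columns)                  # a single metadata lookup up front
    for i in range(50, 100):
        col_name = f"feature_{i}"
        if col_name not in known_cols:            # local lookup, no round trip
            df = df.withColumn(col_name, F.lit(0))
            known_cols.add(col_name)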

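And for the client-side caching piece that the reply above calls uncontroversial, here is a minimal sketch of what a flag-gated, plan-keyed schema cache might look like from the client's point of view. The config key, the SchemaCache helper, and the use of id(df) as a stand-in for a deterministic plan ID are hypothetical illustrations for discussion, not existing Spark Connect APIs; an actual design would live inside the client and follow whatever invalidation rules the SPIP settles on.

    # Hypothetical sketch of a flag-gated, client-side schema cache.
    # Because DataFrames are immutable, a given plan cannot resolve to two
    # different schemas within a session, so a plan-keyed cache is safe as long
    # as session-level changes (e.g. a replaced view) invalidate it.
    from pyspark.sql import SparkSession

    SCHEMA_CACHE_FLAG = "spark.connect.client.schemaCache.enabled"  # hypothetical flag name

    class SchemaCache:
        def __init__(self, enabled=True):
            self._enabled = enabled
            self._by_plan = {}

        def schema_of(self, df):
            if not self._enabled:
                return df.schema                  # always resolved on the server
            key = id(df)                          # stand-in for a deterministic plan ID
            if key not in self._by_plan:
                self._by_plan[key] = df.schema    # resolved once on the server
            return self._by_plan[key]             # later lookups stay local

        def invalidate(self):
            self._by_plan.clear()

    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()  # placeholder URL
    cache = SchemaCache(enabled=True)             # a real design would read SCHEMA_CACHE_FLAG
    df = spark.range(10)
    print(cache.schema_of(df))                    # first call pays the round trip
    print(cache.schema_of(df))                    # served from the local cache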