Hi everyone,

Following up on the discussion regarding the narrowed scope (Phase 1: Client-Side Plan-ID Caching).
*To recap,* I have updated the SPIP to address the feedback:

- *Scope:* Narrowed strictly to client-side caching (per Erik's suggestion).
- *Technical logic:* Pivoted the caching strategy to target schema-mutating transformations (e.g., select, withColumn) rather than filter/limit, as verified by Ruifeng's findings.

If there are no further technical questions or objections regarding this updated scope, I plan to start the formal voting thread next Tuesday (Feb 17).

Thanks again to Herman, Erik, and Ruifeng for the discussion here, and to Holden for their offline guidance and support.

Regards,
Vaquar Khan
https://www.linkedin.com/in/vaquar-khan-b695577/

On Mon, 9 Feb 2026 at 21:38, vaquar khan <[email protected]> wrote:

> Hi Erik, Ruifeng, Herman,
>
> Thanks for the suggestion to narrow the scope; it helped focus the design on a stable Phase 1.
>
> Ruifeng, I've updated the doc to clarify the distinction between existing schema propagation and the structural RPC bottleneck in mutating transformations.
>
> Herman, would you be open to formally shepherding this SPIP toward a vote? I'd like to target the upcoming 4.x releases if possible.
>
> Updated SPIP:
> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0
>
> Regards,
> Vaquar Khan
>
> On Sun, 8 Feb 2026 at 11:03, vaquar khan <[email protected]> wrote:
>
>> Hi Ruifeng,
>>
>> You are correct regarding filter and limit—I verified in dataframe.py that these operators do propagate _cached_schema correctly. Thanks for flagging that. However, this investigation helped isolate the actual structural bottleneck: schema-mutating transformations.
>>
>> Operations like select, withColumn, and join fundamentally alter the plan structure and cannot use simple instance propagation. Currently, a loop executing df.select(...) forces a blocking 277 ms RPC on every iteration because the client treats each new DataFrame instance as a cold start.
>>
>> This is where the Plan-ID architecture is essential. By hashing the unresolved plan, we can detect that select("col") produces a deterministic schema, even across different DataFrame instances.
>>
>> I've updated the SPIP to strictly target these unoptimized schema-mutating workloads. This SPIP is critical for interactive performance in data quality and ETL frameworks.
>>
>> Updated doc:
>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0
>>
>> Regards,
>> Vaquar Khan
>> https://www.linkedin.com/in/vaquar-khan-b695577/
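For illustration, the Plan-ID idea in the message above boils down to keying a client-side schema cache on a deterministic hash of the unresolved plan rather than on the DataFrame instance. The following is a minimal, self-contained Python sketch only; none of these names (`plan_key`, `get_schema`, `analyze_rpc`) exist in Spark Connect, and the real cache would live inside pyspark.sql.connect, with `analyze_rpc` standing in for the ~277 ms AnalyzePlan round trip:

import hashlib
import json
from typing import Callable, Dict

# Toy client-side cache: plan-hash -> schema (represented here as a DDL string).
_schema_cache: Dict[str, str] = {}

def plan_key(plan: dict) -> str:
    # Stand-in for deterministically hashing the unresolved Connect plan.
    return hashlib.sha256(json.dumps(plan, sort_keys=True).encode()).hexdigest()

def get_schema(plan: dict, analyze_rpc: Callable[[dict], str]) -> str:
    # Pay the AnalyzePlan round trip only the first time a given plan is seen;
    # two different DataFrame instances with identical plans share one entry.
    key = plan_key(plan)
    if key not in _schema_cache:
        _schema_cache[key] = analyze_rpc(plan)
    return _schema_cache[key]

The only point of the sketch is that select("col") over the same parent always hashes to the same key, so repeated metadata lookups across distinct DataFrame instances collapse into a single RPC.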
>> On Sat, 7 Feb 2026 at 22:49, Ruifeng Zheng <[email protected]> wrote:
>>
>>> Hi Vaquar,
>>>
>>> > every time a user does something like .filter() or .limit(), it creates a new DataFrame instance with an empty cache. This forces a fresh 277 ms AnalyzePlan RPC even if the schema is exactly the same as the parent.
>>>
>>> Is this true? I think the cached schema is already propagated in operators like `filter` and `limit`, see
>>>
>>> https://github.com/apache/spark/blob/43529889f24011c3df1d308e8b673967818b7c33/python/pyspark/sql/connect/dataframe.py#L560-L567
>>> https://github.com/apache/spark/blob/43529889f24011c3df1d308e8b673967818b7c33/python/pyspark/sql/connect/dataframe.py#L792-L795
>>>
>>> On Sun, Feb 8, 2026 at 4:44 AM vaquar khan <[email protected]> wrote:
>>>
>>>> Hi Erik and Herman,
>>>>
>>>> Thanks for the feedback on narrowing the scope. I have updated the SPIP (SPARK-55163 <https://issues.apache.org/jira/browse/SPARK-55163>) to focus strictly on Phase 1: Client-Side Plan-ID Caching.
>>>>
>>>> I spent some time looking at the pyspark.sql.connect client code and found that while there is already a cache check in dataframe.py:1898, it is strictly instance-bound. This explains the "Death by 1000 RPCs" bottleneck we are seeing: every time a user does something like .filter() or .limit(), it creates a new DataFrame instance with an empty cache. This forces a fresh 277 ms AnalyzePlan RPC even if the schema is exactly the same as the parent.
>>>>
>>>> In my testing on Spark 4.0.0-preview, a sequence of 50 metadata calls on derived DataFrames took 13.2 seconds. With the proposed Plan-ID cache, that same sequence dropped to 0.25 seconds—a 51x speedup.
>>>>
>>>> By focusing only on this caching layer, we can solve the primary performance issue with zero protocol changes and no impact on the user-facing API. I've moved the more complex ideas, like background asynchronicity—which Erik noted as a "can of worms" regarding consistency—to a future work section to keep this Phase 1 focused and safe.
>>>>
>>>> Updated SPIP:
>>>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?usp=sharing
>>>>
>>>> I would appreciate it if you could take a look at this narrowed version. Is anyone from the PMC open to shepherding this Phase 1?
>>>>
>>>> Regards,
>>>> Vaquar Khan
>>>> https://www.linkedin.com/in/vaquar-khan-b695577/
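To make the 50-metadata-call scenario above concrete, here is roughly the shape of the loop being described; this is an illustrative reconstruction of the pattern, not the actual benchmark script from the linked doc, and the Connect endpoint is a placeholder:

import time
from pyspark.sql import SparkSession

# Placeholder Connect endpoint; any Spark Connect server behaves the same way here.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
df = spark.range(1000)

start = time.perf_counter()
for _ in range(50):
    # The same select(...) re-derived each iteration yields a brand-new
    # DataFrame instance with an empty instance-level cache, so today each
    # .columns access pays a fresh AnalyzePlan round trip even though the
    # plan (and therefore the schema) never changes.
    derived = df.select("id")
    cols = derived.columns
print(f"50 metadata calls took {time.perf_counter() - start:.2f}s")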
>>>> On Sun, 25 Jan 2026 at 10:53, Erik Krogen <[email protected]> wrote:
>>>>
>>>>> My 2c — this seems like 3 mostly unrelated proposals that should be separated out. Caching of schema information in the Spark Connect client seems uncontroversial (as long as the behavior is controllable / gated behind a flag) and, AFAICT, addresses your concerns.
>>>>>
>>>>> Batch resolution is interesting and I can imagine use cases, but it would require new APIs (AFAICT) and user logic changes, which doesn't seem to solve your initial problem statement of performance degradation when migrating from Classic to Connect.
>>>>>
>>>>> Asynchronous resolution is a big can of worms that can fundamentally change the expected behavior of the APIs.
>>>>>
>>>>> I think you will have more luck if you narrowly scope this proposal to just client-side caching.
>>>>>
>>>>> On Fri, Jan 23, 2026 at 8:09 PM vaquar khan <[email protected]> wrote:
>>>>>
>>>>>> Hi Herman,
>>>>>>
>>>>>> Sorry for the delay in getting back to you. I've finished the comprehensive benchmarking <https://docs.google.com/document/d/1ebX8CtTHN3Yf3AWxg7uttzaylxBLhEv-T94svhZg_uE/edit?tab=t.0> for the "*Death by 1000 RPCs*" bottleneck in Spark Connect and have updated the SPIP draft <https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0> and JIRA SPARK-55163 <https://issues.apache.org/jira/browse/SPARK-55163> ("Asynchronous Metadata Resolution & Lazy Prefetching for Spark Connect") with the findings.
>>>>>>
>>>>>> As we've discussed, the transition to the gRPC client-server model introduced a significant latency penalty for metadata-heavy workloads. My research into a Client-Side Metadata Skip-Layer, using a deterministic Plan ID strategy, shows that we can bypass these physical network constraints. The performance gains ended up exceeding our initial projections.
>>>>>>
>>>>>> *Here are the key results from the testing (conducted on Spark 4.0.0-preview):*
>>>>>>
>>>>>> - Baseline Latency Confirmed: We measured a consistent 277 ms latency for a single df.columns RPC call. Our analysis shows this is split roughly between Catalyst analysis (~27%) and network RTT/serialization (~23%).
>>>>>>
>>>>>> - The Uncached Bottleneck: For a sequence of 50 metadata checks—which is common in complex ETL loops or frameworks like Great Expectations—the uncached architecture resulted in 13.2 seconds of blocking overhead.
>>>>>>
>>>>>> - Performance with Caching: With the SPARK-45123 Plan ID caching enabled, that same 50-call sequence finished in just 0.25 seconds.
>>>>>>
>>>>>> - Speedup: This is a *51× speedup for 50 operations*, and my projections show this scaling to a *108× speedup for 100 operations*.
>>>>>>
>>>>>> - RPC Elimination: By exploiting DataFrame immutability and using Plan ID invalidation for correctness, we effectively eliminated 99% of metadata RPCs in these iterative flows.
>>>>>>
>>>>>> This essentially solves the "Shadow Schema" problem, where developers were forced to manually track columns in local lists just to keep their notebooks responsive.
>>>>>>
>>>>>> Updated SPIP Draft:
>>>>>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0
>>>>>>
>>>>>> Please take a look when you have a moment. If these results look solid to you, I'd like to move this toward a vote.
>>>>>>
>>>>>> Best regards,
>>>>>> Vaquar Khan
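As an illustration of the "Shadow Schema" workaround mentioned above, the contrast looks roughly like the hypothetical user code below (the endpoint, table, and column names are made up; the comments only restate the claims from the thread):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()  # placeholder endpoint
df = spark.range(100).withColumn("amount", F.col("id") * 10)

# Today's workaround: mirror the schema in a local list so the notebook never
# has to ask the server again.
known_columns = ["id", "amount"]
df_usd = df.withColumn("amount_usd", F.col("amount") * 0.92)
known_columns.append("amount_usd")  # easy to drift out of sync with df_usd

# With the proposed client-side Plan-ID cache, the straightforward version
# becomes cheap again: metadata checks on derived frames like df_usd would be
# served from the plan-keyed cache instead of paying a fresh ~277 ms
# AnalyzePlan call for every new instance.
if "amount_usd" in df_usd.columns:
    print("column present")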
>>>>>> On Wed, 7 Jan 2026 at 09:38, vaquar khan <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Herman,
>>>>>>>
>>>>>>> I have enabled comments and would appreciate your feedback.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Vaquar Khan
>>>>>>>
>>>>>>> On Wed, 7 Jan 2026 at 07:53, Herman van Hovell via dev <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Vaquar,
>>>>>>>>
>>>>>>>> Can you enable comments on the doc?
>>>>>>>>
>>>>>>>> In general I am not against making improvements in this area. However, the devil is very much in the details here.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Herman
>>>>>>>>
>>>>>>>> On Mon, Dec 29, 2025 at 1:15 PM vaquar khan <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> I've been following the rapid maturation of *Spark Connect* in the 4.x releases and have been identifying areas where remote execution can reach parity with Spark Classic.
>>>>>>>>>
>>>>>>>>> While the remote execution model elegantly decouples the client from the JVM, I am concerned about a performance regression in interactive and high-complexity workloads.
>>>>>>>>>
>>>>>>>>> Specifically, the current implementation of *Eager Analysis* (df.columns, df.schema, etc.) relies on synchronous gRPC round-trips that block the client thread. In environments with high network latency, these blocking calls create a "Death by 1000 RPCs" bottleneck—often forcing developers to write suboptimal, "Connect-specific" code to avoid metadata requests.
>>>>>>>>>
>>>>>>>>> *Proposal:*
>>>>>>>>>
>>>>>>>>> I propose we introduce a Client-Side Metadata Skip-Layer (Lazy Prefetching) within the Spark Connect protocol. Key pillars include:
>>>>>>>>>
>>>>>>>>> 1. *Plan-Piggybacking:* Allowing the *SparkConnectService* to return resolved schemas of relations during standard plan execution.
>>>>>>>>>
>>>>>>>>> 2. *Local Schema Cache:* A configurable client-side cache in the *SparkSession* to store resolved schemas.
>>>>>>>>>
>>>>>>>>> 3. *Batched Analysis API:* An extension to the *AnalyzePlan* protocol to allow schema resolution for multiple DataFrames in a single batch call.
>>>>>>>>>
>>>>>>>>> This shift would ensure that Spark Connect provides the same "fluid" interactive experience as Spark Classic, removing the O(N) network latency overhead for metadata-heavy operations.
>>>>>>>>>
>>>>>>>>> I have drafted a full SPIP document, ready for review, which includes the proposed changes for the *SparkConnectService* and *AnalyzePlan* handlers.
>>>>>>>>>
>>>>>>>>> *SPIP Doc:*
>>>>>>>>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?usp=sharing
>>>>>>>>>
>>>>>>>>> Before I finalize the JIRA, has there been any recent internal discussion regarding metadata prefetching or batching analysis requests in the current Spark Connect roadmap?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Vaquar Khan
>>>>>>>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Vaquar Khan
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Vaquar Khan
>>>>
>>>> --
>>>> Regards,
>>>> Vaquar Khan
>>>
>>> --
>>> Ruifeng Zheng
>>> E-mail: [email protected]
>>
>> --
>> Regards,
>> Vaquar Khan
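Finally, a purely hypothetical sketch of pillar 3 of the original proposal above (the Batched Analysis API), showing only the intent of resolving several schemas in one round trip; nothing named `analyze_batch` exists in Spark today, and the real shape would come out of the *AnalyzePlan* protocol changes described in the SPIP:

from typing import Dict, List

from pyspark.sql import DataFrame
from pyspark.sql.types import StructType

def analyze_batch(dataframes: List[DataFrame]) -> Dict[int, StructType]:
    # Hypothetical helper: ship the unresolved plans of all DataFrames in a
    # single AnalyzePlan-style request and return schemas keyed by plan id,
    # turning N metadata round trips into one.
    raise NotImplementedError("proposal sketch only")

# Intended usage (one RPC for a whole ETL step instead of one per DataFrame):
#   schemas = analyze_batch([bronze_df, silver_df, gold_df])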
