I have mixed feelings about starting a persistent server, perhaps
unexpectedly, especially since if it gets into a bad state (say, the driver
JVM thrashing on GC) a reasonable user might restart their Python process
and expect that to also kill the server (as it has done until now).

I’m not saying I’m against this yet, just that there are some downsides to
changing a default like this, and we’d probably want to be careful with
messaging to users so they don’t get stuck or surprised (and give clear
instructions when connecting to an existing server, etc.).
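
As a rough illustration of the messaging concern, here is a minimal sketch
(not a proposed implementation) of a client-side check that tells the user
whether they are attaching to an already-running server. The helper names
is_port_open and choose_remote are hypothetical, and port 15002 is only an
assumed default for the Connect server; the sketch uses nothing beyond the
Python standard library.

```python
import socket

# Assumption: the Connect server listens on this port; adjust as needed.
ASSUMED_CONNECT_PORT = 15002


def is_port_open(port: int, host: str = "127.0.0.1", timeout: float = 0.2) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def choose_remote(port: int = ASSUMED_CONNECT_PORT) -> tuple[str, str]:
    """Hypothetical helper: pick a remote string plus a user-facing message."""
    if is_port_open(port):
        remote = f"sc://localhost:{port}"
        message = (
            f"Connecting to an existing Connect server at {remote}. "
            "It keeps running after this Python process exits, so "
            "restarting Python will not reset it."
        )
    else:
        remote = "local[*]"
        message = "No running Connect server found; starting a local session."
    return remote, message


if __name__ == "__main__":
    remote, message = choose_remote()
    print(message)
    # The remote string could then be passed to SparkSession.builder.remote(remote).
```

The point is only that the "existing server" path should be visibly
different from the "fresh start" path, so a user with a wedged server knows
that restarting Python alone will not help.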


Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
<https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her

On Sun, Jun 21, 2026 at 2:36 PM serge rielau.com <[email protected]> wrote:

> Oh, yes please!
>
>
> On Jun 21, 2026, at 7:57 AM, Nicholas Chammas <[email protected]>
> wrote:
>
> A recent SPIP
> <https://docs.google.com/document/d/1Nphejrf_vh4YRECn0JPgKClqxDS_lB6wufZFJQxyY98/>
> proposed to improve Spark’s performance on small and local datasets. On
> that SPIP I raised a related issue
> <https://docs.google.com/document/d/1Nphejrf_vh4YRECn0JPgKClqxDS_lB6wufZFJQxyY98/edit?disco=AAAB5rOuVBw>
> that I would like to surface here: the time it takes to create a Spark
> session locally.
>
> import time
> from pyspark.sql import SparkSession
>
> start = time.perf_counter()
> session = (
>     SparkSession.builder.remote("local[*]")
>     .getOrCreate()
> )
> elapsed = time.perf_counter() - start
> print(f"SparkSession startup: {elapsed:.3f}s")
>
> On my M2 MacBook this consistently takes ~3 seconds.
>
> If you’re working on an application that uses Spark and have a local
> dev/test loop setup, every loop will incur this startup cost. This makes
> the entire experience feel incredibly sluggish.
>
> A straightforward solution is to start a persistent Connect server using
> sbin/start-connect-server.sh and set your remote to sc://localhost:{port}.
> In my testing, this cuts the startup time from ~3 seconds to <1 second.
>
> This is good, but as a solution it has some problems:
>
>    1. It’s not discoverable. Users are unlikely to figure this out by
>    themselves.
>    2. It’s not the default behavior. Users ideally should not have
>    anything to figure out at all. It should just work like this in the
>    background.
>    3. A background process needs some tooling to help manage.
>
> I think we can address these problems by doing something like this:
>
>    1. Make .remote("local") create a persistent Connect server in the
>    background by default, and restart/reconnect to it as needed.
>    2. Add a basic CLI to manage the Connect server, like spark connect
>    {start | stop | show}. This CLI can perhaps just be a wrapper around
>    the scripts in sbin/.
>
> There are some details to figure out related to server idle timeouts,
> server discoverability, etc. But before exploring this further with a
> prototype, I wanted to get a reaction from the list.
>
> What do you think?
>
> Nick
>
>
>
