I have doubts about pinning PySpark dependencies. First of all, what do you
mean by "pinning"? Do we pin to a specific version, or set an upper bound?
I don't think any library should pin its dependencies to a specific
version, because that's doomed to conflict with others. Setting an upper
bound also has issues. For one, we limit the users' ability to use the
latest library versions, which might contain new features they need. And we
can still have conflicts: the conflicting version doesn't have to be super
old, it could be released the very next day.
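To make the pin-versus-upper-bound distinction concrete, here is a small sketch using the `packaging` library (the one pip itself builds on); the package versions are illustrative, not anyone's real dependency list:

```python
from packaging.specifiers import SpecifierSet

# An exact pin ("==") accepts exactly one version.
pin = SpecifierSet("==1.26.4")
# An upper bound accepts a range but rejects every future release past the cap.
capped = SpecifierSet(">=1.5,<2.0")

print(pin.contains("1.26.4"))    # True
print(pin.contains("1.26.5"))    # False: even a patch release conflicts
print(capped.contains("1.9.3"))  # True
print(capped.contains("2.0.0"))  # False: blocked until we cut a new release
```

Either way, the specifier (not the user) decides which versions are reachable, which is exactly the flexibility a library gives up.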

More importantly, pinning versions does *not* make you more secure. Yes, I
know we are dealing with a supply-chain crisis, and everyone believes we'll
be fine if we just don't upgrade our versions, but the more common security
issue is unpatched CVEs. Vulnerabilities are found every day (even faster
now with LLMs), and upgrading to the latest versions is important to keep
everyone safe. If the next big news story is an attacker exploiting an
existing CVE in old library versions, the discussion will become: why do we
pin versions? That's unsafe!

Pinning versions is a double-edged sword; it doesn't always make us more
secure. That's my major point.

So who should pin versions? I think the answer is services, not libraries.
We should not dictate which versions of other libraries our users must use;
that's the job of the end service.
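As a sketch of what that looks like on the service side (file names are illustrative, and this assumes pip is available): the service freezes the exact versions it tested against and applies them as constraints at deploy time, with no pinning needed inside the libraries themselves:

```shell
# Capture the exact versions of everything in the tested environment.
python3 -m pip freeze > constraints.txt

# Hypothetical top-level requirement list for the service.
echo "packaging" > requirements.txt

# At deploy time, install the loose requirements but hold every
# dependency to the frozen, tested versions.
python3 -m pip install -r requirements.txt -c constraints.txt
```

The service owns its lockstep this way, and can refresh the constraints file on its own schedule when a CVE lands.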

For Spark specifically: we should probably pin versions for our GitHub
Actions and testing environments, but not for PySpark itself. The major
problem (if we set an upper bound) is that we can't respond fast enough
when an old version of a dependency has an issue. We would need to
recognize the issue, update the dependency list, and release a new version
of Spark. That's normally too slow (and too much work, if we must track
every dependency) for a widely used library.

`pyspark[pinned]` might be a way to do it, but pinned for which extra? We
have `pyspark[pandas_on_spark]`, `pyspark[connect]`, `pyspark[ml]`, and so
on. Do we create a pinned variant for every optional package? Are we able
to keep them all up to date?
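For illustration only, a pinned extra would have to look something like the mapping below in the packaging metadata; the extra names come from the existing PySpark extras, but the dependency names and versions are made up:

```python
# Hypothetical extras_require mapping (versions are illustrative only).
extras_require = {
    "connect": ["grpcio>=1.48"],  # regular extras use lower bounds
    "ml": ["numpy>=1.21"],
    "pinned": [
        "numpy==1.26.4",          # exact pins: every entry goes stale
        "pandas==2.1.4",          # the day upstream releases a fix
    ],
}

# Every pinned entry is a standing maintenance commitment.
print(len(extras_require["pinned"]))  # 2
```

And that is just one `pinned` list; a pinned variant per optional package multiplies the entries we'd have to keep current on every release.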

Overall, I just don't believe it's a good idea for a library, any library,
to pin its dependency versions (especially to specific versions). That
introduces far more trouble than benefit. It would be interesting to see
whether any widely used library actually does that.

Tian

On Fri, Mar 27, 2026 at 9:32 AM Holden Karau <[email protected]> wrote:

> In the past few weeks we’ve seen multiple PyPi published packages
> compromised, thankfully none of them are PySpark dependencies, but it seems
> like we might want to consider something.
>
> Downside of pinning all dependencies is a higher likelihood of conflicts
> if folks keep running old versions of PySpark for a long time. One
> possibility would be to make the pinned version optional (eg
> pyspark[pinned]) or publish a separate constraints file for people to
> optionally use with -c?
>
> I’m wondering do other folks share my concern here?
>
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
