One thing we have tried, unsuccessfully, in the past is to separate abstract 
dependencies (which should be as flexible as possible) from concrete 
build/test dependencies (which should be pinned).
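
As an illustration of the split (the file names, packages, and pins below are 
hypothetical, not what Spark uses today), a pip-tools-style layout keeps the 
two kinds of dependency in separate files:

```
# requirements.in (abstract: flexible ranges, hand-maintained)
numpy>=1.21
pandas>=1.0.5

# requirements.txt (concrete: exact pins, generated by `pip-compile requirements.in`)
numpy==1.26.4
pandas==2.1.4
```

Only the abstract file is edited by hand; the pinned file is regenerated and 
committed whenever we decide to bump.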

There are many tools in the Python ecosystem today for automatically compiling 
and bumping concrete dependencies from abstract ones, ranging from 
conservative options (like pip-tools) that minimize changes to our dev process 
to more disruptive ones (like Poetry or uv) that require larger changes. But 
even a conservative approach requires significant effort, and in particular 
requires testing of the release process by committers, which cannot be 
delegated to CI. This is one of the main reasons my past attempt to solve this 
problem failed <https://github.com/apache/spark/pull/27928>.

Pinning concrete build/test dependencies is related to, but separate from, 
mitigating supply chain attacks. Still, it's probably a problem we should try 
to solve again, and doing so would make any future work on dependency 
management easier to carry out.

Last time I looked at this problem in 2024 
<https://www.mail-archive.com/[email protected]/msg31792.html>, I identified, 
as an illustrative example, at least 12 separate places where we specify numpy 
as a dependency. Of course, they were not consistent.
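
To make that inconsistency concrete, here is a small sketch (the file contents 
and the `numpy_specifiers` helper are hypothetical, for illustration only) of 
the kind of check that flags disagreeing specifiers:

```python
# Sketch: detect inconsistent version specifiers for a single package
# across several requirements-style files. The file contents below are
# invented; the real Spark repo spreads these across many files.
import re

# Match a line that is exactly a numpy requirement, e.g. "numpy>=1.21".
NUMPY_LINE = re.compile(r"^\s*numpy\s*([=<>!~]\S*)?\s*$", re.MULTILINE)

def numpy_specifiers(file_texts):
    """Return the distinct 'numpy...' requirement lines found in the inputs."""
    return {
        m.group(0).strip()
        for text in file_texts
        for m in NUMPY_LINE.finditer(text)
    }

# Two hypothetical requirements files that disagree:
specs = numpy_specifiers(["numpy>=1.15\npandas>=1.0\n", "numpy>=1.21\n"])
if len(specs) > 1:
    print("inconsistent numpy specifiers:", sorted(specs))
    # prints: inconsistent numpy specifiers: ['numpy>=1.15', 'numpy>=1.21']
```

A check like this could run in CI once the project agrees on a single source 
of truth to compare against.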

Solving this problem requires that 1) at least one committer shepherd the 
solution through, including some manual testing on their part, and 2) we, as a 
project, consciously separate abstract from concrete dependencies.

Nick


> On Mar 27, 2026, at 3:00 PM, Tian Gao via dev <[email protected]> wrote:
> 
> I have doubts about pinning pyspark dependencies. First of all, what do you 
> mean by "pinning"? Do we pin to a specific version, or set an upper bound? 
> I don't think any library should pin its dependencies to a specific version, 
> because that's doomed to conflict with others. Setting an upper bound also 
> has issues. For one, we limit users' ability to use the latest library 
> versions, which might contain new features they need. And we can still have 
> conflicts: the conflicting version doesn't have to be ancient, since a new 
> version could be released the next day.
> 
> More importantly, pinning versions does *not* make you more secure. Yes, I 
> know we are dealing with a supply chain crisis and everyone believes we'll be 
> fine if we don't upgrade our versions, but the more common security issue is 
> CVEs. Vulnerabilities are found every day (even faster now with LLMs), and 
> upgrading to the latest versions is important to keep everyone safe. If the 
> next big news is an attacker exploiting an existing CVE in old libraries, 
> the discussion will be: why do we pin versions? That's unsafe!
> 
> Pinning versions is a double-edged sword; it doesn't always make us more 
> secure. That's my major point.
> 
> So who should pin versions? I think the answer is services, not libraries. 
> We should not dictate which versions of other libraries users should use; 
> that's the job of the end service.
> 
> For Spark specifically: we should probably pin versions for our GitHub 
> Actions and testing environments, but not for pyspark itself. The major 
> problem (if we set an upper bound) is that we can't respond fast enough when 
> an old version of a dependency has an issue. We need to recognize the 
> issue, update the dependency list, and release a new version of Spark. 
> That's normally too slow (and too much work, if we must follow every 
> dependency) for a widely used library. 
> 
> `pyspark[pinned]` might be a way to do it, but pinned for which set of 
> extras? We have `pyspark[pandas_on_spark]`, `pyspark[connect]`, 
> `pyspark[ml]` ... Do we create a `pinned` variant for every optional 
> package? Are we able to keep them all updated?
> 
> Overall, I just don't believe it's a good idea for a library, any library, to 
> pin its dependency versions (especially to a specific version). That 
> introduces far more trouble than benefit. It would be interesting to see if 
> any widely used library does that.
> 
> Tian
> 
> On Fri, Mar 27, 2026 at 9:32 AM Holden Karau <[email protected] 
> <mailto:[email protected]>> wrote:
>> In the past few weeks we’ve seen multiple PyPI-published packages 
>> compromised. Thankfully none of them are PySpark dependencies, but it seems 
>> like we might want to consider doing something.
>> 
>> A downside of pinning all dependencies is a higher likelihood of conflicts 
>> if folks keep running old versions of PySpark for a long time. One 
>> possibility would be to make the pinned versions optional (e.g. 
>> pyspark[pinned]), or to publish a separate constraints file for people to 
>> optionally use with -c.
>> 
>> I’m wondering: do other folks share my concern here?
>> 
>> Twitter: https://twitter.com/holdenkarau
>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> Pronouns: she/her
