On October 4, 2015 at 2:22:51 PM, Nathaniel Smith ([email protected]) wrote:
>
> I guess to make progress in this conversation I need some more
> detailed explanations. I totally get that there's a long history
> of thought and conversations behind the various assertions
> here like "a sdist is fundamentally different from a VCS checkout",
> "there must be a 1-1 mapping between sdists and wheels", "pip
> needs sdists that have full wheel metadata in static form", and
> I'm barging in from the outside with no context, but I literally
> have no idea why the specific design features you're asking for
> are desirable or even viable. Right now if I were to try and write
> the PEP you're asking for, then the rationale section would just
> be "because Donald said so" over and over :-). I couldn't write
> the motivation section, because I don't know any problems that
> the PEP you're describing would fix for me as a package author
> (which doesn't mean they don't exist, but!).

I don't mind going into more details! I'll do the things you specifically mentioned, and then if there are other things, feel free to bring them up too. I should also mention that these are my opinions from my experiences with the toolchain and ecosystem; others may agree or disagree with me. I have strong opinions, but that doesn't make them immutable laws of the universe, although "because Donald said so" sounds like a pretty good answer to me ;)

"a sdist is fundamentally different from a VCS checkout"

This one I have a hard time trying to explain. They are focused on different things. With an sdist you need to have a project name, a version, a list of files, things like that. The use cases and needs for each "phase" are different. For instance, in a VCS checkout you can derive the list of files or the version by asking the VCS, but a sdist doesn't have a VCS so it has to have that baked into it. A more C-centric example is that you often have something like autogen.sh in a C project's VCS, but you don't have the output of that checked into the VCS; however, when you prepare a tarball for distribution you run autogen.sh and then include the output there.

There are other differences too. In a VCS checkout we don't really need the ability to statically read any metadata except for build dependencies and how to invoke the build tool. Most everything else can be dynamically configured because you're not distributing that. However, in a sdist we need as much of the metadata to be static as possible. Something like PyPI needs to be able to inspect any of the files uploaded to it (sdists, wheels, etc.) for certain information, and anything that can't be statically and safely read from them might as well not even exist as far as PyPI is concerned.

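To illustrate what "statically and safely read" can mean in practice, here is a minimal sketch that pulls the name and version out of an sdist's PKG-INFO file without running any of the project's code. The helper names and the in-memory example project "foo" are hypothetical, and it assumes the conventional layout where PKG-INFO sits at ``<name>-<version>/PKG-INFO`` inside a gzipped tarball; it is not how PyPI itself is implemented.

```python
import io
import tarfile
from email.parser import Parser


def read_sdist_metadata(fileobj):
    """Return (name, version) from the top-level PKG-INFO of an sdist tarball.

    Nothing from the archive is executed; PKG-INFO is parsed as
    RFC 822 style headers.
    """
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar.getmembers():
            # Assumed layout: <name>-<version>/PKG-INFO
            parts = member.name.split("/")
            if len(parts) == 2 and parts[1] == "PKG-INFO":
                raw = tar.extractfile(member).read().decode("utf-8")
                headers = Parser().parsestr(raw)
                return headers["Name"], headers["Version"]
    raise ValueError("no top-level PKG-INFO found")


def make_example_sdist():
    """Build a tiny in-memory sdist for a hypothetical project 'foo'."""
    pkg_info = b"Metadata-Version: 1.1\nName: foo\nVersion: 1.0\n"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo("foo-1.0/PKG-INFO")
        info.size = len(pkg_info)
        tar.addfile(info, io.BytesIO(pkg_info))
    buf.seek(0)
    return buf


print(read_sdist_metadata(make_example_sdist()))  # ('foo', '1.0')
```

Anything that only exists as the result of running ``setup.py`` is invisible to a reader like this, which is the sense in which it "might as well not even exist".
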
We currently have the situation where we have a single file that is used for all phases of the process: dev (``setup.py develop`` & ``setup.py sdist``), building of a wheel (``setup.py bdist_wheel``) and even installation sometimes (``setup.py install``). Throughout this there are a lot of common problems where some author tried to optimize their ``setup.py`` for their development use cases and broke it for the other cases. An example of this is version handling, where it's not unusual for someone's first foray into deduplicating the version to involve importing their thing (which works fine on their machine) and passing it into the setup kwargs. This simple thing would generally work just fine if ``setup.py sdist`` produced static metadata and ``setup.py`` was no longer being used after that point.

This also becomes expressed in what interfaces you give to the toolchain at each "phase". It's important for something inside of a VCS checkout to be able to be written by human beings. This leads to wanting to use formats like INI (which is ugly) or something like TOML or YAML or some other nice, human-friendly format. These formats are great for humans to write and for humans to read, but are not particularly great as data interchange formats. Formats like JSON, msgpack, etc. are far better for data interchange when computers talk to other computers, but are not great for humans to write, edit, or even really read in many cases.

If we go back to distutils2, you can see this effect happening there. They had two similar keyword arguments in their setup.cfg, description and description-file. These both did the same thing, but just pulled from different sources (inline or via a file), forcing every tool in the chain to support both of these options, even though it could have easily made an sdist that was distinct from the VCS code and simplified the code there.

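The version-handling pitfall mentioned above can be sketched as follows. Importing the package from ``setup.py`` executes its top-level imports, which fails on any machine where its dependencies are not installed yet. One workaround (not something proposed in this thread, just a common way to avoid the import) is to read the ``__version__`` assignment from the source text. The ``read_version`` helper and the example module source are hypothetical.

```python
import ast


def read_version(source):
    """Find a top-level `__version__ = <literal>` without importing anything."""
    tree = ast.parse(source)  # parses only; nothing in `source` is executed
    for node in tree.body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == "__version__":
                    return ast.literal_eval(node.value)
    raise ValueError("no __version__ assignment found")


# Hypothetical package source: importing it would fail wherever the
# dependency is missing, but parsing it does not.
example_source = '''
import some_dependency_not_installed_yet

__version__ = "1.0"
'''

print(read_version(example_source))  # 1.0
```

With static metadata in the sdist, this kind of care would only be needed once, when the sdist is created, rather than on every machine that installs it.
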
I see the blurring of lines between the various phases of a package as one of the fundamental flaws of distutils and setuptools.

"there must be a 1-1 mapping between sdists and wheels"

This has technical and social reasons. On the technical side, the 1-1 mapping between sdists and wheels (and all other bdists) is an assumption baked into all of the tools. From PyPI's enforcement mechanisms, to pip's caching, to things like devpi and such, breaking this assumption will break a lot of code. This is all code, and code is not immutable, so we could of course change that; however, we wouldn't be able to rely on the fact that we've fixed that assumption for many years (probably at least 5 at the earliest, 10+ is more likely).

The social side is a bit more interesting though. In Debian, end users almost *never* actually interact with source packages, and nearly 100% of the time they are interacting solely with built packages (in fact, unlike Python, you have to manually build a deb before you can even attempt to install something). There really aren't "source packages" in Debian, just sources that happen to produce a Debian package. In Python land, a source package is still a package and people have expectations around that. I think people would be very confused if a sdist "foo-1.0.tar.gz" could produce a wheel "bar-3.0.whl".

In addition, systems like Debian don't really try to protect against a malicious DD at all. Things like "prevent foo from claiming to be bar" are enforced via societal conventions and the fact that it is not an open repo and there are gatekeepers keeping everything in place. On the flip side, we let anyone upload to PyPI and rely on things like ACLs to secure things. This means that we need to know ahead of time what names a package is going to produce. The simplest mechanism for this is to enforce a 1:1 mapping between sdist and wheel, because that is an immutable property and easy to understand.

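As a rough illustration of the 1:1 assumption, the sketch below checks from filenames alone whether a wheel could have come from a given sdist. It is a simplification and not the actual logic in PyPI, pip, or devpi: the function names are hypothetical, versions are compared as plain strings, and names are normalized the way PEP 503 describes (runs of ``-``, ``_`` and ``.`` collapsed, lowercased).

```python
import re


def normalize(name):
    """Normalize a project name (PEP 503 style)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def sdist_name_version(filename):
    """Split '<name>-<version>.tar.gz' into (normalized name, version)."""
    if not filename.endswith(".tar.gz"):
        raise ValueError("not a .tar.gz sdist: %r" % filename)
    stem = filename[: -len(".tar.gz")]
    # The name itself may contain '-', so split on the last one.
    name, version = stem.rsplit("-", 1)
    return normalize(name), version


def wheel_name_version(filename):
    """Split '<name>-<version>-<tags...>.whl' into (normalized name, version)."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel: %r" % filename)
    stem = filename[: -len(".whl")]
    # Wheel filenames escape '-' inside the name, so the first two
    # '-'-separated fields are always name and version.
    name, version = stem.split("-")[:2]
    return normalize(name), version


def wheel_matches_sdist(sdist, wheel):
    """True if the wheel carries the same (name, version) as the sdist."""
    return sdist_name_version(sdist) == wheel_name_version(wheel)


print(wheel_matches_sdist("foo-1.0.tar.gz", "foo-1.0-py2.py3-none-any.whl"))  # True
print(wheel_matches_sdist("foo-1.0.tar.gz", "bar-3.0-py2.py3-none-any.whl"))  # False
```

Because the check needs nothing beyond the two filenames, an index can decide who is allowed to upload a file before looking inside it, which is part of what makes the 1:1 rule easy to enforce.
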
I could possibly envision something that allowed a sdist to produce differently named wheels, but it would require a project to explicitly declare up front what names it will produce, and require registering those names with PyPI before you could upload a sdist that could produce those named wheels. Ultimately, I don't think the very minor benefits are worth the additional complexity and pain of trying to adapt all of the tooling and human expectations to this.

"pip needs sdists that have full wheel metadata in static form"

I think I could come around to the idea that some metadata doesn't make sense for a sdist, and that it really needs to be a part of wheels but not a part of sdists. I think that the argument needs to be made in the other direction though: we should assume that all metadata will be included as part of the sdist, and then make an argument for why each particular piece of metadata is wheel specific rather than specific to a particular version of a project. Things like name, version, description, classifiers, etc. are easily classified as specific to a particular (name, version) tuple. Other things like "Python ABI" are easily classified as specific to a particular wheel.

-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
_______________________________________________
Distutils-SIG maillist - [email protected]
https://mail.python.org/mailman/listinfo/distutils-sig
