Hi all,

after some more discussion and hours spent by Carl Meyer (who is now co-authoring the PEP) and me, here is a new V3 pre-submit draft. It is more ambitious than the previous draft, as should be obvious from the modified abstract (and from Carl Meyer's and Philip's earlier interactions on this list). It also gives more detail on how the current link-scraping works, among other improvements and incorporations of feedback from discussions here.
We intend to submit this draft tonight to the PEP editors. Feedback now and later remains welcome. I am sure there are issues to be sorted out and clarified, among them the versioning-API suggestion by Marc-Andre.

Thanks for everybody's support and feedback so far,
holger

PEP: XXX
Title: Transitioning to release-file hosting on PyPI
Version: $Revision$
Last-Modified: $Date$
Author: Holger Krekel <hol...@merlinux.eu>, Carl Meyer <c...@oddbird.net>
Discussions-To: catalog-sig@python.org
Status: Draft (PRE-submit V3)
Type: Process
Content-Type: text/x-rst
Created: 10-Mar-2013
Post-History:


Abstract
========

This PEP proposes a backward-compatible two-phase transition process to make installing from the pypi.python.org (PyPI) package index faster, simpler and more robust. To ease the transition and minimize client-side friction, **no changes to distutils or existing installation tools are required in order to benefit from the transition phases, which are expected to result in faster, more reliable installs for most existing packages**.

The first transition phase implements easy and explicit means for a package maintainer to control which release file links are served to present-day installation tools. The first phase also includes the implementation of analysis tools for present-day packages, to support communication with package maintainers and the automated setting of default modes for controlling release file links.

The second transition phase will result in the current PyPI index serving only PyPI-hosted files by default. Externally hosted files will still be automatically discoverable through a second index. Present-day installation tools will be able to continue working by specifying this second index. New versions of installation tools shall default to installing packages only from PyPI unless the user explicitly chooses to include non-PyPI sites.


Rationale
=========

.. _history:

History and motivations for external hosting
--------------------------------------------

When PyPI went online, it offered release registration but had no facility to host release files itself. When hosting was added, no automated downloading tool existed yet. When Philip Eby implemented automated downloading (through setuptools), he chose to allow people to use download hosts of their choice. The discovery of externally-hosted packages was implemented as follows:

#. The PyPI ``simple/`` index for a package contains all links found anywhere in that package's metadata for any release. Links in the "Download-URL" and "Home-page" metadata fields are given ``rel=download`` and ``rel=homepage`` attributes, respectively.

#. Any of these links whose target is a file whose name appears to be in the form of an installable source or binary distribution, with a basename in the form "packagename-version.ARCHIVEEXT", is considered a potential installation candidate.

#. Similarly, any link suffixed with an "#egg=packagename-version" fragment is considered an installation candidate.

#. Additionally, the ``rel=homepage`` and ``rel=download`` links are followed and, if HTML, are themselves scraped for release-file links in the above formats.

Today, most packages released on PyPI host their release files on PyPI, but a small percentage (XXX need updated data) rely on external hosting.

There are many reasons [2]_ why people have chosen external hosting. To cite just a few:

- release processes and scripts have already been developed and upload to external sites

- it takes too long to upload large files from some places in the world

- export restrictions, e.g. for crypto-related software

- company policies which require offering open source packages through their own sites

- problems with integrating uploading to PyPI into one's release process (because of release policies)

- desiring download statistics different from those maintained by PyPI

- perceived bad reliability of PyPI

- not being aware that PyPI offers file hosting

Irrespective of the present-day validity of these reasons, there clearly is a history behind why people choose to host files externally, and for some time it was even the only way to do things.

Problem
-------

**Today, python package installers (pip, easy_install, buildout, and others) often need to query many non-PyPI URLs even if there are no externally hosted files**. Apart from querying pypi.python.org's simple index pages, an installer also crawls all homepages and download pages ever specified with any release of a package. The need for installers to crawl external sites slows down installation and makes for a brittle and unreliable installation process. Those sites and packages also don't take part in the :pep:`381` mirroring infrastructure, further decreasing the reliability and speed of automated installation processes around the world.

Most packages are hosted directly on pypi.python.org [1]_. Even for these packages, installers still crawl the homepage(s) of a package. Many package uploaders are not aware that specifying the "homepage" in their release process will slow down the installation process for all users.

Relying on third-party sites also opens up more attack vectors for injecting malicious packages into sites using automated installs. A simple attack might just involve getting hold of an old, now-unused homepage domain and placing malicious packages there. Moreover, performing a man-in-the-middle (MITM) attack between an installation site and any of the download sites can inject malicious packages on the installation site.
As many homepages and download locations use HTTP and not HTTPS, such attacks are not hard to launch. Such MITM attacks can easily happen even for packages which never intended to host files externally, as their homepages are contacted by installers anyway.

There is currently no way for package maintainers to avoid third-party crawling, other than removing all homepage/download URL metadata for all historic releases. While a script [3]_ has been written to perform this action, it is not a good general solution because it removes semantic information like the "homepage" specification from PyPI packages.

Even if the "Homepage" and "Download-URL" links were not scraped for further links, there is still no way under the current system for a package owner to link to an installable file from their package metadata without installation tools automatically considering that file a candidate for installation.


Solution / two transition phases
================================

The first transition phase starts off by introducing a "hosting-mode" field for each project on PyPI, allowing explicit control of which machine-readable release file links are served to present-day installation tools. After individual early adopters have successfully changed their hosting mode, the first transition phase will then set a default hosting mode for existing packages, based on automated analysis. **Maintainers will be notified one month ahead of any such automated change**. At completion of the first transition phase, **all present-day existing release and installation processes and tools are expected to continue working**. Any remaining errors or problems are expected to relate only to the installation of individual packages and can be easily corrected by package maintainers, or by PyPI admins if maintainers are not reachable.

**The second transition phase will then, after a three-month warning period, make PyPI serve only links for PyPI-hosted packages under the present-day** ``simple/`` **index**.
At this point, present-day installation tools will no longer see externally hosted links, unless they specify a new ``simple/-with-externals`` index which PyPI MUST offer ahead of the start of the second transition phase. This new index contains the external links as controlled by a package maintainer. Moreover, PyPI MUST also provide means to register and control download links, independently of the current metadata and remote HTML-scraping methods.

At completion of the second transition phase, all present-day installation tools will, and all future releases of installation tools SHALL, default to installing only PyPI-hosted packages unless a user specifies option(s) to include external links or the external index. If an installation tool chooses to use the new ``simple/-with-externals/`` as a default, it MUST warn the user with a precise message stating which external links were followed.

Maintainers of packages which currently host release files on non-PyPI sites shall receive instructions and tools to ease "re-hosting" of their historic and future package release files. The implementation of such a re-hosting tool is expected but NOT REQUIRED to be available at the beginning of phase 2.


Implementation
==============

The foundation of both transition phases is the introduction of three "modes" of PyPI hosting for a package, affecting which links are generated for the ``simple/`` index in transition phase 1. These modes require no changes to installation tools; they are implemented by changing the algorithm that generates the machine-readable ``simple/`` index. The modes are:

- ``pypi-ext-crawl``: no change from the current situation of generating machine-readable links for installation tools, as outlined in the history_.

- ``pypi-ext``: for a package in this mode, the "Home-page" and "Download-URL" links added to the simple index are given ``rel=ext-homepage`` and ``rel=ext-download`` attributes instead of ``rel=homepage`` and ``rel=download``. The effect of this (with no change in installation tools necessary) is that these links will not be followed and scraped for further candidate links. Only installable files linked directly from PyPI metadata (wherever they are hosted) will be considered for installation.

- ``pypi-only``: for a package in this mode, only links to URLs on PyPI itself will be added to the simple index.

At the end of the warning period of transition phase 2, the ``simple/`` index will be restricted to showing only links to URLs on PyPI itself, while the ``simple/-with-externals`` index will, during both transition phases, show links to PyPI and to any externals as controlled by the package maintainer and the hosting mode.

For a package in ``pypi-only`` mode, external links will no longer be automatically scraped from metadata and added to the two indexes. However, PyPI will expose an interface for package maintainers to explicitly specify any number of URLs to externally hosted installable files for a given release, and these URLs will be added to the ``simple/-with-ext`` index page for that project but NOT to the basic ``simple/`` index page.

Thus the ``-with-ext`` alternative index gives package owners with good reason to host their packages elsewhere a means to do so (even under the ``pypi-only`` package mode) and still have that information reflected on PyPI in machine-readable form. This allows users of installation tools an explicit and easy choice of whether they wish to read an index that includes externally-hosted packages or one that does not.

The goal of this PEP is that eventually all projects on PyPI can be migrated to the ``pypi-only`` mode, while preserving the ability to install release files hosted by third parties in an automated manner. Deprecation of hosting modes to eventually allow only the ``pypi-only`` mode is NOT REGULATED by this PEP but is expected to become feasible some time after successful implementation of the two transition phases described in this PEP.
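
As an illustration of the three hosting modes described above, here is a minimal sketch of how the links for a project's ``simple/`` page might be selected. It is not the PyPI implementation: the function names, the ``PYPI_HOST`` constant and the representation of links as ``(url, rel)`` pairs are assumptions made for this example, and for simplicity the sketch drops the "Home-page" and "Download-URL" links entirely in ``pypi-only`` mode.

```python
from urllib.parse import urlparse

# Assumption: a link counts as "on PyPI itself" when its host is this name.
PYPI_HOST = "pypi.python.org"

# rel attributes given to the Home-page and Download-URL links in each mode.
REL_BY_MODE = {
    "pypi-ext-crawl": {"home_page": "homepage", "download_url": "download"},
    "pypi-ext": {"home_page": "ext-homepage", "download_url": "ext-download"},
}


def is_on_pypi(url):
    """Return True if the URL points at the (assumed) PyPI host."""
    return urlparse(url).hostname == PYPI_HOST


def simple_index_links(mode, file_links, home_page=None, download_url=None):
    """Return (url, rel) pairs for a project's simple/ page (hypothetical helper).

    file_links are direct links to release files, wherever they are hosted;
    rel is None for such plain links.
    """
    if mode not in ("pypi-ext-crawl", "pypi-ext", "pypi-only"):
        raise ValueError("unknown hosting mode: %r" % (mode,))
    links = [(url, None) for url in file_links]
    if mode == "pypi-only":
        # Only links to URLs on PyPI itself are served.
        return [(url, rel) for url, rel in links if is_on_pypi(url)]
    rels = REL_BY_MODE[mode]
    if home_page:
        links.append((home_page, rels["home_page"]))
    if download_url:
        links.append((download_url, rels["download_url"]))
    return links
```

In ``pypi-ext`` mode the homepage link is still present but carries ``rel=ext-homepage``, so present-day tools, which only follow ``rel=homepage`` and ``rel=download``, would not crawl it.
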
Implementation and interaction timeline
---------------------------------------

The proposed solution consists of multiple implementation and communication steps:

#. Implement in PyPI the three modes and the ``-with-ext`` index as described above, and an interface for package owners to select the mode for each package and register explicit external file URLs for the ``-with-ext`` index (for projects in the ``pypi-only`` mode). Default all newly-registered packages to ``pypi-only`` mode (but package owners can still switch to the other modes as desired). Implement in ``pep381client`` the mirroring of the ``-with-ext`` index pages.

#. Determine which packages have installable versions available that are linked only from homepage/download pages (group B) and which packages have all installable files available on PyPI itself (group A).

#. Send mail to maintainers of projects in group A that their project is going to be automatically configured to ``pypi-ext`` mode in one month. Inform them that this change is not expected to affect installability of their project at all, but will result in faster and safer installs for their users. Encourage them to set this mode (or ``pypi-only``) themselves earlier to benefit their users.

#. Send mail to maintainers of packages in group B that their package hosting mode is ``pypi-ext-crawl``, list the sites which currently are crawled, and suggest that they re-host their packages directly on PyPI and then switch to ``pypi-only``. Provide instructions and tools to help with this "re-uploading" process.

In addition, maintainers of installation tools are asked to release two updates. The first one shall provide clear warnings if externally-hosted packages (that is, packages at a URL whose domain name differs from the domain name of the index URL in use) are selected for download, stating exactly for which projects and URLs this happens, and that in future versions externally-hosted downloads will be disabled by default.
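
The "externally-hosted" test and the warnings asked for in the first tool update could look roughly like the sketch below. It assumes that "domain name" means the URL's host name as parsed by ``urllib.parse``; the function names and the wording of the warning are hypothetical and not taken from any existing installation tool.

```python
from urllib.parse import urlparse


def is_externally_hosted(file_url, index_url):
    """True if the file's host differs from the host of the index URL in use."""
    return urlparse(file_url).hostname != urlparse(index_url).hostname


def external_download_warnings(index_url, selected):
    """Build one warning per externally hosted file selected for download.

    selected is an iterable of (project_name, file_url) pairs.
    """
    index_host = urlparse(index_url).hostname
    warnings = []
    for project, file_url in selected:
        if is_externally_hosted(file_url, index_url):
            warnings.append(
                "%s: %s is hosted outside %s; externally hosted downloads "
                "will be disabled by default in a future version"
                % (project, file_url, index_host)
            )
    return warnings
```
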
The second update for installation tools should change the default mode to allow only installation of package files hosted at the index domain, and allow installation of externally-hosted packages only when the user supplies an option (ideally an option specifying exactly which external domains are to be trusted as download sources). When download of an externally-hosted package is disallowed, the user should be notified, with instructions for how to make the install succeed and warnings about the potential consequences. It is expected that tools in this release may choose to change the default index URL to ``https://pypi.python.org/simple/-with-ext`` in order to support explicitly-registered external URLs for projects in ``pypi-only`` mode. Tools may choose to do this only when the user requests installation of externally-hosted packages, or may choose to do this in all cases so as to be able to notify users when an externally-hosted file is available.

Specific timelines for deprecation of the ``pypi-ext-crawl`` and ``pypi-ext`` modes are not mandated in this PEP; this will depend on the observed behavior of package owners and the availability of tooling. It is expected that ``pypi-ext-crawl`` mode will be an early candidate for deprecation; it may be necessary to leave ``pypi-ext`` mode in place for quite some time, at least for those packages already depending on it (it may be removed as an option for new packages when tool support for explicit external URLs and the ``-with-ext`` index is sufficient).


Open questions
==============

- Should we introduce a third index which maintains the old behaviour of providing links irrespective of a maintainer's hosting-mode choice?

- Should we introduce some form of PyPI API versioning in this PEP? (It might complicate matters and delay the implementation, but is often seen as good practice.)


References
==========

.. [1] Donald Stufft, ratio of externally hosted versus pypi-hosted,
   http://mail.python.org/pipermail/catalog-sig/2013-March/005549.html
   (XXX need to update this data for all easy_install-supported formats)

.. [2] Marc-Andre Lemburg, reasons for external hosting,
   http://mail.python.org/pipermail/catalog-sig/2013-March/005626.html

.. [3] Holger Krekel, script to remove homepage/download metadata for all releases,
   http://mail.python.org/pipermail/catalog-sig/2013-February/005423.html


Acknowledgements
================

Philip Eby for precise information and the basic ideas to implement the transition via server-side changes only.

Donald Stufft for pushing away from external hosting and offering to implement both a Pull Request for the necessary PyPI changes and the analysis tool to drive transition phase 1.

Marc-Andre Lemburg, Nick Coghlan and catalog-sig in general for thinking through issues regarding getting rid of "external hosting".


Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: