Thanks, Holger. This version looks a lot better :-) There are still some minor quirks which would need to be addressed more explicitly, but overall, this proposal provides a good way forward.
Perhaps it would also be possible to add the secured download links and the caching/proxying ideas to the PEP at some point, or to turn those into a new PEP. I can't follow up in detail today, but will have a closer look next week.

On 15.03.2013 10:29, holger krekel wrote:
> Hi all, in particular Philip, Marc-Andre, Donald,
>
> Carl and I decided to simplify the PEP and avoid the somewhat
> awkward ``simple/-with-externals`` index for various reasons, among them
> Marc-Andre's criticisms. This also means present-day installation tools
> (shipped with Redhat/Debian/etc.) will continue to work as today for
> those packages which remain in a hosting-mode that requires crawling and
> scraping. They will still benefit from the fact that most packages will
> soon have a hosting-mode that avoids it. Future releases of installation
> tools will default to not crawling or using (scraped) external
> links, and new PyPI projects will default to only serving uploaded files.
>
> The V4 pre-PEP also renames the three PyPI hosting modes to be more
> descriptive. Since all three modes allow external links, "pypi-ext" vs
> "pypi-only" were misleading. The new naming distinguishes the mode that both
> scrapes links from metadata and crawls external pages for more links
> ("pypi-scrape-crawl") from the mode that only scrapes links from metadata
> ("pypi-scrape") and from the mode where all links are explicit
> ("pypi-explicit").
>
> Without the separate external index, it also turns out that the two transition
> phases are separated into PyPI changes (phase one) and installer-tool
> updates (phase two). There are no PyPI changes necessary in phase two.
> As stated in a new open question, it should be possible to do
> PEP-related installation tool updates during phase 1, though that may
> still require a bit of clarification in the PEP's language.
>
> Carl and I are happy with this PEP version now and hope you all are as
> well.
> Donald is already working on improving the analysis tool so
> we hopefully have some updated numbers soon.
>
> cheers,
>
> Holger
>
>
> PEP: XXX
> Title: Transitioning to release-file hosting on PyPI
> Version: $Revision$
> Last-Modified: $Date$
> Author: Holger Krekel <hol...@merlinux.eu>, Carl Meyer <c...@oddbird.net>
> Discussions-To: catalog-sig@python.org
> Status: Draft (PRE-submit V4)
> Type: Process
> Content-Type: text/x-rst
> Created: 10-Mar-2013
> Post-History:
>
>
> Abstract
> ========
>
> This PEP proposes a backward-compatible two-phase transition process
> to speed up, simplify and robustify installing from the
> pypi.python.org (PyPI) package index. To ease the transition and
> minimize client-side friction, **no changes to distutils or existing
> installation tools are required in order to benefit from the first
> transition phase, which will result in faster, more reliable installs
> for most existing packages**.
>
> The first transition phase implements an easy and explicit means for a
> package maintainer to control which release file links are served to
> present-day installation tools. The first phase also includes the
> implementation of analysis tools for present-day packages, to support
> communication with package maintainers and the automated setting of
> default modes for controlling release file links. The first phase
> will also make new projects on PyPI default to serving only
> links to release files which were uploaded to PyPI.
>
> The second transition phase concerns end-user installation tools,
> which shall default to only install release files that are hosted on
> PyPI and tell the user if external release files exist, offering
> a choice to automatically use those external files.
>
>
> Rationale
> =========
>
> .. _history:
>
> History and motivations for external hosting
> --------------------------------------------
>
> When PyPI went online, it offered release registration but had no
> facility to host release files itself. When hosting was added, no
> automated downloading tool existed yet. When Philip Eby implemented
> automated downloading (through setuptools), he made the choice to
> allow people to use download hosts of their choice. The finding of
> externally-hosted packages was implemented as follows:
>
> #. The PyPI ``simple/`` index for a package contains all links found
>    by scraping them from that package's long_description metadata for
>    any release. Links in the "Download-URL" and "Home-page" metadata
>    fields are given ``rel=download`` and ``rel=homepage`` attributes,
>    respectively.
>
> #. Any of these links whose target is a file whose name appears to be
>    in the form of an installable source or binary distribution, with
>    name in the form "packagename-version.ARCHIVEEXT", is considered a
>    potential installation candidate by installation tools.
>
> #. Similarly, any links suffixed with an "#egg=packagename-version"
>    fragment are considered installation candidates.
>
> #. Additionally, the ``rel=homepage`` and ``rel=download`` links are
>    crawled by installation tools and, if HTML, are themselves scraped
>    for release-file links in the above formats.
>
> Today, most packages released on PyPI host their release files on
> PyPI, but a small percentage (XXX need updated data) rely on external
> hosting.
>
> There are many reasons [2]_ why people have chosen external
> hosting. To cite just a few:
>
> - release processes and scripts have been developed already and upload
>   to external sites
>
> - it takes too long to upload large files from some places in the
>   world
>
> - export restrictions, e.g. for crypto-related software
>
> - company policies which require offering open source packages
>   through the company's own sites
>
> - problems with integrating uploading to PyPI into one's release
>   process (because of release policies)
>
> - desiring download statistics different from those maintained by PyPI
>
> - perceived bad reliability of PyPI
>
> - not being aware that PyPI offers file-hosting
>
> Irrespective of the present-day validity of these reasons, there
> clearly is a history of why people choose to host files externally,
> and for some time it even was the only way you could do things. This PEP
> takes the position that there are at least some valid reasons for
> external hosting.
>
> Problem
> -------
>
> **Today, Python package installers (pip, easy_install, buildout, and
> others) often need to query many non-PyPI URLs even if there are no
> externally hosted files**. Apart from querying pypi.python.org's
> simple index pages, all homepages and download pages ever
> specified with any release of a package are also crawled by an installer.
> The need for installers to crawl external sites slows down
> installation and makes for a brittle and unreliable installation
> process. Those sites and packages also don't take part in the
> :pep:`381` mirroring infrastructure, further decreasing reliability
> and speed of automated installation processes around the world.
>
> Most packages are hosted directly on pypi.python.org [1]_. Even for
> these packages, installers still crawl their homepage and
> download-url, if specified. Many package uploaders are not aware that
> specifying the "homepage" or "download-url" in their package metadata
> will needlessly slow down the installation process for all users.
>
> Relying on third party sites also opens up more attack vectors for
> injecting malicious packages into sites using automated installs. A
> simple attack might just involve getting hold of an old now-unused
> homepage domain and placing malicious packages there.
> Moreover, performing a Man-in-The-Middle (MITM) attack between an
> installation site and any of the download sites can inject malicious
> packages on the installation site. As many homepages and download
> locations are using HTTP and not HTTPS, such attacks are not hard to
> launch. Such MITM attacks can easily happen even for packages which
> never intended to host files externally, as their homepages are
> contacted by installers anyway.
>
> There is currently no way for package maintainers to avoid
> external-link crawling, other than removing all homepage/download url
> metadata for all historic releases. While a script [3]_ has been
> written to perform this action, it is not a good general solution
> because it removes useful metadata from PyPI releases.
>
> Even if the sites referenced by "Homepage" and "Download-URL" links were
> not scraped for further links, there is no obvious way under the current
> system for a package owner to link to an installable file from a
> long_description metadata field (which is shown as package documentation
> on ``/pypi/PKG``) without installation tools automatically considering
> that file a candidate for installation. Conversely, there is no way
> to explicitly register multiple external release files without
> putting them in metadata fields.
>
>
> Goals
> -----
>
> These are the goals to be achieved by implementation of this PEP:
>
> * Package owners should be able to explicitly control which files are
>   presented by PyPI to installer tools as installation
>   candidates. Installation should not be slowed and made less reliable
>   by extensive and unnecessary crawling of links that package owners
>   did not explicitly nominate as installation files.
>
> * It should remain possible for package owners to choose to host their
>   release files on their own hosting, external to PyPI. It should be
>   easy for a user to request the installation of such releases using
>   automated installer tools.
>
> * Automated installer tools should not install externally-hosted
>   packages **by default**, but only when explicitly authorized to do
>   so by the user. When tools refuse to install such a package by
>   default, they should tell the user exactly which external link(s)
>   they would need to follow, and what option(s) the user can provide
>   to authorize the tool to follow those links. PyPI should provide all
>   necessary metadata for installer tools to implement this easily
>   and within a single request/reply interaction.
>
> * Migration from the status quo to the above points should be gradual
>   and minimize breakage. This includes tooling that makes it easy for
>   package owners with an existing release process that uploads to
>   non-PyPI hosting to also upload those release files to PyPI.
>
>
> Solution / two transition phases
> ================================
>
> The first transition phase introduces a "hosting-mode" field for each
> project on PyPI, allowing package owners explicit control of which
> release file links are served to present-day installation tools in the
> machine-readable ``simple/`` index. The first transition will, after
> successful hosting-mode manipulations by individual early-adopters,
> set a default hosting mode for existing packages, based on
> automated analysis. **Maintainers will be notified one month ahead of
> any such automated change**. At completion of the first transition
> phase, **all present-day existing release and installation processes
> and tools are expected to continue working**. Any remaining errors or
> problems are expected to only relate to installation of individual
> packages and can be easily corrected by package maintainers or PyPI
> admins if maintainers are not reachable.
>
> Also in the first phase, each link served in the ``simple/`` index
> will be explicitly marked as ``rel="internal"`` (hosted by the index
> itself) or ``rel="external"`` (linking to an external site that is not
> part of the index).
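
For installer authors, the effect of this marking can be sketched in a few lines of Python. This is only a rough sketch under assumptions, not part of the PEP: it assumes the ``simple/`` page is plain HTML whose anchors carry the proposed ``rel`` values, and the class ``SimpleIndexLinks``, the function ``candidate_links``, the ``allow_external`` flag and the sample page with its URLs are all made up for illustration.

```python
from html.parser import HTMLParser


class SimpleIndexLinks(HTMLParser):
    """Collect hrefs from a simple/ page, grouped by their rel value."""

    def __init__(self):
        super().__init__()
        self.internal = []
        self.external = []
        self.other = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if href is None:
            return
        # rel may hold several space-separated tokens, or be missing.
        rel = (attrs.get("rel") or "").split()
        if "internal" in rel:
            self.internal.append(href)
        elif "external" in rel:
            self.external.append(href)
        else:
            self.other.append(href)


def candidate_links(page_html, allow_external=False):
    """Return (usable, skipped) links for one simple/ page.

    Hypothetical policy: internal links are always usable; external
    links are usable only if the user opted in, otherwise they are
    returned as skipped so the tool can report them.
    """
    parser = SimpleIndexLinks()
    parser.feed(page_html)
    parser.close()
    if allow_external:
        return parser.internal + parser.external, []
    return parser.internal, parser.external


# Invented sample page; the URLs are placeholders, not real files.
SAMPLE_PAGE = """
<html><body>
<a rel="internal" href="../../packages/source/e/example/example-1.0.tar.gz">example-1.0.tar.gz</a>
<a rel="external" href="http://downloads.example.com/example-1.1.tar.gz">example-1.1.tar.gz</a>
</body></html>
"""

usable, skipped = candidate_links(SAMPLE_PAGE)
print("usable:", usable)
print("skipped (external):", skipped)
```

With the default settings only the index-hosted link comes back as usable, and the external one is handed back separately so that a tool could tell the user which link it did not follow.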
>
> In the second transition phase, PyPI client installation tools shall
> be updated to default to only install ``rel="internal"`` packages
> unless a user specifies option(s) to permit installing from external
> links.
>
> Maintainers of packages which currently host release files on non-PyPI
> sites shall receive instructions and tools to ease "re-hosting" of
> their historic and future package release files. This re-hosting tool
> MUST be available before automated hosting-mode changes are announced
> to package maintainers.
>
>
> Implementation
> ==============
>
> Hosting modes
> -------------
>
> The foundation of the first transition phase is the introduction of
> three "modes" of PyPI hosting for a package, affecting which links are
> generated for the ``simple/`` index. These modes are implemented
> without requiring changes to installation tools via changes to the
> algorithm for generating the machine-readable ``simple/`` index.
>
> The modes are:
>
> - ``pypi-scrape-crawl``: no change from the current situation of
>   generating machine-readable links for installation tools, as
>   outlined in the history_.
>
> - ``pypi-scrape``: for a package in this mode, links to be added to
>   the ``simple/`` index are still scraped from package
>   metadata. However, the "Home-page" and "Download-url" links are
>   given ``rel=ext-homepage`` and ``rel=ext-download`` attributes
>   instead of ``rel=homepage`` and ``rel=download``. The effect of this
>   (with no change in installation tools necessary) is that these links
>   will not be followed and scraped for further candidate links by
>   present-day installation tools: only installable files directly
>   hosted on PyPI or linked directly from PyPI metadata will be
>   considered for installation. Installation tools MAY evolve to offer
>   an option to use the new rel-attribution to crawl external pages but
>   MUST NOT default to it.
>
> - ``pypi-explicit``: for a package in this mode, only links to release
>   files uploaded to PyPI, and external links to release files
>   explicitly nominated by the package owner (via a new interface
>   exposed by PyPI) will be added to the ``simple/`` index.
>
> Thus the hope is that eventually all projects on PyPI can be migrated
> to the ``pypi-explicit`` mode, while preserving the ability to install
> release files hosted externally via installer tools. Deprecation of
> hosting modes to eventually only allow the ``pypi-explicit`` mode is
> NOT REGULATED by this PEP but is expected to become feasible some time
> after successful implementation of the transition phases described in
> this PEP. It is expected that deprecation requires **a new process to deal
> with abandoned packages** because of unreachable maintainers for still
> popular packages.
>
>
> First transition phase (PyPI)
> -----------------------------
>
> The proposed solution consists of multiple implementation and
> communication steps:
>
> #. Implement in PyPI the three modes described above, with an
>    interface for package owners to select the mode for each package
>    and register explicit external file URLs.
>
> #. For packages in all modes, label all links in the ``simple/`` index
>    with ``rel="internal"`` or ``rel="external"``, to make it easier
>    for client tools to distinguish the types of links in the second
>    transition phase.
>
> #. Default all newly-registered packages to ``pypi-explicit`` mode
>    (package owners can still switch to the other modes as desired).
>
> #. Determine (via an automated analysis tool) which packages have all
>    installable files available on PyPI itself (group A), which have
>    all installable files linked directly from PyPI metadata (group B),
>    and which have installable versions available that are linked only
>    from external homepage/download HTML pages (group C).
>
> #. Send mail to maintainers of projects in group A that their project
>    will be automatically configured to ``pypi-explicit`` mode in one
>    month, and similarly to maintainers of projects in group B that
>    their project will be automatically configured to ``pypi-scrape``
>    mode. Inform them that this change is not expected to affect
>    installability of their project at all, but will result in faster
>    and safer installs for their users. Encourage them to set this
>    mode themselves sooner to benefit their users.
>
> #. Send mail to maintainers of packages in group C that their package
>    hosting mode is ``pypi-scrape-crawl``, list the URLs which
>    currently are crawled, and suggest that they either re-host their
>    packages directly on PyPI and switch to ``pypi-explicit``, or at
>    least provide direct links to release files in PyPI metadata and
>    switch to ``pypi-scrape``. Provide instructions and tools to help
>    with these transitions.
>
>
> Second transition phase (installer tools)
> -----------------------------------------
>
> For the second transition phase, maintainers of installation tools are
> asked to release two updates.
>
> The first update shall provide clear warnings if externally-hosted
> release files (that is, files whose link is ``rel="external"``) are
> selected for download, state exactly for which projects and URLs this
> happens, and warn that in future versions externally-hosted downloads
> will be disabled by default.
>
> The second update should change the default mode to allow only
> installation of ``rel="internal"`` package files, and allow
> installation of externally-hosted packages only when the user supplies
> an option (ideally an option specifying exactly which external domains
> are to be trusted as download sources).
> When download of an externally-hosted package is disallowed, the user
> should be notified, with instructions for how to make the install
> succeed and warnings about the implication (that a file will be
> downloaded from a site that is not part of the package index).
>
>
> Open questions / Tasks
> ===========================
>
> - Should we introduce some form of PyPI API versioning in this PEP?
>   (it might complicate matters and delay the implementation but is
>   often seen as good practise).
>
> - in pypi-scrape mode: does PyPI itself determine which links are
>   installation candidates and avoid presenting other random links
>   (which are currently served)?
>
> - consider that installation tools may choose to release updates
>   during transition phase 1 already, to warn about crawling and scraped
>   links (which are easily identifiable today, and remain so with the
>   new rel-attribution after transition phase 1).
>
>
> References
> ==========
>
> .. [1] Donald Stufft, ratio of externally hosted versus pypi-hosted,
>    http://mail.python.org/pipermail/catalog-sig/2013-March/005549.html (XXX need
>    to update this data for all easy_install-supported formats)
>
> .. [2] Marc-Andre Lemburg, reasons for external hosting,
>    http://mail.python.org/pipermail/catalog-sig/2013-March/005626.html
>
> .. [3] Holger Krekel, Script to remove homepage/download metadata for all
>    releases,
>    http://mail.python.org/pipermail/catalog-sig/2013-February/005423.html
>
> Acknowledgments
> ================
>
> Philip Eby for precise information and the basic ideas to implement
> the transition via server-side changes only.
>
> Donald Stufft for pushing away from external hosting and offering to
> implement both a Pull Request for the necessary PyPI changes and the
> analysis tool to drive the transition phase 1.
>
> Marc-Andre Lemburg, Nick Coghlan and catalog-sig in general for
> thinking through issues regarding getting rid of "external hosting".
>
> Copyright
> =========
>
> This document has been placed in the public domain.
>
>
>
> ..
>    Local Variables:
>    mode: indented-text
>    indent-tabs-mode: nil
>    sentence-end-double-space: t
>    fill-column: 70
>    coding: utf-8
>    End:
>
> _______________________________________________
> Catalog-SIG mailing list
> Catalog-SIG@python.org
> http://mail.python.org/mailman/listinfo/catalog-sig
>

--
Marc-Andre Lemburg
PSF Vice Chairman