On 12 May 2014 12:27, Donald Stufft <don...@stufft.io> wrote:
>
> On May 11, 2014, at 7:35 PM, Donald Stufft <don...@stufft.io> wrote:
>
> However before I go further on that I want to dig more into the impact
> of these things. It dawned on me earlier today that the way I was
> categorizing things in my earlier number crunching was making it
> unreasonably hard to actually divine any sort of meaning out of those
> numbers. I'm currently in the process of crawling all of PyPI again*,
> after I have those new numbers I'll have a better sense of things and
> I think a better forward plan can be made.
>
>
> I've completed the crawl. I've made the scripts and the data available
> at https://github.com/dstufft/pypi-external-stats.

Thanks for that.

> Here's the general statistics from that:
>
> Hosted on PyPI: 37779
> Hosted Externally (<50%): 18
> Hosted Externally (>50%): 47
> Hosted Externally: 65
> Hosted Unsafely (<50%): 725
> Hosted Unsafely (>50%): 2249
> Hosted Unsafely: 2974

From counting the number of "external1" packages in the JSON data you linked, I take it "external1" & "external2" correspond to < 50% and > 50% (and ditto for "unsafe1" and "unsafe2")?

"pyOpenSSL" is the main one that catches my eye in the externally hosted category, but closer investigation shows that it is being thrown off by an older external link for 0.11. All other releases, including the newer 0.12, 0.13 and 0.14 releases, are PyPI hosted.

(If it's practical, a "latest" release vs "any" release split would be even more useful than the current more or less than 50% split - if the latest release is externally hosted, silently receiving an older version can actually be more problematic than not receiving a version at all, and cases like pyOpenSSL show that even this new categorisation may be overstating the number of packages relying on external hosting).

There are some more notable names in the "unsafe" lists, but a few spot checks on projects like PyGObject, PyGTK, biopython, dbus-python, django-piston, ipaddr, matplotlib, and mayavi showed that a number of them *have* switched to PyPI hosting for recent releases, but have left older releases as externally hosted. (A few notable names, like wxPython and Spyder, *did* show up as genuinely externally hosted. Something that would be nice to be able to do, but isn't really practical without a server side dependency graph, is to be able to figure out how many packages have an externally hosted dependency *somewhere in their dependency chain*, and *how many* other projects are depending on particular externally hosted projects transitively).

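As a rough illustration of the "latest" vs "any" split suggested above, here is a minimal Python sketch. The function name, the (version, hosting) tuple layout and the hosting labels are all assumptions made up for this example; they are not the format of the JSON data in the linked repository. The sample data is the pyOpenSSL situation described above (0.11 external, 0.12 through 0.14 on PyPI).

```python
def classify_hosting(releases):
    """Summarise hosting for one project.

    `releases` is a list of (version_tuple, hosting) pairs, where hosting
    is one of "pypi", "external" or "unsafe". This layout is hypothetical.
    """
    # The release with the highest version tuple counts as "latest".
    latest_version, latest_hosting = max(releases, key=lambda item: item[0])
    return {
        "latest": latest_hosting,
        "any_off_pypi": any(hosting != "pypi" for _, hosting in releases),
    }


# pyOpenSSL as described above: only the old 0.11 link is external.
pyopenssl = [
    ((0, 11), "external"),
    ((0, 12), "pypi"),
    ((0, 13), "pypi"),
    ((0, 14), "pypi"),
]
print(classify_hosting(pyopenssl))
```

Under this split, pyOpenSSL would count as externally hosted on the "any" measure but as PyPI hosted on the "latest" measure, which is the distinction that matters to someone installing it today.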
Regardless, even with those caveats, the numbers are already solid enough to back up the notion that the only possible reasons to support enabling verified external hosting support independently of unverified external hosting are policy and relationship management ones. Relationship management would just mean providing a deprecation period before removing the capability, but I want to spend some time exploring a possible concrete *policy* related rationale for keeping it.

The main legitimate reason I am aware of for wanting to avoid PyPI hosting is for non-US based individuals and organisations to avoid having to sign up to the "Any uploads of packages must comply with United States export controls under the Export Administration Regulations." requirement that the PSF is obliged to place on uploads to the PSF controlled US hosted PyPI servers. That rationale certainly applies in MAL's case, since eGenix is a German company, and I believe they mostly do business outside the US (for example, their case study in the Python brochure is for a government project in Ghana).

In relation to that, I double checked the egenix-mx-base package, and (as noted earlier in the thread) that is one that *could* be transitively verified, since a hash is provided on PyPI for the linked index pages, which could be used to ensure that the hashes of the download links are correct. That transitive verification could either be done by pip on the fly, or else implemented as a tool that scanned the linked page for URLs once, checked the hash and then POSTed the specific external URLs to PyPI - the latter approach would have the advantage of also speeding up downloads of affected packages by allowing the project to be set to the "pypi-explicit" hosting mode.

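To make the transitive verification idea a bit more concrete, here is a minimal sketch of the "check the page hash, then trust the hashed links on it" step. It is not pip's implementation and not an existing tool: the function and class names are hypothetical, it assumes the download links carry their hash in the URL fragment (e.g. "#sha256=..."), and it works on page bytes already in memory, so fetching the page and POSTing the results to PyPI are left out.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urldefrag


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def verified_links(page_bytes, algorithm, expected_hexdigest):
    """Return {url: (hash_name, hash_value)} for hashed links on the page.

    Raises ValueError if the page itself does not match the hash that
    (by assumption) PyPI recorded for it.
    """
    actual = hashlib.new(algorithm, page_bytes).hexdigest()
    if actual != expected_hexdigest:
        raise ValueError("index page does not match the hash recorded on PyPI")

    collector = _LinkCollector()
    collector.feed(page_bytes.decode("utf-8"))

    links = {}
    for href in collector.hrefs:
        url, fragment = urldefrag(href)
        # Only links with a "name=value" hash fragment can be verified.
        if "=" in fragment:
            hash_name, hash_value = fragment.split("=", 1)
            links[url] = (hash_name, hash_value)
    return links
```

Because the page hash is checked first, the per-file hashes read from the page inherit PyPI's trust, so each download can then be verified against its own hash in the usual way.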
That means the long term fate of a global "--allow-all-verifiable-external" flag really hinges on a policy decision: do we want to ensure it remains possible for non-US software distributors to avoid subjecting their software to US export law, without opening up their users to MITM attacks on other downloads?

Note that the occasionally recommended alternative to external link support, adding a new index URL client side, is in itself a greater risk than allowing verifiable external downloads linked from PyPI, since dependency resolution and package lookups in general aren't scoped by index URL - you're trusting the provider of a custom index to not publish a "new" version of other PyPI packages that overrides the PyPI version (even Linux distros haven't systematically solved that problem, although tools like the yum priorities plugin address most of the issues).

After considering the policy implications, and the deficiencies of the "just run your own index server" approach, I think it makes sense to preserve the "--allow-all-verifiable-external" option indefinitely, even if it's confusing: it means we're leaving the option open for individual projects and organisations to decide to accept a slightly degraded user experience in order to remain free of entanglement with US export restrictions, as well as allowing end users the option to globally enable packages that may not comply with US export restrictions (because they may be hosted outside the US), without opening themselves up to additional security vulnerabilities. By contrast, dropping this feature entirely would mean saying to non-US users "you must agree to US export restrictions in order to participate in PyPI at all", and I don't think we want to go down that path.

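For anyone not familiar with why an extra index URL is the riskier option, here is a deliberately simplified model of the problem. It is not pip's resolver code: the function name, the package versions and the custom index URL are invented for illustration, and the only behaviour it models is "the highest version wins, regardless of which index offered it".

```python
PYPI = "https://pypi.python.org/simple/"
# Hypothetical third-party index added client side.
CUSTOM = "https://index.example.invalid/simple/"


def pick_release(candidates):
    """Pick the highest version, ignoring which index it came from.

    `candidates` is a list of (version_tuple, index_url) pairs for a
    single project name, gathered from every configured index.
    """
    return max(candidates, key=lambda candidate: candidate[0])


# The same project name is offered by both indexes; the custom index
# publishes a "newer" version than the one on PyPI.
candidates = [((2, 3), PYPI), ((99, 0), CUSTOM)]
version, source = pick_release(candidates)
print(version, source)
```

In this model the custom index supplies the installed release even though the user only added it to get some unrelated package, which is exactly the trust problem that a verifiable external link from the project's own PyPI page avoids.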
Under that approach, per-package "--allow-external" settings would still become the recommended solution for installation issues (since it always works, regardless of whether or not the project is set up to do it safely), the "--allow-all-external" option would be deprecated in 1.6 and removed in 1.7, and "--allow-all-verifiable-external" would be added as a non-deprecated spelling for the not-necessarily-subject-to-US-export-laws external hosting support.

At-least-we're-not-dealing-with-ITAR-ly yours,
Nick.

-- 
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia