On May 11, 2014, at 7:35 PM, Donald Stufft <[email protected]> wrote:
> However before I go further on that I want to dig more into the impact of
> these things. It dawned on me earlier today that the way I was categorizing
> things in my earlier number crunching was making it unreasonably hard to
> actually divine any sort of meaning out of those numbers. I'm currently in
> the process of crawling all of PyPI again*, after I have those new numbers
> I'll have a better sense of things and I think a better forward plan can be
> made.

I've completed the crawl. I've made the scripts and the data available at https://github.com/dstufft/pypi-external-stats.

Here are the general statistics from that:

    Hosted on PyPI:             37779

    Hosted Externally (<50%):      18
    Hosted Externally (>50%):      47
    Hosted Externally:             65

    Hosted Unsafely (<50%):       725
    Hosted Unsafely (>50%):      2249
    Hosted Unsafely:             2974

The new data more or less follows what the earlier data has pointed to. However, I've changed my method of categorizing the projects. Previously I had split the projects into "only has files hosted using type X" and "has any files hosted using type X". This categorization made it hard to accurately determine impact. The problem is that a lot of projects have the same files uploaded to PyPI but also available unsafely. A project like this will not be impacted by a change in hosting; however, it wasn't possible to determine this using the previous data.

The new method splits all of the files for a particular project into a set of {PyPI, External, Unsafe}. It puts every file it finds into one of these categories. Finally, once it has filled out the categories for all of them, it removes duplicate files (via exact filenames). It prefers files hosted on PyPI over files hosted externally, and it prefers files hosted externally over those hosted unsafely. This lets projects like the above example accurately represent where the *best* source for their files is, rather than every place a file can be found.
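
As a rough illustration of that deduplication (a sketch only, not the code from the repository; the names `PREFERENCE`, `best_sources`, and `count_by_category` and the sample filenames are made up for this example), the preference order can be expressed like this:

```python
# Hypothetical sketch of the deduplication described above.
# Lower rank means a better source: PyPI beats external, external beats unsafe.
PREFERENCE = {"pypi": 0, "external": 1, "unsafe": 2}


def best_sources(found):
    """Map each filename to the best category it was seen in.

    `found` is an iterable of (filename, category) pairs for one project.
    Filenames are compared exactly, as in the description above.
    """
    best = {}
    for filename, category in found:
        current = best.get(filename)
        if current is None or PREFERENCE[category] < PREFERENCE[current]:
            best[filename] = category
    return best


def count_by_category(found):
    """Count the deduplicated files of one project per category."""
    counts = {category: 0 for category in PREFERENCE}
    for category in best_sources(found).values():
        counts[category] += 1
    return counts


# Made-up example: foo-1.0 is on PyPI and also linked unsafely,
# so it only counts as PyPI hosted after deduplication.
found = [
    ("foo-1.0.tar.gz", "unsafe"),
    ("foo-1.0.tar.gz", "pypi"),
    ("foo-0.9.tar.gz", "external"),
    ("foo-0.9.tar.gz", "unsafe"),
    ("foo-0.8.zip", "unsafe"),
]
print(count_by_category(found))
# prints {'pypi': 1, 'external': 1, 'unsafe': 1}
```

In this made-up project only foo-0.8.zip would be affected by a change to unsafe hosting, even though three of the five links found were unsafe.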

The statistics also split out projects which have > 50% of their files hosted externally or unsafely apart from projects which have < 50% of their files hosted externally or unsafely. The reasoning behind this is that there are projects which have only one or two files hosted externally or unsafely, and the impact of changes in this area is much greater for a project that hosts all of its files externally or unsafely than for one that has just one or two old releases hosted in that fashion. For completeness' sake I've also included the total numbers for each of the split options for easier comparison.

Finally, it's important to note that defining what exactly is an installable file is difficult to do. In this script I've tried to take a maximal stance and err on the side of assuming something is an installable file. Specifically, I do not have any detection of:

* Filenames that do not match the project name (e.g. bar-1.0.tar.gz linked from foo's page).
* Whether the file that is being linked to still exists at all (e.g. 404 or NXDOMAIN).
* Whether the file that is being linked to unpacks successfully and has a setup.py and/or meets the other requirements to be a successfully installed package.
* (pip specific) Whether the file has a sane version number that follows PEP 440 and/or is not a pre-release.

It is also unlikely that these numbers are accurate for any one particular installer. In particular, pip does not support .egg files but this detection does; however, pip and this detection both support .whl files, while setuptools does not.

The rules for detection are essentially:

1. Look at /simple/<foo>/ for that project.
2. Look for any URL with a rel=internal and count it as a PyPI hosted file.
3. Look for any URL that "looks" installable, meaning that the path in the URL ends with one of {.tar, .tar.gz, .tar.bz2, .zip, .tgz, .egg, .whl}, which also has a #<hashname>=<hashvalue> fragment, and count it as an externally hosted file.
4. Look for any URL that "looks" installable which does not have a hash URL fragment and count it as an unsafely hosted file.
5. Look for any URL that does not "look" installable which has a rel of {download, homepage} and process it (that is, fetch the page it points to).
6. Look at the HTML from #5 and look for URLs that look installable, with or without a hash fragment, and count each one as an unsafely hosted file.
7. Deduplicate the found filenames by ensuring that each filename exists for a project only once, with the preference of PyPI > external > unsafe.

* In all places I've used PyPI to mean hosted on PyPI, external to mean hosted externally and safely, and unsafely to mean hosted externally and unsafely.

-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
_______________________________________________ Distutils-SIG maillist - [email protected] https://mail.python.org/mailman/listinfo/distutils-sig
