On May 11, 2014, at 7:35 PM, Donald Stufft <[email protected]> wrote:

> However before I go further on that I want to dig more into the impact of
> these things. It dawned on me earlier today that the way I was categorizing
> things in my earlier number crunching was making it unreasonably hard to
> actually divine any sort of meaning out of those numbers. I'm currently in
> the process of crawling all of PyPI again*, after I have those new numbers
> I'll have a better sense of things and I think a better forward plan can be
> made.


I've completed the crawl. I've made the scripts and the data available at
https://github.com/dstufft/pypi-external-stats.

Here are the general statistics from that:

Hosted on PyPI: 37779
Hosted Externally (<50%): 18
Hosted Externally (>50%): 47
Hosted Externally: 65
Hosted Unsafely (<50%): 725
Hosted Unsafely (>50%): 2249
Hosted Unsafely: 2974

The new numbers more or less follow what the earlier data pointed to. However,
I've changed my method of categorizing the projects. Previously I had split the
projects into "only has files hosted using type X" and "has any files hosted
using type X". This categorization made it hard to accurately determine impact.
The problem is that a lot of projects have the same files uploaded to PyPI but
also available unsafely. A project like this will not be impacted by a change
in hosting; however, it wasn't possible to determine this using the previous
data.

The new method sorts every file it finds for a particular project into one of
three categories: {PyPI, External, Unsafe}. Once it has filled out the
categories for all of the files, it removes duplicate files (via exact
filenames). It prefers files hosted on PyPI over files hosted externally, and
it prefers files hosted externally over those hosted unsafely. As a result,
projects like the above example are counted according to the *best* source for
their files, not according to every place a file can be located.
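
As a rough illustration of that deduplication step, here is a minimal sketch.
It is not the code from the repository; the function name best_sources, the
category labels, and the sample filenames are all made up for the example. The
only thing taken from the description above is the preference order
PyPI > external > unsafe applied per exact filename.

```python
# Lower number = more preferred. The order follows the description above;
# the string labels themselves are hypothetical.
PREFERENCE = {"pypi": 0, "external": 1, "unsafe": 2}


def best_sources(found):
    """Map each exact filename to its most preferred hosting category.

    `found` is an iterable of (filename, category) pairs for one project.
    """
    best = {}
    for filename, category in found:
        current = best.get(filename)
        if current is None or PREFERENCE[category] < PREFERENCE[current]:
            best[filename] = category
    return best


# Hypothetical project "foo": one release is on PyPI and also linked
# unsafely, so only the PyPI copy should count.
found = [
    ("foo-1.0.tar.gz", "unsafe"),
    ("foo-1.0.tar.gz", "pypi"),
    ("foo-0.9.tar.gz", "external"),
    ("foo-0.9.tar.gz", "unsafe"),
    ("foo-0.8.zip", "unsafe"),
]
result = best_sources(found)
```

With this input only foo-0.8.zip is left in the unsafe category, which is the
case that would actually be affected by a change in hosting.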

The statistics also split out projects which have > 50% of their files
hosted externally or unsafely from projects which have < 50% of their files
hosted externally or unsafely. The reasoning behind this is that there are
projects which have only one or two files hosted externally or unsafely, and
the impact of changes in this area is much greater for a project that hosts all
of its files externally or unsafely than for one that has just one or two old
releases hosted in that fashion. For completeness' sake I've also included the
total numbers for each of the split options for easier comparison.
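
One way the > 50% / < 50% split could be computed from the deduplicated files
is sketched below. This is an assumption about the mechanics, not the
repository's code: the names hosting_share and bucket are hypothetical, and the
treatment of a project sitting at exactly 50% (placed in the "<50%" bucket
here) is my own choice for the sketch, since the description does not say.

```python
from collections import Counter


def hosting_share(best, category):
    """Fraction of a project's deduplicated files that fall in `category`.

    `best` maps filename -> category after deduplication.
    """
    counts = Counter(best.values())
    total = sum(counts.values())
    return counts[category] / total if total else 0.0


def bucket(best, category):
    """Return ">50%", "<50%", or None if no file is in `category`.

    Assumption: exactly 50% is put in the "<50%" bucket.
    """
    share = hosting_share(best, category)
    if share == 0:
        return None
    return ">50%" if share > 0.5 else "<50%"


# Hypothetical project with one old release hosted unsafely.
mostly_pypi = {
    "foo-1.2.tar.gz": "pypi",
    "foo-1.1.tar.gz": "pypi",
    "foo-1.0.tar.gz": "pypi",
    "foo-0.1.tar.gz": "unsafe",
}
# Hypothetical project that hosts everything unsafely.
all_unsafe = {"bar-1.0.zip": "unsafe", "bar-1.1.zip": "unsafe"}
```

Here bucket(mostly_pypi, "unsafe") gives "<50%" and bucket(all_unsafe,
"unsafe") gives ">50%".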

Finally it's important to note that defining what exactly is an installable
file is difficult to do. In this script I've tried to take a maximal stance and
err on the side of assuming something is an installable file. Specifically I
do not have any detection of:

* Filenames that do not match the project name (e.g. bar-1.0.tar.gz linked from
  foo's page).

* Whether the file that is being linked to still exists at all (e.g. 404 or
  NXDOMAIN).

* Whether the file that is being linked to unpacks successfully and has a
  setup.py and/or meets the other requirements to be a successfully installed
  package.

* (pip specific) Whether the file has a sane version number that follows
  PEP 440 and/or is not a pre-release.

It is also unlikely that these numbers are accurate for any one particular
installer. In particular, pip does not support .egg files but this detection
does; however, pip and this detection both support .whl files while setuptools
does not.

The rules for detection are essentially:

1. Look at /simple/<foo>/ for that project.
2. Look for any URL with a rel=internal and count it as a PyPI hosted file.
3. Look for any URL that "looks" installable, meaning that the path in the
   URL ends with one of {.tar, .tar.gz, .tar.bz2, .zip, .tgz, .egg, .whl}, and
   which also has a #<hashname>=<hashvalue> fragment, and count it as an
   externally hosted file.
4. Look for any URL that "looks" installable which does not have a hash URL
   fragment and count it as an unsafely hosted file.
5. Look for any URL that does not "look" installable which has a rel of
   {download, homepage} and fetch the page it points to.
6. Look at the HTML from #5 for URLs that look installable, with or without a
   hash fragment, and count each as an unsafely hosted file.
7. Deduplicate the found filenames by ensuring that each filename exists for
   a project only once, with the preference of PyPI > external > unsafe.
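
To make the per-link part of the rules concrete (the rel=internal check, the
"looks installable" suffix check, and the hash fragment check), here is a
minimal sketch using only the standard library. It is not the script from the
repository. The names looks_installable, has_hash_fragment and classify, the
example.com URLs, and the case-sensitive suffix match are all assumptions made
for illustration. It also does not check that the hash name is a real
algorithm, and it leaves out fetching the download/homepage pages.

```python
from urllib.parse import urlsplit

# Suffixes listed in the rules above.
INSTALLABLE_SUFFIXES = (
    ".tar", ".tar.gz", ".tar.bz2", ".zip", ".tgz", ".egg", ".whl",
)


def looks_installable(url):
    """True if the path part of the URL ends with a known archive suffix."""
    return urlsplit(url).path.endswith(INSTALLABLE_SUFFIXES)


def has_hash_fragment(url):
    """True if the URL fragment has the shape <hashname>=<hashvalue>."""
    name, sep, value = urlsplit(url).fragment.partition("=")
    return bool(name and sep and value)


def classify(url, rel=None):
    """Return "pypi", "external", "unsafe", or None for a single link.

    None means the link does not look installable; such links would only
    matter if their rel is download or homepage and the page is crawled.
    """
    if rel == "internal":
        return "pypi"
    if not looks_installable(url):
        return None
    return "external" if has_hash_fragment(url) else "unsafe"


# Hypothetical links as they might appear on a /simple/<foo>/ page.
safe = classify("http://example.com/foo-1.0.zip#sha256=deadbeef")
unsafe = classify("http://example.com/foo-1.0.zip")
page = classify("http://example.com/downloads/", rel="download")
```

In this sketch safe is "external", unsafe is "unsafe", and page is None, since
a downloads page is not itself an installable file.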


* In all places I've used PyPI to mean hosted on PyPI, external to mean hosted
  externally and safely, and unsafely to mean hosted externally and unsafely.

-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA


_______________________________________________
Distutils-SIG maillist  -  [email protected]
https://mail.python.org/mailman/listinfo/distutils-sig