On May 11, 2014, at 10:27 PM, Donald Stufft <[email protected]> wrote:
>
> On May 11, 2014, at 7:35 PM, Donald Stufft <[email protected]> wrote:
>
>> However before I go further on that I want to dig more into the impact of
>> these things. It dawned on me earlier today that the way I was categorizing
>> things in my earlier number crunching was making it unreasonably hard to
>> actually divine any sort of meaning out of those numbers. I'm currently in
>> the process of crawling all of PyPI again*, after I have those new numbers
>> I'll have a better sense of things and I think a better forward plan can be
>> made.
>
> I've completed the crawl. I've made the scripts and the data available at
> https://github.com/dstufft/pypi-external-stats.
>
> Here's the general statistics from that:
>
> Hosted on PyPI: 37779
> Hosted Externally (<50%): 18
> Hosted Externally (>50%): 47
> Hosted Externally: 65
> Hosted Unsafely (<50%): 725
> Hosted Unsafely (>50%): 2249
> Hosted Unsafely: 2974
>
> The data more or less follows what the rest of the data has pointed to.
> However I've changed my method of categorizing the projects. Previously I had
> split the projects into "only has files hosted using type X" and "has any
> files hosted using type X". This categorization made it hard to accurately
> determine impact. The problem is that a lot of projects have the same files
> uploaded to PyPI, but also available unsafely. A project like this will not
> be impacted by a change in hosting, however it wasn't possible to determine
> this using the previous data.
>
> The new method splits all of the files for a particular project into a set of
> {PyPI, External, Unsafe}. It splits every file it finds into one of these
> categories. Finally, once it has filled out the categories for all of them, it
> removes duplicate files (via exact filenames). It prefers files hosted on
> PyPI over files hosted externally, and it prefers files hosted externally over
> those hosted unsafely. This leads to projects like the above example
> accurately representing where the *best* source for its files is, not just
> anywhere it can locate that file.
>
> The statistics also split out projects which have > 50% of their files
> hosted externally or unsafely apart from projects which have < 50% of their
> files hosted externally or unsafely. The reasoning behind this is that there
> are projects which have one or two files hosted externally or unsafely, and
> the impact of changes in this area is much greater for a project that hosts
> all of its files externally or unsafely than for one that has just one or two
> old releases hosted in that fashion. For completeness' sake I've also included
> the total numbers for each of the split options for easier comparison.
>
> Finally it's important to note that defining what exactly is an installable
> file is difficult to do. In this script I've tried to take a maximal stance
> and err on the side of assuming something is an installable file.
> Specifically I do not have any detection of:
>
> * Filenames do not match the project name (e.g. bar-1.0.tar.gz linked from
>   foo's page).
>
> * The file that is being linked to still exists at all (e.g. 404 or NXDOMAIN).
>
> * The file that is being linked to unpacks successfully and has a setup.py
>   and/or other requirements to be a successfully installed package.
>
> * (pip specific) The file has a sane version number that follows PEP440 and/or
>   is not a pre-release.
>
> * It is unlikely that these numbers are accurate for any one particular
>   installer. In particular pip does not support .egg's but this detection
>   does; however pip, and this detection, do support .whl's while setuptools
>   does not.
>
> The rules for detection are essentially:
>
> 1. Look at /simple/<foo>/ for that project.
> 2. Look for any URL with a rel=internal and count it as a PyPI hosted file.
> 3. Look for any URL that "looks" installable, this means that the path in the
>    URL ends with {.tar, .tar.gz, .tar.bz2, .zip, .tgz, .egg, .whl} which also
>    has a #<hashname>=<hashvalue> fragment and count it as an externally hosted
>    file.
> 4. Look for any URL that "looks" installable which does not have a hash URL
>    fragment and count it as an unsafely hosted file.
> 5. Look for any URL that does not "look" installable which has a rel of
>    {download, homepage} and process them.
> 6. Look at the HTML from #5 and look for URLs that look installable, with or
>    without a hash fragment and count it as an unsafely hosted file.
> 7. Deduplicate the found filenames by ensuring that each filename exists for
>    a project only once, with the preference of PyPI > external > unsafe.
>
> * In all places I've used PyPI to mean hosted on PyPI, external to mean hosted
>   externally and safely, and unsafely to mean hosted externally and unsafely.

Oh, and Paul had asked before. Here’s the list of externally hosted projects:

https://github.com/dstufft/pypi-external-stats/blob/master/2014-05-11/processed.json#L2-L69

And here’s the list of unsafely hosted projects:

https://github.com/dstufft/pypi-external-stats/blob/master/2014-05-11/processed.json#L37852-L40829

The external1 and unsafe1 represent the <50% set and external2 and unsafe2
represent the >50% set.

-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
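
For readers following along, detection rules 2-4 and the deduplication in rule 7 from the quoted message can be sketched roughly as below. This is only an illustrative sketch, not the code from the pypi-external-stats repository: the function names (classify_link, best_sources, category_share), the category strings, and the (filename, category) input shape are all assumptions made for the example, and rules 5 and 6 (following download/homepage links) are left out.

```python
from urllib.parse import urlparse

# Extensions the quoted message lists as "looking" installable.
INSTALLABLE_EXTENSIONS = (
    ".tar", ".tar.gz", ".tar.bz2", ".zip", ".tgz", ".egg", ".whl",
)

# Lower number = preferred source when a filename is seen more than once
# (PyPI > external > unsafe). The category names are hypothetical labels.
PREFERENCE = {"pypi": 0, "external": 1, "unsafe": 2}


def classify_link(url, rel=None):
    """Classify one link from a /simple/<project>/ page (rules 2-4).

    Returns "pypi", "external", "unsafe", or None if the link does not
    look installable. Rules 5 and 6 are not modelled here.
    """
    if rel == "internal":
        return "pypi"
    parsed = urlparse(url)
    if not parsed.path.endswith(INSTALLABLE_EXTENSIONS):
        return None
    # A "#<hashname>=<hashvalue>" fragment marks the link as safely hosted.
    name, sep, value = parsed.fragment.partition("=")
    return "external" if (name and sep and value) else "unsafe"


def best_sources(files):
    """Deduplicate (filename, category) pairs by exact filename (rule 7),
    keeping the most preferred category for each filename."""
    best = {}
    for filename, category in files:
        current = best.get(filename)
        if current is None or PREFERENCE[category] < PREFERENCE[current]:
            best[filename] = category
    return best


def category_share(best, category):
    """Fraction of a project's deduplicated files that fall in `category`."""
    if not best:
        return 0.0
    return sum(1 for c in best.values() if c == category) / len(best)


if __name__ == "__main__":
    found = [
        ("foo-1.0.tar.gz", "unsafe"),
        ("foo-1.0.tar.gz", "pypi"),   # same file also uploaded to PyPI
        ("foo-0.9.tar.gz", "unsafe"),
    ]
    best = best_sources(found)
    print(best)
    print(category_share(best, "unsafe"))
```

With this shape, a project whose files are all mirrored on PyPI ends up with an unsafe share of zero even if unsafe links to the same filenames exist, which is the situation the message says the earlier categorization could not detect. How a share of exactly 50% should be bucketed is not stated in the message, so the sketch stops at computing the fraction.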
_______________________________________________
Distutils-SIG maillist  -  [email protected]
https://mail.python.org/mailman/listinfo/distutils-sig
