> On Apr 21, 2015, at 11:35 PM, Gregory P. Smith <g...@krypto.org> wrote:
> 
> 
> 
> On Tue, Apr 21, 2015 at 10:55 AM Donald Stufft <don...@stufft.io> wrote:
> Just thought I'd share this since it shows how the tools people use to
> download things from PyPI have changed over the past year. Of particular
> interest to most people will be the final graphs showing what percentage of
> downloads from PyPI are for Python 3.x or 2.x.
> 
> As always it's good to keep in mind, "Lies, Damn Lies, and Statistics". I've
> tried not to bias the results too much, but some bias is unavoidable. Of
> particular note is that a lot of these numbers come from pip, and as of version
> 6.0 of pip, pip will cache downloads by default. This would mean that older
> versions of pip are more likely to "inflate" the downloads than newer versions
> since they don't cache by default. In addition if a project has a file which
> is used for both 2.x and 3.x and they do a ``pip install`` on the 2.x version
> first then it will show up as counted under 2.x but not 3.x due to caching (and
> of course the inverse is true, if they install on 3.x first it won't show up
> on 2.x).
> 
> Here's the link: https://caremad.io/2015/04/a-year-of-pypi-downloads/
> 
> Anyways, I'll have access to the data set for another day or two before I
> shut down the (expensive) server that I have to use to crunch the numbers so if
> there's anything anyone else wants to see before I shut it down, speak up soon.
> 
> Thanks!
> 
> I like your focus on particular packages of note such as django and requests.
> 
> How do CDNs influence these "lies"?  I thought the download counts on PyPI 
> were effectively meaningless due to CDN mirrors fetching and hosting things?
> 
> Do we have user-agent logs from all PyPI package CDN mirrors or just from the 
> master?
> 
> -gps


We took the download counts offline for a while because of the CDN; however, 
within a month or two (now almost two years ago) they enabled logs on our 
account, which let us bring the counts back. So these numbers are from the CDN 
edge and they reflect the “true” traffic. I say “true” because although we have 
logs, logging isn’t considered an essential service, so in times of problems 
logging can be reduced or disabled completely. You can see in the data set that 
some weeks had a massive drop; this was due to missing a day or two of logs.

That being said though, on top of the Fastly-provided CDN, there is also the 
ability to mirror PyPI (which shows up as bandersnatch or others in the logs), 
and if someone is installing from a mirror we don’t see that data at all. On 
top of that, all versions of pip prior to 6.0 had an opt-in download cache, 
which means we wouldn’t see repeat downloads from the people who enabled it, 
and since 6.0 the cache is on by default (opt-out).

As for the mirror network itself, it represents about 20% of the total traffic 
on PyPI. However, we can tell when a client was a mirror, and those downloads 
show up as “Unknown” in the other charts: since it’s a mirror client, we don’t 
know what the final target environment will be.
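
To illustrate the bucketing described above, here is a minimal sketch of how 
user-agent strings from the logs could be sorted into 2.x, 3.x, and “Unknown”. 
It is not the code used to produce the charts. The user-agent layout 
(an interpreter token such as "CPython/2.7.6"), the sample strings, and the 
names `MIRROR_CLIENTS`, `classify`, and `tally` are all assumptions made for 
illustration; only the bandersnatch client name comes from this thread.

```python
import re
from collections import Counter

# Assumption: mirror clients identify themselves at the start of the
# user-agent string. Only "bandersnatch" is named in this thread.
MIRROR_CLIENTS = ("bandersnatch",)

# Assumption: installers report their interpreter as e.g. "CPython/2.7.6".
PYTHON_VERSION = re.compile(r"CPython/(\d+)\.")


def classify(user_agent):
    """Return '2.x', '3.x', ... or 'Unknown' for one user-agent string."""
    # A mirror fetches files on behalf of unknown downstream users, so the
    # final target environment cannot be inferred from its request.
    if user_agent.lower().startswith(MIRROR_CLIENTS):
        return "Unknown"
    match = PYTHON_VERSION.search(user_agent)
    if match is None:
        return "Unknown"
    return match.group(1) + ".x"


def tally(user_agents):
    """Count downloads per bucket for an iterable of user-agent strings."""
    return Counter(classify(ua) for ua in user_agents)


if __name__ == "__main__":
    # Hypothetical log sample, one user-agent per download.
    sample = [
        "pip/1.5.4 CPython/2.7.6 Linux/3.13",
        "pip/6.0.8 CPython/3.4.3 Linux/3.13",
        "bandersnatch/1.8 (CPython 2.7.6)",
        "curl/7.35.0",
    ]
    for bucket, count in sorted(tally(sample).items()):
        print(bucket, count)
```

Note that a tally like this only counts requests that reach the CDN edge, so 
installs served from a mirror or from pip’s local cache never appear in it.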

This might mean that future snapshots will look at API accesses instead, or 
perhaps we try to implement some sort of optional popcon, or maybe we continue 
to look at package installs and just interpret the data with the knowledge 
that these things are at play.

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

