Hey,

One thing I’ve been working on as part of Warehouse is a subproject that I 
call “Linehaul”. This is essentially a little statistics daemon that takes 
specially formatted syslog messages coming off of Fastly and loads them into 
a BigQuery dataset. I’m happy to report that I’ve just finished the 
production deployment of this, and we’re now sending every download event 
that hits Fastly into BigQuery.

First off, I’d like to thank Felipe Hoffa, Will Curran, and Preston Holmes over 
at Google for helping me get credits for this sorted out so that we can 
actually get this going! They’ve been a big help.

So, on to what this means.

Basically, BigQuery gives us the ability to query very large datasets 
relatively quickly (typically < 60s) using something very similar to SQL. 
Unlike typical time series databases, we don’t have to know ahead of time 
what we want to query on; we can just insert rows into a table (our tables 
are sharded by day) and then, using the SQL-like query language, run any 
sort of query we like.

On top of all of this, BigQuery gives me the ability to share the dataset 
publicly with anyone who is logged into a Google account, which means that 
*anyone* can query this data and look for any sort of interesting 
information they can find in it. The cost of any queries you run is billed 
to your own account (though I believe the first 1TB of data you query each 
month is free, and you aren’t charged for queries that error out or return 
cached results).

Anyways, you can query these BigQuery tables whose names match the pattern 
`the-psf:pypi.downloadsYYYYMMDD` and you can see whatever data you want.
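
For instance, a query along these lines should give you the most-downloaded 
projects for a single day’s table (just a sketch — substitute a concrete 
date for YYYYMMDD, and note I’m assuming the `file.project` field name from 
the schema linked later in this mail, so adjust if it differs):

```sql
-- Top 10 most-downloaded projects for one day's table.
-- Substitute a concrete date for YYYYMMDD; the field name file.project
-- is an assumption based on the published schema.
SELECT file.project, COUNT(*) AS downloads
FROM [the-psf:pypi.downloadsYYYYMMDD]
GROUP BY file.project
ORDER BY downloads DESC
LIMIT 10;
```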

We’ve only just started recording data in this spot, so right now there 
isn’t a whole lot of data available (but it’s constantly streaming in). Once 
the 22nd rolls over, I’m going to delete the data we have for the 21st, and 
then start backfilling historical data, starting with the 21st and going 
backwards. You should be able to run queries in a web UI by navigating to 
https://bigquery.cloud.google.com/dataset/the-psf:pypi (you might have to 
accept a Terms of Service agreement).

The table schema looks like this: https://s.caremad.io/lPpTF6rxWZ/ but it 
should also be visible on the BigQuery page. Some example queries you might 
want to run are located at https://gist.github.com/alex/4f100a9592b05e9b4d63 
(but note, those currently use the *old*, not publicly available table name; 
you’ll need to replace [long-stack-762:pypi.downloads] with 
[the-psf:pypi.downloads]). If you want to write your own queries, you should 
be able to find the syntax here: 
https://cloud.google.com/bigquery/query-reference

Anyways, new data should constantly be streaming in, and I should be able to 
backfill data all the way to Jan of 2014 or so. Hopefully this is useful to 
folks, and if you find any interesting queries or numbers, please share them!

—
Donald Stufft



_______________________________________________
Distutils-SIG maillist  -  [email protected]
https://mail.python.org/mailman/listinfo/distutils-sig
