> On Mar 6, 2017, at 1:41 AM, Glyph Lefkowitz <[email protected]> wrote:
> 
> 
>> On Mar 5, 2017, at 8:51 AM, Donald Stufft <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> So, as most folks are aware PyPI has long had a cumulative download count 
>> available in it’s API. This has been on and off again broken for a *long* 
>> time and arguably the numbers in there have been “wrong” even when it was 
>> working because we had no way to reproduce them from scratch (and thus 
>> whenever a bug occurred we’d flat out lose data or add incorrect data with 
>> no way to correct it).
>> 
>> In the meantime, we’ve gotten a much better source of querying for download 
>> information available inside of Google’s BigQuery database [1][2]. Not only 
>> is this able to be recreated “from scratch” so we can, if needed, fix 
>> massive data bugs but it provides MUCH more information than the previous 
>> downloads and a very powerful query language to go along with it.
>> 
>> Unless there is some sort of massive outcry, I plan to deprecate and 
>> ultimately remove the download counts available in the PyPI API, instead 
>> preferring people to start using the BigQuery data instead. This more or 
>> less reflects the current state of things, since it has been on and off 
>> broken (typically broken) for something like a year now.
> 
> I fully realize that if I really wanted this, I could do it myself, and the 
> last thing you need is someone signing you up for more work :).  But, as 
> someone who's been vaguely annoyed that `vanity` doesn't work for a while, I 
> wonder: shouldn't it be easy for someone familiar with both systems to simply 
> implement the existing "download count" API as a legacy / compatibility 
> wrapper around BigQuery?  If that isn't trivial, doesn't that point to 
> something flawed in the way the data is presented in BigQuery?
> 
> That said, I'm fully OK with the answer that even a tiny bit of work is too 
> much, and the limited volunteer effort of PyPI should be spent elsewhere.
> 
> -glyph
> 


It’s not hard at all, it’d just be (standard SQL mode):

    SELECT file.filename, COUNT(*) AS downloads FROM `the-psf.pypi.downloads*` 
WHERE file.project = "twisted" GROUP BY file.filename

You can probably guess how to handle modifications to this query since it’s 
roughly just regular old SQL.

There are a few reasons I don’t want to just do this in PyPI though.

This query will take somewhere between 30 and 60 seconds to complete, so I 
can’t do it inline with the the HTTP request, and I’d need to  have a periodic 
job go through and issue about 100k queries (or a single query with almost a 
million results) and then load that into the database.

More importantly though, we don’t have an unlimited amount of BigQuery on PyPI. 
We get blocks of credits granted periodically and so the faster we use up 
“spend” the more regularly I have to track down my contacts inside of Google 
and get them to re-up the credit. This adds an incentive to to try and reduce 
our spending where we can to limit the frequency and the amount of time I need 
to go between asking for more credits. Due to BigQuery’s billing model you get 
billed based upon how much data your query has to process which means that a 
query that fetches data for all time, will be the most expensive kind of query 
and gets more expensive every day. 

On the flip side, the BigQuery data is publicly query able and the account 
being used to query “pays” for that query and every account gets 1TB of 
querying for free (and additional TBs are $5 per TB). Currently it takes ~215GB 
of data to do a “full” query for twisted (the exact query I listed above) and I 
haven’t fully backfilled all of the data yet (I’m working on it). You can kind 
of extrapolate that out to what it would “cost" to do that same query for all 
100k projects even before I do the backfill (which would drastically raise the 
“cost” of PyPI here).

The smart thing to do with BigQuery is to do date limited querying so that your 
query doesn’t have to load as much data. For instance, adapting the above query 
so that it only queries the last 30 days (still using standard SQL) you would 
do:

    SELECT file.filename, COUNT(*) AS downloads FROM `the-psf.pypi.downloads*` 
WHERE file.project = "twisted" AND _TABLE_SUFFIX BETWEEN FORMAT_DATE("%Y%m%d", 
DATE_ADD(CURRENT_DATE(), INTERVAL -31 day)) AND FORMAT_DATE("%Y%m%d", 
DATE_ADD(CURRENT_DATE(), INTERVAL -1 day)) GROUP BY file.filename 

This touches a much more reasonable 27GB of data.

For reference, we currently “spend” about $50/month on BigQuery so doing like, 
daily updates of this data for everyone would be a drastic increase in the 
amount of BigQuery spending we do.

So the tl;dr is I think it’s a better solution for vanity to talk to the 
BigQuery API itself, ideally limiting itself to a recent timeframe by default, 
and possibly adding a flag to get at the all time data for people who are OK 
with either using vanity less often or are willing to spend a couple bucks if 
they’re querying the full amount of data every day. Where Warehouse is starting 
to query BigQuery, I’m purposely limiting it to only the last N days (typically 
30) so as not to regularly query the entire data set.

—
Donald Stufft



_______________________________________________
Distutils-SIG maillist  -  [email protected]
https://mail.python.org/mailman/listinfo/distutils-sig

Reply via email to