Re: [Wiki-research-l] analyzing 50 billion Wikipedia pageviews in 5 seconds (w/BigQuery)

2015-07-16 Thread Tilman Bayer
Thanks Felipe! Yes, I think this is a really interesting tool to explore.

Another quick example:
List articles attract roughly 2-3% of pageviews on the English
Wikipedia (about 2.8% in the hour sampled below):

SELECT SUM(requests)
FROM [fh-bigquery:wikipedia.pagecounts_20150715_14]
WHERE LEFT(title, 8) = 'List_of_'
AND language = 'en'

244091

SELECT SUM(requests)
FROM [fh-bigquery:wikipedia.pagecounts_20150715_14]
WHERE language = 'en'

8870277

(Caveats: this covers a single hour this Wednesday, uses the old
pageview definition, i.e. it does not exclude spiders and bots, and
relies on the article title prefix rather than on categories.)
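
The same figure can also be computed in one pass. This is a minimal
sketch in the same legacy SQL dialect against the same hourly table;
only the IF() conditional is new relative to the queries above:

-- share of English Wikipedia requests going to 'List_of_' pages (one hour)
SELECT
  SUM(IF(LEFT(title, 8) = 'List_of_', requests, 0)) / SUM(requests)
    AS list_share
FROM [fh-bigquery:wikipedia.pagecounts_20150715_14]
WHERE language = 'en'

On the numbers above this gives 244091 / 8870277 ≈ 0.028, i.e. about 2.8%.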

I understand Felipe has already been talking to the WMF Analytics
team, who are currently making major progress on
https://phabricator.wikimedia.org/T44259.

-- 
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB



[Wiki-research-l] analyzing 50 billion Wikipedia pageviews in 5 seconds (w/BigQuery)

2015-07-16 Thread Felipe Hoffa
Hi! I'm currently attending Wikimania (I have a session on Friday at
4:30pm).

Tilman Bayer suggested sharing this tool and these techniques here, so I am
following his advice :).

I've been using Google BigQuery for a while to analyze Wikipedia's publicly
available data. Its main advantages:

- It's unbelievably fast (try it - operations that you might expect to take
minutes or hours finish in seconds; see the example query below).
- It's secure, but you can also instantly share data (no need to download
data and set it up locally before you can analyze it - BigQuery is always on).
- Everyone can use BigQuery with a free quota of 1TB of analysis per month.
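
To get a feel for the speed, here is a minimal example you can paste
into the BigQuery web console. It's a sketch in the legacy SQL dialect,
reusing the hourly pagecounts table queried earlier in this thread (the
columns title, requests and language come from that table):

-- top 10 most-requested English Wikipedia pages in one hour
SELECT title, SUM(requests) AS requests
FROM [fh-bigquery:wikipedia.pagecounts_20150715_14]
WHERE language = 'en'
GROUP BY title
ORDER BY requests DESC
LIMIT 10

BigQuery charges by data scanned, so a single-hour table like this one
uses only a small slice of the free monthly terabyte.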


Interesting links:

- Quick getting started:
https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_50_billion_wikipedia_pageviews_in_5/
- Analyzing the gender gap in Wikipedia (Freebase, joined with pageviews;
a join sketch follows this list): https://www.youtube.com/watch?v=lV5vk3higvA
- Massive GeoIP geolocation from the changelog:
https://www.reddit.com/r/bigquery/comments/1zh7ty/massive_geoip_geolocation_with_google_bigquery/
- Just for fun, the most popular numbers:
https://www.reddit.com/r/bigquery/comments/2p0vz4/query_of_the_day_the_most_popular_numbers_in/
- Top Wikipedia Entries Which Are Most-Edited by Members of the U.S.
Congress http://minimaxir.com/2014/07/caucus-needed/
- Music recommendations:
http://apassant.net/2014/07/11/music-recommendations-300m-data-points-sql/
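
The gender-gap talk above works by joining pageviews against a second
table. As a rough illustration of that join pattern (not the talk's
actual query), here is a sketch where [mydataset.article_genders] is a
hypothetical table mapping article titles to genders:

-- [mydataset.article_genders] is hypothetical (columns: title, gender);
-- the pagecounts table is the real one used earlier in this thread.
SELECT g.gender, SUM(p.requests) AS requests
FROM [fh-bigquery:wikipedia.pagecounts_20150715_14] AS p
JOIN [mydataset.article_genders] AS g ON p.title = g.title
WHERE p.language = 'en'
GROUP BY g.gender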


I have a couple of other interesting examples I haven't written about, but
the invitation here is for you to try your own :).

My main challenge today: how to get more publicly available data into
BigQuery. Let's work together :). I'm sitting with the big data analytics
team today at the Wikimedia hackathon - and as I said earlier, I'll give a
session on this topic on Friday at 4:30pm.

Thanks!
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l