Hi! I'm currently attending Wikimania (I have a session on Friday at 4:30pm).
Tilman Bayer suggested that I share this tool and these techniques here, so I am following his advice :).

I've been using Google BigQuery for a while to analyze Wikipedia's publicly available data. Its main advantages:

- It's unbelievably fast (try it - operations that you might expect to run in minutes or hours run in seconds).
- It's secure, but you can also instantly share data (no need to download and set up locally before being able to analyze - BigQuery is always on).
- Everyone can use BigQuery with a free quota of 1TB of monthly analysis.

Interesting links:

- Quick getting started: https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_50_billion_wikipedia_pageviews_in_5/
- Analyzing the gender gap in Wikipedia (Freebase, and joining it with pageviews): https://www.youtube.com/watch?v=lV5vk3higvA
- Massive Geo-IP geolocation from the changelog: https://www.reddit.com/r/bigquery/comments/1zh7ty/massive_geoip_geolocation_with_google_bigquery/
- Just for fun, the most popular numbers: https://www.reddit.com/r/bigquery/comments/2p0vz4/query_of_the_day_the_most_popular_numbers_in/
- Top Wikipedia entries most edited by members of the U.S. Congress: http://minimaxir.com/2014/07/caucus-needed/
- Music recommendations: http://apassant.net/2014/07/11/music-recommendations-300m-data-points-sql/

I have a couple of other interesting examples I haven't written about, but the invitation here is for you to try your own :).

My main challenge today: how to get more publicly available data into BigQuery. Let's work together :).

I'm sitting near the big data analytics team today at the Wikimedia hackathon, and as mentioned above, I'll do a session on this topic on Friday at 4:30pm.

Thanks!