Dan – thanks for the thorough update. Hope you don’t mind if I repost this to 
the analytics list; I bet several people here are eager to know where this is 
going.

Dario

Begin forwarded message:
> 
> From: Milimetric <no-re...@phabricator.wikimedia.org>
> Subject: [Maniphest] [Commented On] T44259: Make domas' pageviews data 
> available in semi-publicly queryable database format
> Date: May 21, 2015 at 9:31:36 AM PDT
> To: da...@wikimedia.org
> Reply-To: t44259+public+a4a5010c21d15...@phabricator.wikimedia.org
> 
> Milimetric added a comment.
> 
> I'd love to start a more open discussion about our progress on this. Here's 
> the recent history and where we are:
> 
> February 2015: with data flowing into the Hadoop cluster, we defined which 
> raw webrequests count as "page views". The research is here 
> <https://meta.wikimedia.org/wiki/Research:Page_view> and the code is here 
> <https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java>.
> 
> March 2015: we used this pageview definition to create a raw pageview table 
> in Hadoop. It is queryable with Hive, but at about 3 TB of data per day we 
> don't have the resources to expose it publicly.
> 
> April 2015: we started querying this data internally, but the queries were 
> slow and overloaded our cluster.
> 
> May 2015: we're working on an intermediate aggregation that totals up page 
> counts by hour over the dimensions we think most people care about. We 
> estimate this will cut the data down by a factor of 50, to roughly 60 GB per 
> day instead of 3 TB; a sketch of what this might look like follows below.
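> 
> Here's that sketch: a rough idea of the kind of Hive rollup we have in mind. 
> The table and column names are illustrative only, not a final schema, and 
> I'm assuming the raw table carries a boolean flag set by the pageview 
> definition:
> 
>     -- Hypothetical hourly rollup; the real dimension list is still being decided.
>     INSERT OVERWRITE TABLE pageview_hourly PARTITION (year, month, day, hour)
>     SELECT project, page_title, access_method, agent_type,
>            COUNT(*) AS view_count,
>            year, month, day, hour
>       FROM webrequest
>      WHERE is_pageview = TRUE  -- assumed flag from PageviewDefinition
>      GROUP BY project, page_title, access_method, agent_type,
>               year, month, day, hour;
> 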
> Progress has been slow mostly because Event Logging is our main priority 
> and it has been having serious scaling issues. We think we have a good 
> handle on those issues after our latest patch, and in a week or so we'll be 
> able to focus mostly on the Pageview API.
> 
> Once this new intermediate aggregation is done, we'll hopefully free up some 
> cluster resources and be in a better position to stand up a public API. Right 
> now, we are evaluating two possible data pipelines:
> 
> Pipeline 1:
> 
> Put daily aggregates into PostgreSQL. We think per-article hourly data would 
> be too big for PostgreSQL.
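> 
> For a sense of scale, a minimal sketch of the Pipeline 1 table, with 
> hypothetical names (one row per project/page/day plus whatever dimensions 
> survive aggregation):
> 
>     -- Hypothetical daily aggregate table for Pipeline 1.
>     CREATE TABLE daily_pageviews (
>         project       text   NOT NULL,
>         page_title    text   NOT NULL,
>         access_method text   NOT NULL,
>         view_date     date   NOT NULL,
>         view_count    bigint NOT NULL,
>         PRIMARY KEY (project, page_title, access_method, view_date)
>     );
> 
> Hourly granularity would multiply that row count by 24, which is the main 
> reason we doubt it would fit comfortably.
> 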
> Pipeline 2:
> 
> Query the Hive tables directly with Impala. Impala is geared toward small 
> and medium-sized data, but it's much faster than Hive, so we might be able 
> to serve the hourly data this way.
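> 
> As an example, here's the sort of ad-hoc query we'd want Impala to serve 
> quickly, reusing the hypothetical rollup table from the Hive sketch above:
> 
>     -- Top articles on English Wikipedia for one hour, straight off the rollup.
>     SELECT page_title, SUM(view_count) AS views
>       FROM pageview_hourly
>      WHERE project = 'en.wikipedia'
>        AND year = 2015 AND month = 5 AND day = 20 AND hour = 14
>      GROUP BY page_title
>      ORDER BY views DESC
>      LIMIT 25;
> 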
> Common Pipeline after we make the choice above:
> 
> Mondrian builds the OLAP cubes and handles caching, which is very useful 
> with this much data.
> 
> Point RESTBase to Mondrian and expose the API publicly at 
> restbase.wikimedia.org. This will be a reliable public API that people can 
> build tools around.
> 
> Point Saiku to Mondrian and make a new public website for exploratory 
> analytics. Saiku is an open-source OLAP cube visualization and analysis tool.
> Hope that helps. As we get closer to making this API real, we would love your 
> input, participation, questions, etc.
> 
> 
> TASK DETAIL
> https://phabricator.wikimedia.org/T44259
> 
> EMAIL PREFERENCES
> https://phabricator.wikimedia.org/settings/panel/emailpreferences/
> To: Milimetric
> Cc: Daniel_Mietchen, PKM, jeremyb, Arjunaraoc, Mr.Z-man, Tbayer, Elitre, 
> scfc, Milimetric, Legoktm, drdee, Nemo_bis, Tnegrin, -jem-, DarTar, jayvdb, 
> Aubrey, Ricordisamoa, MZMcBride, Magnus, MrBlueSky, Multichill

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
