Dan – thanks for the thorough update. I hope you don't mind my reposting this to the analytics list – I bet several people here are eager to know where this is going.
Dario

Begin forwarded message:

> From: Milimetric <no-re...@phabricator.wikimedia.org>
> Subject: [Maniphest] [Commented On] T44259: Make domas' pageviews data available in semi-publicly queryable database format
> Date: May 21, 2015 at 9:31:36 AM PDT
> To: da...@wikimedia.org
> Reply-To: t44259+public+a4a5010c21d15...@phabricator.wikimedia.org
>
> Milimetric added a comment.
>
> I'd love to start a more open discussion about our progress on this. Here's the recent history and where we are:
>
> February 2015: with data flowing into the Hadoop cluster, we defined which raw webrequests were "page views". The research is here <https://meta.wikimedia.org/wiki/Research:Page_view> and the code is here <https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java>
>
> March 2015: we used this pageview definition to create a raw pageview table in Hadoop. This is queryable by Hive, but it's about 3 TB of data per day, so we don't have the resources to expose it publicly.
>
> April 2015: we queried this data internally, but it overloaded our cluster and queries were slow.
>
> May 2015: we're working on an intermediate aggregation that would total up page counts by hour over the dimensions that we think most people care about. We estimate this will cut the size down by a factor of 50.
>
> Progress has been slow, mostly because Event Logging is our main priority and it's been having serious scaling issues. We think we have a good handle on the Event Logging issues after our latest patch, and in a week or so we're going to focus mostly on the Pageview API.
>
> Once this new intermediate aggregation is done, we'll hopefully free up some cluster resources and be in a better position to load up a public API. Right now, we are evaluating two possible data pipelines:
>
> Pipeline 1:
>
> Put daily aggregates into PostgreSQL. We think per-article hourly data would be too big for PostgreSQL.
>
> Pipeline 2:
>
> Query data from the Hive tables directly with Impala. Impala is good for medium to small data, but is much faster than Hive. We might be able to query the hourly data if we use this method.
>
> Common pipeline after we make the choice above:
>
> Mondrian builds OLAP cubes and handles caching, which is very useful with this much data.
>
> Point RESTBase to Mondrian and expose the API publicly at restbase.wikimedia.org. This will be a reliable public API that people can build tools around.
>
> Point Saiku to Mondrian and make a new public website for exploratory analytics. Saiku is an open-source OLAP cube visualization and analysis tool.
>
> Hope that helps. As we get closer to making this API real, we would love your input, participation, questions, etc.
>
>
> TASK DETAIL
> https://phabricator.wikimedia.org/T44259
>
> EMAIL PREFERENCES
> https://phabricator.wikimedia.org/settings/panel/emailpreferences/
>
> To: Milimetric
> Cc: Daniel_Mietchen, PKM, jeremyb, Arjunaraoc, Mr.Z-man, Tbayer, Elitre, scfc, Milimetric, Legoktm, drdee, Nemo_bis, Tnegrin, -jem-, DarTar, jayvdb, Aubrey, Ricordisamoa, MZMcBride, Magnus, MrBlueSky, Multichill
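For anyone curious what the "total up page counts by hour" aggregation in the forwarded message amounts to, here's a minimal sketch in Python. The field names (project, page_title, ts) are hypothetical stand-ins, not the actual Hive schema, and the real job runs as a Hive/Hadoop aggregation over billions of rows rather than an in-memory loop; this just illustrates how collapsing raw requests to one row per (project, page, hour) shrinks the data.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw pageview records; the real data lives in a Hive table
# whose schema is defined by the analytics-refinery project.
raw_pageviews = [
    {"project": "en.wikipedia", "page_title": "Hadoop", "ts": "2015-05-21T09:05:12"},
    {"project": "en.wikipedia", "page_title": "Hadoop", "ts": "2015-05-21T09:48:03"},
    {"project": "de.wikipedia", "page_title": "Hadoop", "ts": "2015-05-21T09:30:44"},
]

def hourly_counts(records):
    """Total up page counts by hour over the dimensions of interest,
    collapsing many raw request rows into one row per (project, page, hour)."""
    counts = Counter()
    for r in records:
        # Truncate the timestamp to the hour; this is what collapses the data.
        hour = datetime.fromisoformat(r["ts"]).strftime("%Y-%m-%dT%H:00")
        counts[(r["project"], r["page_title"], hour)] += 1
    return counts

print(hourly_counts(raw_pageviews))
```

The compression factor comes from exactly this truncation: every request to the same page in the same hour becomes a single counted row.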
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics