Re: [Analytics] Pageview_hourly dataset. Preventing Identity reconstruction

2015-09-28 Thread Tilman Bayer
Hi Pine, this mailing list is about analytics, and this thread is about the pageview_hourly dataset. There might be better venues for your questions about the Checkuser extension (not an analytics tool per se) and the data that it stores. Perhaps start by first reading the relevant parts of https://w

Re: [Analytics] Wikimedia Dumps Behind? 9/16 and 9/17

2015-09-28 Thread Andrew Otto
Yup! The hourly jobs that generate these are individually tracked. We usually notice if there are problems during a working day. Pinging us here on this list is a good way to bump us into faster action too. I just looked now, and those files do exist. > On Sep 28, 2015, at 00:22, Tony Ho w

Re: [Analytics] Pageview_hourly dataset. Preventing Identity reconstruction

2015-09-28 Thread Pine W
Hi Nuria, OK, so the useragent data for edits is stored in a different database, is heavily sampled when used for research, and will still be accessible for CU use if user_agent_map is removed from the pageview_hourly data, right? On Mon, Sep 28, 2015 at 10:48 AM, Nuria Ruiz wrote: > Pine: > >

Re: [Analytics] Pageview_hourly dataset. Preventing Identity reconstruction

2015-09-28 Thread Nuria Ruiz
Pine: The pageview_hourly dataset on Hive contains pageviews, not edits. The majority of data for edits is not associated with a user-agent, as it is stored in the MediaWiki database. Some of it comes via Eventlogging as experiments are run in, for example, visual editor. This second venue of data is of

Re: [Analytics] Pageview_hourly dataset. Preventing Identity reconstruction

2015-09-28 Thread Pine W
Hi Nuria, Thanks for working on this. Removing user_agent_map would be only for readership data, correct? Would this data still be stored for edits, and if so, for how long? Pine On Sep 28, 2015 7:16 AM, "Nuria Ruiz" wrote: > Hello, > > We have been working on the exercise of reconstructing an

[Analytics] Pageview_hourly dataset. Preventing Identity reconstruction

2015-09-28 Thread Nuria Ruiz
Hello, We have been working on the exercise of reconstructing an identity using the (still private) pageview_hourly dataset ( https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly) TL;DR It is possible (and easy) to do that with the fields the dataset has now, before releasing it pub

[Analytics] Tackling the Analytics/wikistats Gerrit backlog

2015-09-28 Thread Andre Klapper
Hi Analytics, your input on the analytics/wikistats task in https://phabricator.wikimedia.org/T113695 is welcome, to find the best way to move forward in the coming weeks. Who could try to tackle this? Thanks in advance for your help! andre -- Andre Klapper | Wikimedia Bugwrangler http://bl