Re: [Analytics] Data quality issues with account creation log

2014-06-13 Thread Dario Taraborelli
+1 On Jun 13, 2014, at 6:15 AM, Aaron Halfaker ahalfa...@wikimedia.org wrote: As a data consumer, I'd prefer if columns matched between EventLogging and production DBs as closely as possible, so VARBINARY sounds like a win to me. On Fri, Jun 13, 2014 at 5:39 AM, Nuria Ruiz

Re: [Analytics] Data quality issues with account creation log

2014-06-09 Thread Sean Pringle
On Fri, Jun 6, 2014 at 6:30 PM, Nuria Ruiz nu...@wikimedia.org wrote: Encoding in Python 2 is a notorious pain and hard to get right, so fixing this will mean not just restoring records from logs but also changing database connection args, bindings and database types. Just to
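For readers unfamiliar with this kind of encoding problem, here is a minimal sketch of one common way usernames end up garbled: UTF-8 bytes interpreted as Latin-1 somewhere between the application and the database. The thread does not say which mismatch actually occurred, so this failure mode is an assumption, and `garble` and `repair` are hypothetical helper names. The sketch uses Python 3 rather than the Python 2 discussed in the thread.

```python
def garble(name):
    """Simulate the assumed failure: UTF-8 bytes read back as Latin-1."""
    return name.encode("utf-8").decode("latin-1")


def repair(garbled):
    """Try to undo the assumed mis-decoding.

    If the text does not round-trip (it was never garbled this way),
    return it unchanged instead of raising.
    """
    try:
        return garbled.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return garbled


original = "Jos\u00e9"        # "José"
broken = garble(original)      # two Latin-1 characters replace the "é"
restored = repair(broken)      # back to the original string
```

A repair of this kind only works if the stored bytes survived intact. If the column or connection replaced or truncated bytes, the raw logs would be the only reliable source.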

Re: [Analytics] Data quality issues with account creation log

2014-06-09 Thread Sean Pringle
On Tue, Jun 10, 2014 at 1:04 AM, Nuria Ruiz nu...@wikimedia.org wrote: Just to narrow this down a little further from the DB server-side: the eventlogging tables do use utf-8, so the fix probably doesn't require laborious schema changes (if that's what you meant by changing database types).

Re: [Analytics] Data quality issues with account creation log

2014-06-09 Thread Ori Livneh
On Mon, Jun 9, 2014 at 8:00 PM, Sean Pringle sprin...@wikimedia.org wrote: On Tue, Jun 10, 2014 at 1:04 AM, Nuria Ruiz nu...@wikimedia.org wrote: Just to narrow this down a little further from the DB server-side: the eventlogging tables do use utf-8, so the fix probably doesn't require

Re: [Analytics] Data quality issues with account creation log

2014-06-09 Thread Sean Pringle
On Tue, Jun 10, 2014 at 1:12 PM, Ori Livneh o...@wikimedia.org wrote: ...and back to utf8 as default charset The version of MySQLdb that is packaged for Precise does not know about utf8mb4. I (inexcusably) tested against the dev branch of MySQLdb. Bet that was a fun day :)
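Some background on why the utf8 versus utf8mb4 distinction matters: MySQL's `utf8` charset stores at most three bytes per character, so characters outside the Basic Multilingual Plane need `utf8mb4`. The sketch below is illustrative only and not part of EventLogging. `needs_utf8mb4` is a hypothetical helper name, and Python 3 is assumed.

```python
def needs_utf8mb4(text):
    """Return True if any character takes four bytes in UTF-8.

    Such characters (code points above U+FFFF) do not fit in
    MySQL's three-byte `utf8` charset.
    """
    return any(len(ch.encode("utf-8")) == 4 for ch in text)


bmp_only = needs_utf8mb4("Jos\u00e9")           # accented Latin letter, 2 bytes
astral = needs_utf8mb4("user\U0001F600")        # emoji, 4 bytes
```

A check like this could flag values that a `utf8` column would not store faithfully while the client library lacks utf8mb4 support.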

Re: [Analytics] Data quality issues with account creation log

2014-06-06 Thread Nuria Ruiz
If someone could document the reasons why the userName is needed on this schema, it would be great. They can be documented on the schema talk page: http://meta.wikimedia.org/wiki/Schema_talk:ServerSideAccountCreation When I looked at this issue early on it was not at all obvious to me why - if you

Re: [Analytics] Data quality issues with account creation log

2014-06-06 Thread Dario Taraborelli
Nuria, I am hoping we can recover the garbled usernames from the raw JSON logs, Please keep in mind that we have logs only from the last 90 days. This is not true: we have server-side data covering the whole lifespan of the latest ServerSideAccountCreation in /a/eventlogging/archive. I

Re: [Analytics] Data quality issues with account creation log

2014-06-05 Thread Aaron Halfaker
Regretfully, looking up a user in Centralauth requires the use of a username. Then again, you'd need to join with a user table (with user_id) anyway since users can be renamed after they create their account and that name change won't be reflected in ServerSideAccountCreation. On Thu, Jun 5,

Re: [Analytics] Data quality issues with account creation log

2014-06-05 Thread Dario Taraborelli
I am hoping we can recover the garbled usernames from the raw JSON logs, but you’re correct about username changes. For project-level counts, though, they should not dramatically affect the accuracy of new registration numbers. On Jun 5, 2014, at 3:51 PM, Aaron Halfaker ahalfa...@wikimedia.org

Re: [Analytics] Data quality issues with account creation log

2014-06-05 Thread Dario Taraborelli
and yes, I wish we had a gu_id included in ServerSideAccountCreation (assuming MediaWiki knows it by the time the event is generated) On Jun 5, 2014, at 4:39 PM, Dario Taraborelli da...@wikimedia.org wrote: I am hoping we can recover the garbled usernames from the raw JSON logs, but you’re