Nuria

> > I am hoping we can recover the garbled usernames from the raw JSON logs,
> Please keep in mind that we have logs only from the last 90 days.

This is not true: we have server-side data covering the whole lifespan of the
latest ServerSideAccountCreation in /a/eventlogging/archive.
I appreciate that we need to enforce the 90-day deletion/pruning for a subset
of the logs, but we do have the raw data for SSAC and I do not expect this to
be a log that we will delete/prune (although we will have to drop the
userAgent field, per the guidelines).

> Now, we should be able to recover from the logs the user_names with character
> set utf-8. Note that the encoding issue applies not just to user names but
> to any string that can hold a non-ASCII value, in all event logging schemas,
> not just this one.

That’s correct, see also my comment on 
https://bugzilla.wikimedia.org/show_bug.cgi?id=66123

> See, for example, the following record from the logs:
> 
> {"clientValidated": true, "event": {"campaign": "", "displayMobile": true, 
> "isSelfMade": true, "returnTo": 
> "\u062e\u0627\u0635:\u0645\u0631\u0641\u0648\u0639\u0627\u062a", "token": "", 
> "userBuckets": "", "userId": 725222, "userName": "<removed>"}, "recvFrom": 
> "mw1087", "revision": 5487345, "schema": "ServerSideAccountCreation", 
> "seqId": 53258317, "timestamp": 1389610463, "uuid": 
> "013953cf77a2585e983b491f2d4a2388", "webHost": "ar.wikipedia.org", "wiki": 
> "arwiki"}
> 
> 
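To illustrate why the raw logs are enough for recovery, here is a rough sketch (Python 3, standard library only). The line is a trimmed copy of the record quoted above, and the variable names are mine, not anything from the EventLogging code. The \uXXXX escapes in the JSON carry the full code points, so parsing gives back the original string, which can then be stored as UTF-8:

```python
import json

# Trimmed copy of the quoted log line; only the fields needed here are kept.
# A raw string keeps the \uXXXX sequences as literal JSON escapes.
raw_line = (
    r'{"event": {"returnTo": '
    r'"\u062e\u0627\u0635:\u0645\u0631\u0641\u0648\u0639\u0627\u062a", '
    r'"userId": 725222}, '
    r'"schema": "ServerSideAccountCreation", "wiki": "arwiki"}'
)

# json.loads turns the escapes back into real Unicode characters.
record = json.loads(raw_line)
return_to = record["event"]["returnTo"]

# Encoding explicitly as UTF-8 is lossless for these characters.
utf8_bytes = return_to.encode("utf-8")

print(return_to)
print(len(return_to), len(utf8_bytes))
```

The same should hold for userName and any other string field, assuming the logs were written with the escapes intact as in this record.
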
> Encoding in Python 2 is a notorious pain and hard to get right, so fixing
> this will mean not just "restoring" records from the logs but also changing
> database connection args, bindings and database types. Not a huge deal, but
> I just want to point out that fixing the issue goes beyond repopulating the
> records.

I appreciate that; however, non-ASCII characters replaced with "?" are creating
a large number of artifacts in the data that at some point we’ll have to deal
with. We should figure out whether repopulating historical data is a priority
or whether we can live with that and only fix future data.
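
To make the artifact concrete, a small sketch in Python 3. I am assuming the replacement behaves like encoding to a non-Unicode charset with errors="replace"; I have not checked where exactly in the pipeline it happens, so treat this as an illustration rather than a diagnosis:

```python
# The returnTo value from the record quoted earlier, written with escapes.
value = "\u062e\u0627\u0635:\u0645\u0631\u0641\u0648\u0639\u0627\u062a"

# Lossy path: every character outside ASCII collapses to "?".
lossy = value.encode("ascii", errors="replace").decode("ascii")

# Lossless path: UTF-8 round-trips the string unchanged.
round_trip = value.encode("utf-8").decode("utf-8")

print(lossy)                 # only the ":" survives
print(round_trip == value)
```

Once a value has gone through the lossy path there is nothing left to decode, which is why stored rows can only be repaired by going back to the raw logs.
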

Dario
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
