Nuria
> >I am hoping we can recover the garbled usernames from the raw JSON logs,
> Please keep in mind that we have logs only from the last 90 days.
This is not true: we have server-side data covering the whole lifespan of the
latest ServerSideAccountCreation schema in /a/eventlogging/archive.
I appreciate that we need to enforce the 90-day deletion/pruning for a subset
of the logs, but we do have the raw data for SSAC and I do not expect this to
be a log that we will delete/prune (we will have to drop the userAgent field,
per the guidelines).
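
For the userAgent part, here is a minimal sketch of what the pruning could look like. It assumes each archived line is a single JSON object and that userAgent is a top-level key; both are assumptions, and strip_user_agent is a hypothetical helper, not an existing script.

```python
import json


def strip_user_agent(line):
    """Return the JSON line with the top-level userAgent key removed, if present."""
    record = json.loads(line)
    # Assumed top-level field; pop() with a default is a no-op when it is absent.
    record.pop("userAgent", None)
    # ensure_ascii=True (the default) writes non-ASCII text back as \uXXXX escapes.
    return json.dumps(record, sort_keys=True)
```

Every other field, including the escaped user names, passes through unchanged.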
> Now, we should be able to recover the user_names from the logs with
> character set utf-8. Note that the encoding issue applies not just to user
> names but to any string that can have a non-ASCII value, in all event
> logging schemas, not just this one.
That’s correct, see also my comment on
https://bugzilla.wikimedia.org/show_bug.cgi?id=66123
> See, for example, the following record from the logs:
>
> {"clientValidated": true, "event": {"campaign": "", "displayMobile": true,
> "isSelfMade": true, "returnTo":
> "\u062e\u0627\u0635:\u0645\u0631\u0641\u0648\u0639\u0627\u062a", "token": "",
> "userBuckets": "", "userId": 725222, "userName": "<removed>"}, "recvFrom":
> "mw1087", "revision": 5487345, "schema": "ServerSideAccountCreation",
> "seqId": 53258317, "timestamp": 1389610463, "uuid":
> "013953cf77a2585e983b491f2d4a2388", "webHost": "ar.wikipedia.org", "wiki":
> "arwiki"}
>
>
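For what it is worth, the raw record itself is intact: the \u escapes are plain JSON and decode back to the original characters. A minimal sketch in Python 3 (standard library only), using a record trimmed to a few fields from the example above:

```python
import json

# Raw string, so the \uXXXX escapes reach the JSON parser untouched.
raw = r'{"event": {"returnTo": "\u062e\u0627\u0635:\u0645\u0631\u0641\u0648\u0639\u0627\u062a", "userId": 725222}, "wiki": "arwiki"}'

record = json.loads(raw)
return_to = record["event"]["returnTo"]  # a proper Unicode string

# Encoding to UTF-8 and decoding again is lossless.
utf8_bytes = return_to.encode("utf-8")
round_tripped = utf8_bytes.decode("utf-8")
```

So the loss happens somewhere after parsing, not in the logs.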
> Encoding in python2 is a notorious pain and hard to get right, so fixing
> this will mean not just "restoring" records from logs but also changing
> database connection args, bindings and database types. Not a huge
> deal, but I just want to point out that fixing the issue goes beyond
> repopulating the records.
I appreciate that; however, non-ASCII characters replaced with ? are creating a
large number of artifacts in the data that at some point we’ll have to deal
with. We should figure out whether repopulating historical data is a priority
or whether we can live with that and only fix future data.
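
To make the artifact concrete, here is a minimal sketch of how a string ends up as question marks and how affected rows could be flagged for repopulation. It assumes the lossy step behaves like encoding to a single-byte charset with replacement. latin-1 is an assumption here, and the actual charset on the connection or column may differ. store_lossy and needs_repopulation are hypothetical helpers.

```python
def store_lossy(value, charset="latin-1"):
    """Simulate a write through a charset that cannot represent the text."""
    # errors="replace" substitutes "?" for every character the charset lacks.
    return value.encode(charset, errors="replace").decode(charset)


def needs_repopulation(raw_value, stored_value):
    """True when the stored value no longer matches what the raw log holds."""
    return raw_value != stored_value


# Sample taken from the returnTo prefix in the record quoted above.
original = "\u062e\u0627\u0635"
stored = store_lossy(original)  # "???"
```

ASCII-only values survive unchanged, so only rows where the two differ would need to be restored from the logs.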
Dario
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics