Nuria added a subscriber: JAllemandou.
Nuria added a comment.

I think the notes look good.

@mforns, the main point I missed is that we probably also want to remove geolocation from dataset #1; I see from your summary that you already did.

The remaining item is sanitization of SPARQL queries, and on that I think we have to trust your expertise. As in any system, non-parseable queries should be removed because, as we have seen before, bad queries might contain someone's credit card number (for real). From your notes you are also removing non-parseable queries, so good again. I also think grouping user agents should be useful and not a privacy concern, as long as you only include broad categories. For example, we do not want to include "user agent of the Linux distro only 3 people in the world have access to"; this case is covered by your removing the long tail of browsers with fewer than 10,000 requests.
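
To make the two sanitization steps concrete, here is a rough sketch, not the actual job. The `Request` record, the `is_parseable` callback (standing in for whatever SPARQL parser you use), and the choice to relabel rare user agents as "other" rather than drop the rows are all my assumptions for illustration.

```python
from collections import Counter
from typing import Callable, Iterable, List, NamedTuple


class Request(NamedTuple):
    """Hypothetical record: one SPARQL request and its user agent category."""
    query: str
    user_agent_category: str


def sanitize(
    requests: Iterable[Request],
    is_parseable: Callable[[str], bool],  # assumed hook for a real SPARQL parser
    min_requests: int = 10_000,           # long-tail threshold from the notes
) -> List[Request]:
    # Step 1: drop every query that does not parse; free text that is not
    # valid SPARQL may contain private data.
    parsed = [r for r in requests if is_parseable(r.query)]

    # Step 2: count the surviving requests per user agent category.
    counts = Counter(r.user_agent_category for r in parsed)

    # Step 3: relabel categories below the threshold so that rare user
    # agents cannot single anyone out (assumption: relabel, not drop).
    return [
        r if counts[r.user_agent_category] >= min_requests
        else r._replace(user_agent_category="other")
        for r in parsed
    ]
```

Whether rare user agents are relabeled or their rows removed entirely is a call for whoever owns the dataset; the threshold logic is the same either way.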

Let's go ahead and start working on this. Oozie/Spark will be the way to go. Since you already have tags on webrequest data, you can probably run the job that creates this data once a day, as you are only querying a subset of the webrequest table (cc @JAllemandou to confirm). Here is a Spark example that might help with the general approach: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/ClickstreamBuilder.scala
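
As a rough illustration of the once-a-day idea, the sketch below builds the query a daily job might run for one day of tagged requests. The table name `wmf.webrequest`, the `year`/`month`/`day` partition columns, the `tags` array column, and the tag value `sparql` are assumptions on my part and should be checked against the real schema.

```python
from datetime import date


def daily_webrequest_query(day: date, tag: str = "sparql") -> str:
    """Build a query selecting one day of tagged webrequest rows.

    Table, column, and tag names are assumed for illustration only.
    """
    return (
        "SELECT *\n"
        "FROM wmf.webrequest\n"
        # Restrict to a single day's partitions so the job scans only that day.
        f"WHERE year = {day.year} AND month = {day.month} AND day = {day.day}\n"
        # Keep only the rows carrying the tag, i.e. the subset of interest.
        f"  AND array_contains(tags, '{tag}')"
    )


print(daily_webrequest_query(date(2017, 9, 5)))
```

A scheduler such as Oozie would pass the date in; the resulting string can be handed to Spark SQL or Hive.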


TASK DETAIL
https://phabricator.wikimedia.org/T143819

To: Nuria
Cc: JAllemandou, mpopov, mforns, PokestarFan, Nuria, Lydia_Pintscher, mkroetzsch, leila, debt, thiemowmde, Jonas, Smalyshev, AndrewSu, Aklapper, I9606, Lahi, Gq86, Darkminds3113, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, Avner, Gehel, FloNight, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331, jeremyb