GoranSMilovanovic added a comment.
@Addshore @Jan_Dittrich Here is a summary of the approach to collecting the baseline data, following today's meeting:

**Step 1. Identify revisions where the value of a statement changed**

- we will use the `wmf.mediawiki_history` table in the WMF Data Lake;
- we select candidate revisions by `event_comment`, following @WMDE-leszek's approach: see the task description and my experiment in T240466#5739380 <https://phabricator.wikimedia.org/T240466#5739380>;
- because this filter flags any change to a statement, and not specifically a change in its value (thanks @Addshore for this observation), we then look up the parent revision ID of each candidate revision;
- we fetch the JSON representations of the two revisions (the target one + its parent revision) from https://www.wikidata.org/wiki/Special:EntityData/,
- diff the JSONs, and
- separate the revisions where the value of a statement changed from those where something else happened.

From **Step 1.** we have a table of `rev_ids` where the value of a statement changed.

**Step 2.** For each revision obtained in **Step 1.**,

- we look at the subsequent N = 3 revisions of the same entity (N is a parameter whose value needs some experimentation),
- compare the JSON representations of the subsequent revisions with the original one
- to see whether the references of the revised statement have changed too.

The ballpark numbers in this approach:

- In **Step 1.** we collect data until we have recognized **approx. 200** tainted references @Jan_Dittrich ;
- we estimate the probability of obtaining a tainted reference from this sample of `wmf.mediawiki_history`;
- In **Step 2.** we look at the **three (3) revisions** following the one that triggered the tainted reference to see
- if the same user who triggered the tainted reference also revised the reference(s) of the statement, and
- we estimate the probability of a spontaneous resolution of a tainted reference from these data.
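The comparison logic in the two steps could be sketched roughly as follows. This is only an illustration under some assumptions, not the code used for the analysis: it works on entity JSON that has already been fetched and parsed into Python dicts, it relies on the standard Wikidata entity layout (`claims`, statement `id`, `mainsnak.datavalue`, `references[].hash`), and it reads "tainted" as "the value changed while a non-empty set of references stayed the same". The function names are hypothetical.

```python
from typing import Dict, List, Set


def statements_by_id(entity: dict) -> Dict[str, dict]:
    """Index every statement of an entity by its statement GUID."""
    return {
        statement["id"]: statement
        for statements in entity.get("claims", {}).values()
        for statement in statements
    }


def reference_hashes(statement: dict) -> Set[str]:
    """Set of reference hashes attached to a statement (empty if none)."""
    return {ref["hash"] for ref in statement.get("references", [])}


def value_changed_statements(parent: dict, target: dict) -> List[str]:
    """Step 1: GUIDs of statements whose main value differs between the
    parent revision and the target revision."""
    before, after = statements_by_id(parent), statements_by_id(target)
    return sorted(
        guid
        for guid in before.keys() & after.keys()
        if before[guid]["mainsnak"].get("datavalue")
        != after[guid]["mainsnak"].get("datavalue")
    )


def tainted_statements(parent: dict, target: dict) -> List[str]:
    """Assumed definition: the value changed, the statement has references,
    and those references are identical before and after the edit."""
    before, after = statements_by_id(parent), statements_by_id(target)
    return [
        guid
        for guid in value_changed_statements(parent, target)
        if reference_hashes(after[guid])
        and reference_hashes(before[guid]) == reference_hashes(after[guid])
    ]


def resolved_within(target: dict, following: List[dict], guid: str, n: int = 3) -> bool:
    """Step 2: True if any of the next n revisions changes the references
    of the statement. A revision that removes the statement entirely is
    not counted as a resolution in this sketch."""
    original = reference_hashes(statements_by_id(target)[guid])
    for revision in following[:n]:
        statement = statements_by_id(revision).get(guid)
        if statement is not None and reference_hashes(statement) != original:
            return True
    return False
```

With these pieces, the probability of a tainted reference would be estimated as the share of sampled revisions for which `tainted_statements` is non-empty, and the probability of spontaneous resolution as the share of tainted statements for which `resolved_within` returns `True`. Checking whether the same user made the follow-up edit would additionally need the revision metadata, which this sketch does not model.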
The initial experiment should help us learn which parameter values to use: the sampling, and how many future revisions to inspect for a change in references.

TASK DETAIL
  https://phabricator.wikimedia.org/T240466

To: GoranSMilovanovic
Cc: Aklapper, Addshore, Jan_Dittrich, hoo, rosalieper, noarave, Tarrow, Lydia_Pintscher, GoranSMilovanovic, WMDE-leszek, Sarai-WMDE, darthmon_wmde, Nandana, Lahi, Gq86, QZanden, LawExplorer, _jensen, Scott_WUaS, Wikidata-bugs, aude, Mbch331