GoranSMilovanovic added a comment.

  @Addshore @Jan_Dittrich Here is a summary of the approach to collecting the 
baseline data, following today's meeting:
  
  **Step 1. Select revisions where the value of a statement changed**
  
  - we will use the `wmf.mediawiki_history` table in the WMF Data Lake;
  - we select candidate revisions by `event_comment`, following @WMDE-leszek's 
approach: see the task description and my experiment in T240466#5739380 
<https://phabricator.wikimedia.org/T240466#5739380>;
  - because this approach flags any change, not specifically a change in the 
value of a statement (thanks @Addshore for this observation), we then look up 
the parent revision ID of each candidate;
  - we fetch the JSON representations of the two revisions (the target one + 
its parent revision) from https://www.wikidata.org/wiki/Special:EntityData/;
  - we diff the JSONs; and
  - we separate revisions where the value of a statement changed from those 
where something else happened.
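
  The fetch-and-diff part of this step could look roughly like the sketch 
below. It assumes that `Special:EntityData` is called with a `revision` query 
parameter, that each entity dict is the inner object found under `entities` in 
the returned JSON, and that "value changed" means the `mainsnak` of a statement 
with the same GUID differs between parent and target. The function names are 
illustrative only, not an agreed implementation.

```python
ENTITY_DATA = "https://www.wikidata.org/wiki/Special:EntityData/"


def entity_data_url(entity_id, rev_id):
    """Build the JSON URL for one revision of an entity (hypothetical helper)."""
    return f"{ENTITY_DATA}{entity_id}.json?revision={rev_id}"


def statement_values(entity):
    """Map statement GUID -> mainsnak for every statement of the entity."""
    values = {}
    # An entity without statements may serialise "claims" as an empty list.
    claims = entity.get("claims") or {}
    for statements in claims.values():
        for statement in statements:
            values[statement["id"]] = statement.get("mainsnak")
    return values


def changed_statement_values(parent, target):
    """Return GUIDs of statements present in both revisions whose mainsnak differs.

    Added or removed statements, and changes to qualifiers, rank or
    references, are deliberately not reported here.
    """
    before = statement_values(parent)
    after = statement_values(target)
    shared = before.keys() & after.keys()
    return sorted(guid for guid in shared if before[guid] != after[guid])
```

  A revision would go into the Step 1 table only if 
`changed_statement_values(parent, target)` is non-empty; everything else 
counts as "something else happened".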
  
  **Step 1.** gives us a table of `rev_ids` where the value of a statement 
changed. Now,
  
  **Step 2.** For each revision obtained in **Step 1.**,
  
  - we look for the subsequent N = 3 revisions of the same entity (N is a 
parameter whose value needs some experimentation),
  - compare the JSON representations of those revisions with the original 
one,
  - to see whether the references of the revised statement changed too.
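
  A minimal sketch of this comparison, assuming the entity JSONs of the 
subsequent revisions have already been fetched and ordered oldest first, and 
that the revised statement is identified by its GUID. The helper names are 
hypothetical.

```python
def references_of(entity, guid):
    """Return the references list of the statement with this GUID.

    Returns None if the statement is not present in this revision.
    """
    claims = entity.get("claims") or {}
    for statements in claims.values():
        for statement in statements:
            if statement["id"] == guid:
                return statement.get("references", [])
    return None


def first_reference_change(original, later_revisions, guid, n=3):
    """Return the 1-based offset of the first of the next n revisions whose
    references for the statement differ from the original, or None.

    A revision in which the statement was removed also counts as a
    difference; callers may want to treat that case separately.
    """
    baseline = references_of(original, guid)
    for offset, revision in enumerate(later_revisions[:n], start=1):
        if references_of(revision, guid) != baseline:
            return offset
    return None
```

  Returning the offset rather than a plain yes/no keeps the information 
needed to experiment with the value of N afterwards.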
  
  The ballpark numbers in this approach:
  
  - In **Step 1.** we collect data until we have recognized **approx. 200** 
tainted references (@Jan_Dittrich);
  - we estimate the probability of obtaining a tainted reference from this 
sample of `wmf.mediawiki_history`;
  - In **Step 2.** we look at the **three (3) revisions** following the one 
that triggered the tainted reference to see if the same user who triggered it 
also revised the reference(s) of the statement; and
  - we estimate the probability of a spontaneous resolution of a tainted 
reference from these data.
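
  Both probabilities are simple proportions, so the estimate could be 
reported with an interval. The sketch below uses a Wilson score interval, 
which is my assumption here and was not discussed in the meeting. The counts 
in the usage note are placeholders, not results.

```python
import math


def proportion_with_wilson_ci(successes, trials, z=1.96):
    """Return (estimate, lower, upper) for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% Wilson score interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)
    ) / denom
    return p, centre - half_width, centre + half_width
```

  For example, `proportion_with_wilson_ci(resolved, 200)` would give the 
estimated probability of spontaneous resolution for the 200 tainted 
references, where `resolved` is the number of them whose references changed 
within the next three revisions.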
  
  The initial experiment should tell us which parameter values to use (the 
sampling, and how many future revisions to inspect for a change in 
references).

TASK DETAIL
  https://phabricator.wikimedia.org/T240466
