Isaac added a comment.
I slightly tweaked the model, but I also experimented with adding a simple square root of the number of existing claims as a feature and found that this alone is essentially all that is needed to nearly match ORES (which is near perfect) at predicting item quality. That said, I think this reflects an issue with the assessment data rather than Wikidata quality really just being a matter of the number of statements. For example, the dataset contains many Wikidata items for disambiguation pages, and they are almost all rated E-class (the lowest) because their only property is their instance-of. I'd argue, though, that this is perfectly acceptable for almost all disambiguation pages, and that these items are nearly complete even with just that one property (you can see the frequency of other properties that occur for these pages, but they're pretty low: https://recoin.toolforge.org/getbyclassid.php?subject=Q4167410&n=200). So while the number of claims is a useful feature for matching human perceptions of quality, I think we'd actually want to leave it out to get closer to the concept of "to what degree is an item missing major information". Under that framing, most disambiguation pages would do just fine, but items about humans, which have many more statements (but also much higher expectations), wouldn't do as well.

Notebook: https://public.paws.wmcloud.org/User:Isaac_(WMF)/Annotation%20Gap/v2_eval_wikidata_quality_model.ipynb

Quick summary:
- 38.7% correct (62.6% within 1 class) using features ['label_s']
- 56.7% correct (77.0% within 1 class) using features ['claim_s']
- 44.8% correct (72.7% within 1 class) using features ['ref_s']
- 77.3% correct (98.1% within 1 class) using features ['sqrt_num_claims']
- 55.0% correct (75.3% within 1 class) using features ['label_s', 'claim_s']
- 50.2% correct (74.5% within 1 class) using features ['label_s', 'ref_s']
- 76.5% correct (98.4% within 1 class) using features ['label_s', 'sqrt_num_claims']
- 54.2% correct (76.6% within 1 class) using features ['label_s', 'claim_s', 'ref_s']
- 75.1% correct (98.3% within 1 class) using features ['label_s', 'claim_s', 'sqrt_num_claims']
- 79.4% correct (97.7% within 1 class) using features ['label_s', 'ref_s', 'sqrt_num_claims']
- 55.0% correct (78.4% within 1 class) using features ['claim_s', 'ref_s']
- 75.3% correct (98.0% within 1 class) using features ['claim_s', 'sqrt_num_claims']
- 78.8% correct (98.3% within 1 class) using features ['claim_s', 'ref_s', 'sqrt_num_claims']
- 79.4% correct (98.7% within 1 class) using features ['ref_s', 'sqrt_num_claims']
- 78.3% correct (97.9% within 1 class) using features ['label_s', 'claim_s', 'ref_s', 'sqrt_num_claims']

For comparison, ORES is at 87.1% correct and 98.3% within 1 class (keeping in mind it's trained on 2x more data, including the items I'm evaluating on here).

TASK DETAIL
https://phabricator.wikimedia.org/T321224
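For anyone who wants a rough sense of the evaluation loop behind the numbers above, here is a minimal sketch. The feature names ('label_s', 'claim_s', 'ref_s', 'sqrt_num_claims') come from the summary; the dataframe layout, the integer encoding of quality classes, and the choice of a generic sklearn classifier are assumptions for illustration and not necessarily what the notebook actually does.

```
# Minimal sketch, not the notebook's actual code. Assumes a dataframe with one
# row per labeled item, feature columns as named in the summary above, and a
# "quality_class" column integer-coded E=0 ... A=4.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def evaluate(df: pd.DataFrame, features: list) -> tuple:
    """Fit a simple classifier on the given features and report
    exact accuracy and accuracy within one quality class."""
    X = df[features].to_numpy()
    y = df["quality_class"].to_numpy()
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = GradientBoostingClassifier().fit(X_tr, y_tr)
    pred = model.predict(X_te)
    exact = np.mean(pred == y_te)
    within_1 = np.mean(np.abs(pred - y_te) <= 1)
    return exact, within_1

# sqrt_num_claims is just the square root of the raw claim count:
# df["sqrt_num_claims"] = np.sqrt(df["num_claims"])
# for feats in [["label_s"], ["sqrt_num_claims"], ["ref_s", "sqrt_num_claims"]]:
#     exact, within_1 = evaluate(df, feats)
#     print(f"{exact:.1%} correct ({within_1:.1%} within 1 class) using {feats}")
```

The point of reporting "within 1 class" alongside exact accuracy is that the quality classes are ordinal (E through A), so being off by a single class is a much smaller error than the exact-match metric alone suggests.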