[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model
Isaac added a comment. I'm going to be out the next several weeks, so FYI you likely won't hear updates until mid-September on this. Thanks for these additional details though!

> Now there are several Properties that can represent such relations. The main ones we should probably focus on are instance of, subclass of and part of as explained on https://www.wikidata.org/wiki/Help:Basic_membership_properties.

Everything is currently based on instance-of values, but it looks like I need to also allow `subclass of` and `part of`. The tricky thing there is that I assume most `subclass of` and `part of` statements are pretty rare -- e.g., there are only so many items with `subclass of` for `physicist` -- so it's hard to learn the expectations for `subclass of physicist`, and my best bet is probably a more generic set of expected properties for any item that has a `subclass of` property regardless of its value (and the same for `part of`). I'll have to see how consistent these properties are, though, because if they're highly specific to the value of `subclass of`, the model won't be able to do anything useful with them. I'm hopeful that this small change will fix these various outliers and get us to a place where we can reasonably verify that the model is doing what we expect.

TASK DETAIL
https://phabricator.wikimedia.org/T321224

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Isaac
Cc: Michael, Lydia_Pintscher, diego, Miriam, Isaac, Danny_Benjafield_WMDE, KinneretG, Astuthiodit_1, YLiou_WMF, karapayneWMDE, Invadibot, Ywats0ns, maantietaja, ItamarWMDE, Akuckartz, Nandana, Abdeaitali, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, Avner, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Capt_Swing, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
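The fallback described above can be sketched roughly as follows. This is a hypothetical sketch, not the actual model code: the function name, the `instanceof_profiles`/`generic_profiles` data structures, and the example weights are all stand-ins for the real expected-property data.

```python
P31, P279, P361 = "P31", "P279", "P361"  # instance of, subclass of, part of

def expected_properties(item_claims, instanceof_profiles, generic_profiles):
    """Pick the expected-property profile for an item.

    item_claims: property ID -> list of values
    instanceof_profiles: instance-of value (QID) -> expected-property weights
    generic_profiles: P279/P361 -> a generic expected-property profile used
        regardless of the statement's value (the proposed fallback)
    """
    # prefer value-specific expectations from instance-of
    for value in item_claims.get(P31, []):
        if value in instanceof_profiles:
            return instanceof_profiles[value]
    # otherwise fall back to a generic profile for subclass-of / part-of items
    for prop in (P279, P361):
        if prop in item_claims and prop in generic_profiles:
            return generic_profiles[prop]
    return {}  # no basis for expectations
```

Whether the generic profiles carry any signal is exactly the open question above: if expected properties vary heavily with the value of `subclass of`, a single pooled profile won't help.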
Isaac added a comment.

> That's quite an interesting table! Would it be possible to get the actual Item IDs for the last two rows? It could be instructive to know which Items the model thinks are very incomplete but have excellent quality :)

@Michael thanks for the questions! Some context: I think the completeness model is better suited for evaluating items (it's much more nuanced than the quality model, which largely just takes into consideration the number of statements an item has). This analysis will hopefully do two things: 1) help us find some places where the completeness model doesn't do great so we can tweak it, and 2) build a sample of items to give to Wikidata experts to ensure that the completeness model is in fact capturing their expectations better than the quality model.

Looking into the extreme ends of the data, most of the 241 items that were low completeness and high quality are items that have many statements but lack an instance-of property, which I use for categorizing an item to determine which properties it should have. I assume that it is okay for items to lack an instance-of if they're subclasses of another item? Perhaps I can make a special case for items that lack an instance-of but do have a subclass-of property? Though in checking a few examples, I don't know if there is a particularly consistent set of expectations around which properties should exist for these items. Example: Q22698 (park) <https://www.wikidata.org/wiki/Q22698>. The other set are items like Q7473516 (Tokyo) <https://www.wikidata.org/wiki/Q7473516>, which have a bunch of statements but are lacking references, and also have a bunch of instance-ofs so are missing some expected statements too.

Data: instead of the labels, I'm outputting the raw scores, which range from 0 (very bad) to 1 (very good), for both the individual features and the overall completeness/quality scores. Number of statements is what it sounds like.
| item                                   | claims_score | refs_score  | labels_score | num_statements | completeness_score  | quality_score      |
|----------------------------------------|--------------|-------------|--------------|----------------|---------------------|--------------------|
| https://www.wikidata.org/wiki/Q907112  | 0.37053642   | 0.0948047   | 0.65625      | 102.0          | 0.334656685590744   | 1.0                |
| https://www.wikidata.org/wiki/Q34754   | 0.36810225   | 0.15446919  | 0.60655737   | 80.0           | 0.3423793315887451  | 0.9698982238769531 |
| https://www.wikidata.org/wiki/Q23427   | 0.36427906   | 0.20430791  | 0.597        | 75.0           | 0.35162511467933655 | 0.9493570327758789 |
| https://www.wikidata.org/wiki/Q170174  | 0.3615961    | 0.16795586  | 0.653        | 62.0           | 0.34746548533439636 | 0.8789964914321899 |
| https://www.wikidata.org/wiki/Q43287   | 0.3493883    | 0.16088052  | 0.639        | 70.0           | 0.33629995584487915 | 0.9208924174308777 |
| https://www.wikidata.org/wiki/Q7473516 | 0.34005976   | 0.13500817  | 0.734375     | 59.0           | 0.33547067642211914 | 0.8670973181724548 |
| https://www.wikidata.org/wiki/Q28179   | 0.33763435   | 0.12792718  | 0.6805556    | 58.0           | 0.325609028339386   | 0.8516228199005127 |
| https://www.wikidata.org/wiki/Q12280   | 0.33676738   | 0.21568704  | 0.6069182    | 85.0           | 0.3385910391807556  | 1.0                |
| https://www.wikidata.org/wiki/Q40362   | 0.33630428   | 0.13121563  | 0.60952383   | 70.0           | 0.31699275970458984 | 0.9111775159835815 |
| https://www.wikidata.org/wiki/Q81931   | 0.33004636   | 0.26839557  | 0.6395349    | 85.0           | 0.3518647849559784  | 1.0                |
| https://www.wikidata.org/wiki/Q7318    | 0.32431      | 0.124200575 | 0.6560284    | 61.0           | 0.31338077783584595 | 0.8646677136421204 |
| https://www.wikidata.org/wiki/Q133356  | 0.32018384   | 0.105077215 | 0.6369048    | 62.0           | 0.30359283089637756 | 0.8645642399787903 |
| https://www.wikidata.org/wiki/Q39473   | 0.30660433   | 0.13555892  | 0.6276       | 66.0           | 0.30183497071266174 | 0.890287458896637  |
| https://www.wikidata.org/wiki/Q4948    | 0.30622554   | 0.16706112  | 0.59638554   | 64.0           | 0.3058503270149231  | 0.878718376159668  |
| https://www.wikidata.org/wiki/Q35666   | 0.30606884   | 0.30240446  | 0.56299216   | 76.0           | 0.33634650707244873 | 0.9609817862510681 |
| https://www.wikidata.org/wiki/Q5684    | 0.2750341    | 0.24169385  | 0.6298077    | 70.0           | 0.3096027374267578  | 0.9274186491966248 |
| https://www.wikidata.org/wiki/Q170468  | 0.27139774   | 0.14692228  | 0.6081081    | 67.0           | 0.2804390490055084  | 0.8926790356636047 |
| https://www.wikidata.org/wiki/Q180573  | 0.26850355   | 0.12669751  | 0.6474359    | 58.0           | 0.2782377600669861  | 0.8423773050308228 |
| https://www.wikidat
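Rows like those above can be pulled out of the bulk scores with a simple filter. A minimal sketch (the threshold values are illustrative, not the actual cut-offs used to build the table):

```python
def extreme_disagreements(rows, comp_max=0.4, qual_min=0.85):
    """Return scored items in the low-completeness / high-quality corner,
    i.e. items the quality model likes but the completeness model does not."""
    return [r for r in rows
            if r["completeness_score"] <= comp_max and r["quality_score"] >= qual_min]
```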
Isaac added a comment. Oooh, and the job worked! High-level data on the overlap between the two scores. The scores share features, except that completeness only takes into account how many of the expected claims/refs/labels are present, while quality adds the total number of claims as a feature:

| completeness_label | quality_label | num_items |
|--------------------|---------------|-----------|
| D                  | D             | 29955491  |
| A                  | C             | 28315614  |
| A                  | B             | 14986978  |
| D                  | C             | 11287166  |
| E                  | D             | 6428229   |
| E                  | E             | 4929743   |
| A                  | D             | 3697974   |
| D                  | E             | 1760575   |
| D                  | B             | 1361759   |
| D                  | A             | 207834    |
| E                  | C             | 55665     |
| A                  | A             | 45423     |
| E                  | B             | 2087      |
| E                  | A             | 241       |
| A                  | E             | 6         |
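An overlap table like this boils down to counting per-item label pairs. A sketch (the real computation presumably ran on the cluster, e.g. as a Spark group-by, rather than in-memory like this):

```python
from collections import Counter

def label_overlap(label_pairs):
    """Count (completeness_label, quality_label) pairs, most common first."""
    return Counter(label_pairs).most_common()
```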
Isaac added a comment. Updates:
- Finally ported all the code from the API to work on the cluster. I don't know if it'll run to completion yet, but I ran it on a subset and the results largely matched the API: https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia/-/blob/master/annotation-gap/wikidata-completeness.ipynb
- Notably, I got rid of the statsmodels ordinal-logistic-regression dependency, which was painful; instead I just take the parameters/thresholds from the model and do the math myself.
- Next step will be running this fully (or on a sample of data) and then choosing a sample of items to provide to raters to compare the scores and decide whether the quality or the completeness model seems to best capture the concept of "this Wikidata item is in good shape".
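Doing the ordered-logit math by hand could look something like the sketch below. It assumes statsmodels' ordered-logit parameterization, P(y <= k) = sigmoid(threshold_k - x . beta); the coefficients, thresholds, and labels in the example are made up, not the fitted model's.

```python
import math

def ordinal_predict(x, beta, thresholds, labels):
    """Predict an ordinal class from exported coefficients and thresholds,
    replicating the ordered-logit math without the statsmodels dependency."""
    score = sum(b * xi for b, xi in zip(beta, x))
    # cumulative probabilities P(y <= k) for each threshold
    cum = [1.0 / (1.0 + math.exp(-(t - score))) for t in thresholds]
    cum.append(1.0)  # P(y <= highest class) = 1
    # per-class probabilities are differences of adjacent cumulative probs
    probs = [cum[0]] + [cum[i] - cum[i - 1] for i in range(1, len(cum))]
    return labels[max(range(len(probs)), key=probs.__getitem__)]
```

The appeal is that only a handful of floats (the betas and thresholds) need to ship to the cluster, rather than the whole fitted model object.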
Isaac added a comment. Updates:
- Wrestling with re-adapting everything to the cluster, but making good progress. One of the main challenges is that the Wikidata item schema differs between the cluster and the API, so there are lots of little errors that I'm having to discover and correct as I adapt the source data.
Isaac added a comment. Updates:
- Successfully generated the property data I need, so now I have the necessary data to run the model in bulk on the cluster and can turn towards generating a dataset for sampling. Notebook: https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia/-/blob/master/annotation-gap/generate_wikidata_propertyfreq_data.ipynb
Isaac added a comment. Updates:
- Began the process of regenerating the property-frequency table on the cluster, given that we shouldn't depend on Recoin for bulk computation even if it greatly simplifies the API prototype. Working out a few bugs, but I feel like I have the right approach and it's relatively simple.
Isaac added a comment. Still no updates, with prep for wikiworkshop/hackathon, but after next week I'm hoping to get back to this!
Isaac added a comment. From discussion with Lydia/Diego:
- The concept of `completeness` feels closer to what we want than `quality` -- i.e., allowing for more nuance in how many statements are associated with a given item. We came up with a few ideas for how to make assessing item completeness easier (because otherwise it would require very extensive knowledge of a domain area to know how many statements should be associated with an item): I suggested providing both the completeness score and the quality score and asking the evaluator which was more appropriate, but I like Lydia's idea better, which was to just provide the completeness score and ask the evaluator whether they felt the actual score was lower, the same, or higher.
- Putting together a dataset like this would be fairly straightforward -- the main challenge is having a nicely stratified dataset, and one that provides information on top of the original quality-oriented dataset. For example, for highly extensive items, both models tend to agree that the item is A-class, so collecting a lot more annotations won't tell us much. It's only for the shorter items that we begin to see discrepancies, so that's where we should probably focus our efforts. Plus, because the model is very specific to the instance-of/occupation properties, we should make sure to have a diversity of items by those properties. This is my main TODO.
- I read through the paper <https://link.springer.com/chapter/10.1007/978-3-030-49461-2_11> describing the newly proposed Wikidata Property Suggester approach. My understanding of the existing item-completeness/recommender systems:
  - Existing Wikidata Property Suggester: makes recommendations for properties to add based on statistics on the co-occurrence of properties. Ignores the values of these properties except for instance-of/subclass-of, where the statistics are based on the value. Recommendations are ranked by probability of co-occurrence.
  - Recoin: similar to the above, but only uses the instance-of property for determining missing properties, adding a refinement for which occupation the item has if it's a human.
  - Proposed Wikidata Property Suggester: a more advanced system for finding likely co-occurring properties based on more fine-grained association rules -- i.e., it doesn't just merge all the individual "if Property A -> Property B k% of the time" rules but instead does things like "if Property A and Property B and ... -> Property N k% of the time". Also takes into account instance-of/subclass-of property values like the existing suggester. This seems like a pretty reasonable enhancement, and their approach is quite lightweight (~1.5GB RAM for holding the data structure).
- I am following the Recoin approach in my model, though if the new Property Suggester proves successful and provides the data needed to incorporate into the model (a list of likely missing properties + confidence scores), it would be very reasonable to swap it in in place of the Recoin approach at a later point. It would also solve some of the problems that @diego was considering addressing via Wikidata embeddings (more nuanced recommendations of missing properties).
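The fine-grained association-rule lookup can be sketched roughly as follows. This is a hedged illustration, not the paper's actual implementation: the `(antecedent property set, suggested property, confidence)` rule format and the example confidences are assumptions.

```python
def suggest_properties(item_props, rules):
    """Rank candidate missing properties by the best matching rule's confidence.

    rules: iterable of (antecedent, consequent, confidence) where antecedent
    is a frozenset of property IDs, encoding
    "if Property A and Property B and ... -> Property N, confidence k".
    """
    item_props = set(item_props)
    best = {}
    for antecedent, consequent, confidence in rules:
        # a rule fires if the item has every antecedent property
        # and is still missing the consequent
        if antecedent <= item_props and consequent not in item_props:
            best[consequent] = max(best.get(consequent, 0.0), confidence)
    return sorted(best.items(), key=lambda kv: -kv[1])
```

Note how a more specific rule (larger antecedent) can override the confidence of a coarser one, which is the advantage over merging all pairwise rules.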
Isaac added a comment. Updated the API to be slightly more robust to instance-of-only edge cases and to provide the individual features. Output for https://wikidata-quality.wmcloud.org/api/item-scores?qid=Q67559155:

{
  "item": "https://www.wikidata.org/wiki/Q67559155",
  "features": {
    "ref-completeness": 0.9055531797461024,
    "claim-completeness": 0.903502532415779,
    "label-desc-completeness": 1.0,
    "num-claims": 11
  },
  "predicted-completeness": "A",
  "predicted-quality": "C"
}

Details:
- `ref-completeness`: what proportion of expected references does the item have? References that are internal to Wikimedia are only given half credit, while external links / identifiers are given full credit. Based on what proportion of claims for a given property typically have references on Wikidata. Also takes into account missing statements.
- `claim-completeness`: what proportion of the expected claims does the item have? Data taken from Recoin <https://www.wikidata.org/wiki/Wikidata:Recoin>, where less common properties for a given instance-of are weighted less.
- `label-desc-completeness`: what proportion of expected labels/descriptions are present? Right now the expected labels/descriptions are English plus any language for which the item has a sitelink.
- `num-claims`: how many total properties the item actually has, so it's a misnomer and something I'll fix at some point (I don't give more credit for, e.g., having 3 authors instead of 1 author for a scientific paper).
- `predicted-completeness`: E (worst) to A (best) (see guidelines <https://www.wikidata.org/wiki/Wikidata:Item_quality>), using just the proportional `*-completeness` features.
- `predicted-quality`: same classes, but now also includes the more generic `num-claims` feature.

Regarding T332021 <https://phabricator.wikimedia.org/T332021>, I'll have to think about how to count that for the label-desc score. Probably no change for descriptions, but for labels, perhaps accept it in place of English while still expecting language-specific labels for any languages that have a sitelink? Either way, labels/descriptions are not a major feature, so it won't greatly affect the model.
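As an illustration of the half-credit scheme behind `ref-completeness`, here is a hedged sketch. The reference categories, data shapes, and example weights are simplified stand-ins for the real feature code, which also folds in missing statements.

```python
def ref_completeness(claims, ref_expectation):
    """Proportion of expected references present, with Wikimedia-internal
    references earning half credit and external links/identifiers full credit.

    claims: list of (property, ref_type) where ref_type is "external",
        "internal", or None (unreferenced)
    ref_expectation: property -> weight in [0, 1], i.e. how often claims
        for that property typically carry references on Wikidata
    """
    credit = {"external": 1.0, "internal": 0.5, None: 0.0}
    earned = expected = 0.0
    for prop, ref_type in claims:
        weight = ref_expectation.get(prop, 0.0)
        expected += weight          # credit available for this claim
        earned += weight * credit[ref_type]  # credit actually earned
    return earned / expected if expected else 1.0
```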
Isaac added a comment. I still need to do some checks (I know, e.g., that this fails when the item lacks statements), but I put together an API for testing the model. It has two outputs: a quality class (E worst to A best) that uses the number of claims on the item as a feature (along with labels/refs/claims completeness) and corresponds very closely to the ORES model outputs and the annotated data, and a completeness class (same set of labels) that does not include the number of claims as a feature and so is more a measure of how complete an item is (a la the Recoin approach). Example: https://wikidata-quality.wmcloud.org/api/item-scores?qid=Q67559155
Isaac added a comment. Weekly updates:
- Discussed with Diego the challenge of whether our annotated data is really assessing what we want it to. I'll try to join the next meeting with Lydia to hear more and figure out our options.
- Diego is also considering how embeddings might help with better missing-property / out-of-date-property / quality predictions for Wikidata subgraphs where we have a lot more data and the sorts of properties you might expect vary at finer-grained levels than just instance-of/occupation -- for example, instances where country of citizenship or age might further mediate which claims you'd expect. This could also be useful for fine-grained similarity, e.g., to identify similar Wikidata items to use as examples or to improve as well.
Isaac added a comment. I slightly tweaked the model, but I also experimented with adding just a simple square root of the number of existing claims to the model and found that that is essentially all that is needed to almost match the ORES quality model (which is near perfect) for predicting item quality. That said, I think this is mainly an issue with the assessment data, as opposed to Wikidata quality really just being about the number of statements. For example, the dataset has many Wikidata items that are for disambiguation pages, and they're almost all rated E-class (lowest) because their only property is their instance-of. I'd argue, though, that that's perfectly acceptable for almost all disambiguation pages, and these items are nearly complete even with just that one property (you can see the frequency of other properties that occur for these pages, but they're pretty low: https://recoin.toolforge.org/getbyclassid.php?subject=Q4167410&n=200). So while the number of claims is a useful feature for matching human perception of quality, I think we'd actually want to leave it out to get closer to the concept of "to what degree is an item missing major information", where most disambiguation pages would do just fine but human items that have many more statements (but also a much higher expectation) wouldn't do as well.

Notebook: https://public.paws.wmcloud.org/User:Isaac_(WMF)/Annotation%20Gap/v2_eval_wikidata_quality_model.ipynb

Quick summary:
- 38.7% correct (62.6% within 1 class) using features ['label_s']
- 56.7% correct (77.0% within 1 class) using features ['claim_s']
- 44.8% correct (72.7% within 1 class) using features ['ref_s']
- 77.3% correct (98.1% within 1 class) using features ['sqrt_num_claims']
- 55.0% correct (75.3% within 1 class) using features ['label_s', 'claim_s']
- 50.2% correct (74.5% within 1 class) using features ['label_s', 'ref_s']
- 76.5% correct (98.4% within 1 class) using features ['label_s', 'sqrt_num_claims']
- 54.2% correct (76.6% within 1 class) using features ['label_s', 'claim_s', 'ref_s']
- 75.1% correct (98.3% within 1 class) using features ['label_s', 'claim_s', 'sqrt_num_claims']
- 79.4% correct (97.7% within 1 class) using features ['label_s', 'ref_s', 'sqrt_num_claims']
- 55.0% correct (78.4% within 1 class) using features ['claim_s', 'ref_s']
- 75.3% correct (98.0% within 1 class) using features ['claim_s', 'sqrt_num_claims']
- 78.8% correct (98.3% within 1 class) using features ['claim_s', 'ref_s', 'sqrt_num_claims']
- 79.4% correct (98.7% within 1 class) using features ['ref_s', 'sqrt_num_claims']
- 78.3% correct (97.9% within 1 class) using features ['label_s', 'claim_s', 'ref_s', 'sqrt_num_claims']

For reference, ORES is at 87.1% correct and 98.3% within 1 class (remembering it's trained on 2x more data, including what I'm evaluating it on here).
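The `sqrt_num_claims` feature amounts to something like the sketch below. Feature names follow the `label_s`/`claim_s`/`ref_s` convention from the summary, but the function itself is a hypothetical illustration, not the notebook's code.

```python
import math

def build_features(item, include_sqrt=True):
    """Assemble a feature vector; the square root dampens the raw claim
    count so very large items don't dominate the fit."""
    feats = [item["label_s"], item["claim_s"], item["ref_s"]]
    if include_sqrt:
        feats.append(math.sqrt(item["num_claims"]))
    return feats
```

Toggling `include_sqrt` is the completeness-vs-quality distinction discussed above: leave it out to measure "how much of what's expected is present" rather than "how much is there".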
Isaac added a comment. Weekly update:
- I cleaned up the results notebook <https://public.paws.wmcloud.org/User:Isaac_(WMF)/Annotation%20Gap/eval_wikidata_quality_model.ipynb#Results>. The original ORES model does better on the labeled data than my initial model. This isn't a big surprise -- it was trained directly on them and uses many more features. A few takeaways:
  - I think one salient thing to take from comparing feature lists with the ORES model is boosting the importance of having an image if that's a common property for similar items.
  - The real perceived benefit of this new model will be its simplicity and flexibility. If we had updated test data, I think the new model would perform much better comparatively, because it shouldn't go stale in the same way the ORES model would: I'm not hard-coding lots of rules but allowing the model to adapt and learn from the current state of Wikidata.
  - The ordinal logistic regression approach that I used might also not be working well. I never really planned to keep it, even though it's a good theoretical match for the data, because I think a simpler classification or linear regression model w/ cut-offs would be just as reasonable. I also only trained it on about 200 items (so I'd have plenty of test data), so there's certainly plenty of room to scale that up.
  - My model includes no features regarding the actual number of statements. They are implicitly included in the completeness proportions (e.g., what proportion of expected claims exist), but I suspect that humans, in labeling items, pay much more attention to the sheer quantity of statements regardless of what's actually expected for an item of a given type. Not sure if this is a drawback or not, but I like that it theoretically allows an item to be high quality even if it only has a few statements.
- The other big next step will be considering how to scale up the model so it could potentially run on LiftWing if that's desired. It has a few semi-large data dependencies, and that might pose a challenge.
Isaac added a comment.

> Recoin I believe didn't exist at that point. It was also not integrated in the existing production systems. I don't think we ever did a proper analysis of what it's currently capable of and how good it is for judging Item quality.

Thanks -- useful context. I'll see about evaluating it then and report back. I've been working on a prototype that essentially uses Recoin + additional rules for labels / references to generate a score. I'll then compare it against the labeled data from the original ORES campaign. You can see a super raw prototype here (scores at the very bottom of the notebook), but I'd wait a week or so until I can generate more interesting figures and actually fine-tune it: https://public.paws.wmcloud.org/User:Isaac_(WMF)/Annotation%20Gap/eval_wikidata_quality_model.ipynb
Isaac added a comment. I started a PAWS notebook where I will evaluate the proposed strategy (Recoin with the addition of reference/label rules) against the 2020 dataset (~4k items) of assessed Wikidata item quality. This will allow me to relatively cheaply assess the method before trying to scale up. Notebook: https://public.paws.wmcloud.org/User:Isaac_(WMF)/Annotation%20Gap/eval_wikidata_quality_model.ipynb
Isaac moved this task from FY2022-23-Research-October-December to FY2022-23-Research-January-March on the Research board. Isaac edited projects, added Research (FY2022-23-Research-January-March); removed Research (FY2022-23-Research-October-December).

WORKBOARD
https://phabricator.wikimedia.org/project/board/45/
Isaac added a subscriber: Lydia_Pintscher. Isaac added a comment. @Lydia_Pintscher I was reminded recently of Recoin <https://www.wikidata.org/wiki/Wikidata:Recoin> (and the closely related PropertySuggester <https://www.mediawiki.org/wiki/Extension:PropertySuggester>), and that got me wondering: is there a reason that the ORES model was used instead of Recoin? Or maybe more specifically, is there any reason not to use Recoin for assessing Wikidata item quality? What are its drawbacks? Looking through it, my impression was that it's quite good and that my approach likely would have been very similar. I do see a few places where we could augment it:
- Also assessing references in a similar way (based on how often a property is referenced on other items) to identify claims where references are missing or could be improved (e.g., imported from Wikipedia).
- Also assessing labels/descriptions based on which language sitelinks exist for the item -- e.g., if there's a Japanese Wikipedia article, the item should also have a Japanese label/description.

And then I know you asked about Properties / Lexemes -- presumably this same strategy could be adopted for them if it's indeed working well for items!
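The sitelink-based label/description idea could be sketched as below. This is a naive mapping from sitelink keys like 'jawiki' to language codes; special sites such as 'commonswiki' or 'specieswiki' (and non-Wikipedia projects like 'jawikisource') would need explicit handling in a real implementation.

```python
def expected_label_languages(sitelinks):
    """English plus any language for which the item has a wiki sitelink."""
    langs = {"en"}  # English is always expected
    for site in sitelinks:
        if site.endswith("wiki"):
            langs.add(site[: -len("wiki")])  # e.g. 'jawiki' -> 'ja'
    return langs
```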
Isaac added a comment. Weekly updates:
- I focused on the references component of the model this week. I built heavily on Amaral, Gabriel, Alessandro Piscopo, Lucie-Aimée Kaffee, Odinaldo Rodrigues, and Elena Simperl. "Assessing the quality of sources in Wikidata across languages: a hybrid approach." Journal of Data and Information Quality (JDIQ) 13, no. 4 (2021): 1-35. <https://arxiv.org/pdf/2109.09405.pdf>
- I wrote a Python function (code below) that takes the references for a claim and maps them to high-level categories that tell us about the quality of the reference -- e.g., has an external URL associated with it vs. referring to an internal Wikidata item or an import from another Wikimedia project. I can imagine weak and strong recommendations based on this -- e.g., high priority would be adding missing references, lower priority might be updating an "imported from Wikimedia project" reference to an external URL, and very low priority might be adding a second reference.
- Using that function, I can generate basic descriptive stats on reference distributions on Wikidata (table below) and split them by property (top-100-most-common properties below). From this data, you can see that we might be able to automatically infer which properties definitely need references, which ones probably should have references, and which ones probably don't by setting some basic heuristics. One challenge will be whether we use the current state of Wikidata (which is heavily bot-influenced and so, for certain properties, reflects the choices of a few people) or try to build a more nuanced dataset based on the edit history of which properties have references when editors add them.
# Code for categorizing references for a claim per a simple taxonomy that by proxy
# tells us something about authority/accessibility/usefulness of the reference.
# Types of references from least -> best, so if a claim has two references and one is
# Internal-Stated and one is External-Direct, we keep External-Direct.
REF_ORDER = {r: i for i, r in enumerate(
    ['Internal-Inferred', 'Internal-Stated', 'Internal-Wikimedia',
     'External-Identifier', 'External-Direct'])}

# All Wikidata properties that are external IDs -- used for detecting when one is used
# as part of a reference.
# TODO: Maybe update to SPARQL query that is external identifier properties ONLY with
# URL formatter properties? (maybe that's essentially the same thing?)
# https://quarry.wmcloud.org/query/69919
EXTERNAL_ID_PROPERTIES = set()
with open('quarry-69919-wikidata-external-ids-run692643.tsv', 'r') as fin:
    for line in fin:
        EXTERNAL_ID_PROPERTIES.add(f'P{line.strip()}')

def getReferenceType(references):
    """Map references for a claim to different categories.

    Heavily inspired by: https://arxiv.org/pdf/2109.09405.pdf
    Also: https://www.wikidata.org/wiki/Help:Sources
    """
    if references is None:
        ref_count = 'unreferenced'
        best_ref_type = None
    else:
        ref_count = 'single' if len(references) == 1 else 'multiple'
        best_ref_types = []
        for ref in references:
            # reference URL OR official website OR archive URL OR URL OR external data available at
            if ('P854' in ref['snaksOrder'] or 'P856' in ref['snaksOrder']
                    or 'P1065' in ref['snaksOrder'] or 'P953' in ref['snaksOrder']
                    or 'P2699' in ref['snaksOrder'] or 'P1325' in ref['snaksOrder']):
                best_ref_types.append('External-Direct')
                break
            elif [p for p in ref['snaksOrder'] if p in EXTERNAL_ID_PROPERTIES]:
                best_ref_types.append('External-Identifier')
            # Wikimedia import URL OR imported from Wikimedia project
            elif 'P4656' in ref['snaksOrder'] or 'P143' in ref['snaksOrder']:
                best_ref_types.append('Internal-Wikimedia')
            # stated in
            elif 'P248' in ref['snaksOrder']:
                best_ref_types.append('Internal-Stated')
            # inferred from Wikidata item OR based on heuristic OR based on
            elif 'P3452' in ref['snaksOrder'] or 'P887' in ref['snaksOrder'] or 'P144' in ref['snaksOrder']:
                best_ref_types.append('Internal-Inferred')
            # title OR published in -- hard to interpret without more info but probably links to Wikidata item
            elif 'P1476' in ref['snaksOrder'] or 'P1433' in ref['snaksOrder']:
                # NOTE: the original message is truncated at this point; treating these
                # as Internal-Stated and keeping the best-ranked type seen is a
                # plausible completion, not the author's verbatim code.
                best_ref_types.append('Internal-Stated')
        best_ref_type = max(best_ref_types, key=REF_ORDER.get, default=None)
    return ref_count, best_ref_type
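The "basic heuristics" step for inferring per-property reference expectations could then be sketched as follows; the function, its name, and the threshold values are illustrative assumptions to be tuned against the descriptive stats mentioned above, not part of the analysis:

```python
def reference_expectation(n_claims, n_referenced_externally, n_referenced_any,
                          min_claims=100, strong=0.75, weak=0.25):
    """Classify a property by how often its claims carry references.

    All thresholds here are hypothetical placeholders; a real version would be
    tuned against the per-property reference distribution table.
    """
    if n_claims < min_claims:
        # too few claims to learn a reliable expectation
        return 'insufficient-data'
    if n_referenced_externally / n_claims >= strong:
        return 'external-reference-expected'
    if n_referenced_any / n_claims >= strong:
        return 'reference-expected'
    if n_referenced_any / n_claims >= weak:
        return 'reference-suggested'
    return 'reference-optional'
```

A property where 80% of claims carry external references would be classified as `external-reference-expected`, while one where references are rare would be `reference-optional` and so would not generate recommendations.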
Isaac added a comment. Able to start thinking about this again and a few thoughts:
- Machine-in-the-loop: when we built quality models for the Wikipedia language communities, it was with the idea that the models could potentially support the existing editor processes for assigning article quality scores -- e.g., https://en.wikipedia.org/wiki/Wikipedia:Content_assessment. This generally aligns with our machine-in-the-loop practice of only building models that clearly could support and receive feedback from existing community processes. For Wikidata, while there are reasonable guidelines <https://www.wikidata.org/wiki/Wikidata:Item_quality> for item quality, the only community-generated data was a one-off labeling campaign from 2020 via Wiki labels <https://meta.wikimedia.org/wiki/Wiki_labels/en>. This presents a major challenge: how do we improve on the existing ORES model to make it more maintainable / effective without a clear feedback loop that can be used to validate/update the model? One possible approach is to instead treat this as a task-identification model -- i.e. instead of seeking to model quality directly (and therefore allowing vague features like the total number of references), we could design a model that explicitly builds a list of missing or to-be-improved properties/aliases/descriptions/references. This list of changes could then always be converted into a quality score -- e.g., by computing a simple ratio of existing properties to missing properties -- but that would be secondary to the model. The community process that can provide feedback for this style of model is then just the regular editing process (albeit quite weakly, because an edit doesn't tell you what else is missing).
Eventually, it could feed into an actual interface similar to the Growth team's structured tasks <https://www.mediawiki.org/wiki/Growth/Personalized_first_day/Structured_tasks> that would provide even more direct feedback, but in the meantime this still feels much more machine-in-the-loop than a direct quality model.
- Reducing data drift: alongside this shift in design from quality -> task identification, we can also make the model more sustainable by doing less hard-coding of outliers (like asteroids <https://github.com/wikimedia/articlequality/blob/master/articlequality/feature_lists/wikidatawiki_data/items_lists.py>) and trying to redesign the model to adapt to the existing structure of Wikidata when it is trained. For example, taking more of the approach previously used for external identifiers / media <https://github.com/wikimedia/articlequality/blob/master/articlequality/feature_lists/wikidatawiki_data/property_datatypes.py>, where the relevant data structures that inform the model are easy to auto-generate and thus could be updated with each model training. This could be extended to, e.g., lists of properties that commonly have references and lists of properties that commonly appear for a given instance-of.
- The model would then take an item as input and perhaps go something like:
  - Extract its instance-of values and sitelinks
  - Sitelinks would be used to help determine which aliases/descriptions should exist
  - Instance-of values would be used to identify which properties are expected
  - For each of those expected properties, it would either be rated as missing, incomplete (missing reference etc.), or complete
  - All of this information could then be compiled as specific tasks
  - And for the quality score, the list of tasks could be compared against the existing data to come to some general score.
- The challenge then is still in the smart compiling of expected properties for a given instance-of, but I feel much better about the structure of this model because it's more transparent, and anyone who is familiar with Wikidata could easily inspect the list of expected properties for a given instance-of and tweak it.
- I'm now working on extracting the list of existing properties for each instance-of to see if most have a clear set of common properties
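A rough sketch of that task-identification flow; the `EXPECTED` table, the task strings, and the simple ratio are hypothetical stand-ins for the auto-generated data structures described above:

```python
# Hypothetical table of expected properties per instance-of value; in the real
# model this would be auto-generated at training time from Wikidata itself.
EXPECTED = {
    'Q5': ['P569', 'P21', 'P106'],  # human: date of birth, sex or gender, occupation
}

def item_tasks(instance_ofs, present_properties):
    """Compile missing-property tasks and a crude completeness ratio.

    instance_ofs: list of instance-of QIDs for the item
    present_properties: list of property IDs already on the item
    Returns (tasks, score); score is None when no expectations are known.
    """
    expected = set()
    for qid in instance_ofs:
        expected.update(EXPECTED.get(qid, []))
    if not expected:
        return [], None
    missing = sorted(expected - set(present_properties))
    tasks = [f'add {p}' for p in missing]
    score = (len(expected) - len(missing)) / len(expected)
    return tasks, score
```

The task list is the primary output; the score is derived from it, which keeps the model's recommendations inspectable by anyone familiar with Wikidata.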
Isaac added a comment. Update: the past few weeks have been busy so I haven't had a chance to look into this, but I'm hoping to get more time in December to focus on it.
Isaac added a comment. Weekly update:
- Summarizing some past research shared / further examinations of the existing ORES model shared by LP:
  - We have to be careful to adjust expectations for a given claim depending on its property type (distribution of property types on Wikidata <https://quarry.wmcloud.org/query/68563>) -- e.g., no references for `external-id` properties. The current model uses a static list for this <https://github.com/wikimedia/articlequality/blob/master/articlequality/feature_lists/wikidatawiki_data/property_datatypes.py> but we might want to re-evaluate.
  - Even though the number of sitelinks might correlate positively with quality, it's a feature we should avoid because it's really a proxy for popularity, not item quality.
  - Wikidata is constantly shifting in big ways, and out-of-date data / rules can lead to models handling particular instance-ofs poorly. We should do our best to make aspects of the model unsupervised or not dependent on a fixed set of data so it can adapt easily.
  - The current model is actually pretty good, so maybe this is less about iterating on it significantly and more about redesigning it for the new LiftWing paradigm and making it less susceptible to data drift.
- Something I've been mulling over is how to ensure the model is actionable in a way that aligns with community goals and points to specific steps a contributor could take to raise quality.
  - For instance, adding/improving references is quite actionable and important. For the verifiability component, then, it's worthwhile to ensure that the model handles this well -- i.e. has a good sense of which statements do and do not need references and differentiates between the different types of references (external vs. Wikipedia).
  - If we're less concerned about making items super extensive but do want to "require" a core set of basic properties (similar to Schemas or inteGraality <https://wikitech.wikimedia.org/wiki/Tool:InteGraality>), we might try to identify that core set of properties for each instance-of and rely less on raw counts of statements in determining scores.
  - What about consistency -- is there some way to capture how well an item matches related ones? And if so, should an item be penalized for being "unique"?
- LP also asked us to consider how to extend this to Lexemes and Properties. I will have to think through whether we can reuse some of the resulting model for those entity types or if they require fully separate approaches.
[Wikidata-bugs] [Maniphest] T249654: Categorize different types of Wikidata re-use within Wikimedia projects
Isaac closed this task as "Resolved". TASK DETAIL https://phabricator.wikimedia.org/T249654
Isaac updated the task description.
Isaac added a comment. Weekly update:
- cleaned up the meta page a little: https://meta.wikimedia.org/wiki/Research:External_Reuse_of_Wikimedia_Content/Wikidata_Transclusion
- this task is essentially done, but I'm going to leave it open at least another week to allow for continued discussion
- further research steps in this space would be:
  - repeating the analysis in a language like Japanese, which shows very little transclusion, and a language like Catalan, which presumably has much more
  - automating the infobox portion of this analysis for enwiki
  - moving to the question raised in T246709 <https://phabricator.wikimedia.org/T246709>, which is essentially applying the same taxonomy (high-, medium-, low-, no-importance) to Wikidata changes that show up in RecentChanges feeds to see how much "noise" appears in them and the best ways to provide better filters there.
Isaac added a comment. @GoranSMilovanovic thanks! I'm pretty open on next steps. This work was done in part to help guide interpretation of potential WMF metrics around measuring transclusion, but I would love to see some improvements made to the way we monitor transclusion if possible too. You'll have to let me know what you see as feasible / reasonable changes, though, and what I can do to help make them happen. In T246709#6367012 <https://phabricator.wikimedia.org/T246709#6367012> I noted two potential improvements I could see being made, based on my very limited knowledge of how Lua / these tables work:
- Distinguishing between standard statements and identifiers in Lua calls. If this were then reflected in wbc_entity_usage, it would be much easier to distinguish between transclusion that is part of linked open data and transclusion of facts like birthdays. It would also substantially reduce noise in Recent Changes because, at least in English Wikipedia, very common metadata templates like Authority Control <https://en.wikipedia.org/wiki/Template:Authority_control> and Taxonbar <https://en.wikipedia.org/wiki/Template:Taxonbar> trigger a general C aspect, so changes to any part of the Wikidata item show up in Recent Changes even when they have no impact on the article. In theory, a filter could then be added to Recent Changes to change how changes to identifiers show up in the feed.
- I'm not sure if it's possible to distinguish between transclusion and tracking in the wbc_entity_usage table -- e.g., a parameter that could be passed with Lua calls to indicate that the property is only being used for tracking. This might just be a hacky change that isn't useful long-term, but tracking categories generate a lot of the entries in the wbc_entity_usage table and are quite different in impact from transclusion.
Isaac added a comment. > Thank you for this analysis - really useful! Thanks! Glad to hear :) Additionally, I made some notes about how these findings may inform patrolling of Wikidata transclusion (T246709#6367012 <https://phabricator.wikimedia.org/T246709#6367012>) and am working on hopefully writing this up for the Wikidata Workshop.
[Wikidata-bugs] [Maniphest] T246709: What proportion of a Wikipedia article's edit history might reasonably be changes via Wikidata transclusion?
Isaac added a comment. The results reported in T249654#6352573 <https://phabricator.wikimedia.org/T249654#6352573> offer some insight into how we might support patrolling of Wikidata transclusion within Wikipedia articles, so I wanted to record some of my initial thoughts here. We would want to talk with patrollers before actually thinking about implementing any of these, and unfortunately I'm not actually working on this aspect of the project at the moment. However: the Recent Changes feed for a given article likely has many more Wikidata-related changes than are actually pertinent to the article from a patrolling standpoint. Some thoughts on reducing this noise:
- Many entries in wbc_entity_usage are from transclusion that only generates tracking categories (e.g., Category:Coordinates on Wikidata <https://en.wikipedia.org/wiki/Category:Coordinates_on_Wikidata>), so arguably there should be a way to mark events on Recent Changes caused by these as tracking-only so patrollers could easily ignore them.
- Many entries in wbc_entity_usage are from metadata templates like Authority Control and Taxonbar that are very valuable from a linked-data perspective but less so from a reader's perspective, and that have a very low potential for harmful vandalism. Because of the way both of these templates are written, they also trigger a general "statements" aspect usage, so any change to statements on the Wikidata item triggers an event on Recent Changes. This adds a lot of noise to the Recent Changes feed from Wikidata where these templates are used. Additionally, changes to Wikidata identifiers that affect Authority Control and Taxonbar have a very low likelihood of being problematic from a reader's perspective, because the external links generated via these templates go to well-curated repositories of information, so the reader should quickly realize the link is incorrect and probably won't end up viewing offensive material.
Ideally these templates would be rewritten to only trigger the specific properties they transclude, but in practice I could see that being difficult, inefficient, or causing the wbc_entity_usage table to become far too large to be practical (as each usage of Authority Control would trigger close to 100 rows, 1 for each property that can be transcluded <https://en.wikipedia.org/wiki/Template:Authority_control#Wikidata>). Instead, maybe wbc_entity_usage could be expanded to distinguish between general statements (C.S?) and identifiers (C.I?)? This would make filtering out changes to identifiers far easier, and metadata templates could then still be recorded simply without causing every change to date of birth, occupation, etc. to also trigger a change. Unfortunately, I suspect this would require making non-trivial changes to the Lua modules and then convincing template coders to adapt their code.
- Some entries in wbc_entity_usage go to generating external links that could more clearly cause harm if vandalized and probably do warrant focus from patrollers. For instance, Wikidata templates that generate links to Commons categories or external links to IMDb etc. could more clearly be abused to link to offensive material. Thankfully, given the specific nature of these templates, they generally are recorded with their specific property and so don't generate noise for patrollers. That said, a not-insignificant amount of their usage (on enwiki) is only for tracking categories, so any change that would distinguish between actual transclusion and tracking categories would serve to reduce noise here too.
- Finally, infobox transclusion has probably the greatest potential for harm (e.g., falsifying someone's age or where they were born).
This seems to be tracked pretty well for most infoboxes (the specific properties each get their own row, and labels for each item that was actually transcluded) so I think it's more about reducing the noise from the above so that patrollers can more easily see these changes. TASK DETAIL https://phabricator.wikimedia.org/T246709
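To make the aspect-filtering idea concrete, here is a hypothetical sketch: `surfaces_change`, `EXTERNAL_IDS`, and the `hide_identifiers` flag are all invented for illustration, and the proposed C.S / C.I split does not exist in wbc_entity_usage today:

```python
# Hypothetical filter: should a change to a given Wikidata property surface in
# a page's Recent Changes feed, given the usage aspects recorded for that page
# in wbc_entity_usage? EXTERNAL_IDS stands in for the full set of
# external-identifier properties (examples: P214 VIAF ID, P227 GND ID).
EXTERNAL_IDS = {'P214', 'P227'}

def surfaces_change(changed_property, aspects, hide_identifiers=False):
    if f'C.{changed_property}' in aspects:
        # the page transcludes this specific property -- always surface
        return True
    if 'C' in aspects:
        # generic statements usage: every statement change surfaces, which is
        # the noise problem described above; a hide-identifiers option could
        # suppress the Authority Control / Taxonbar cases
        if hide_identifiers and changed_property in EXTERNAL_IDS:
            return False
        return True
    return False
```

With a C.S / C.I split, the second branch could key off the recorded aspect instead of a property list, which is the cleaner version of the same filter.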
Isaac added a comment. > This is "overall articles for all projects", correct? It's actually just for English Wikipedia. The number from the WMDE dashboard <https://wmdeanalytics.wmflabs.org/WD_percentUsageDashboard/> for all Wikipedia projects is 31.99% (i.e. the inverse of the 68.01% number provided under "% of Articles that use Wikidata" in the tinier table that aggregates each project family). It varies a lot by wiki too -- vecwiki seems to have almost every article with some form of Wikidata transclusion whereas 62% of articles on Japanese Wikipedia don't have a single Wikidata-based template. This data was only recently added there (see T257962 <https://phabricator.wikimedia.org/T257962>). > How is this calculated? > Do you have your selects to group by "importance" using wbc_entity_usage? Wikidata description usage isn't tracked on wbc_entity_usage as far as I can tell so can't be queried in any straightforward way. The way I reached the 54% number is that I checked each article in my sample to see whether the description was from Wikidata using the gadget mentioned here <https://en.wikipedia.org/wiki/Wikipedia:Short_description#Making_it_visible_in_the_page>. On enwiki at least, Wikidata is the default unless there is a short description provided on the page, which supposedly is tracked by this category <https://en.wikipedia.org/wiki/Category:Articles_with_short_description> (which is how I verified this number -- you can see that the category has 2.1M pages in it, so about 1/3 of articles overwrite the Wikidata description). That said, maybe 10-20% of articles that used Wikidata didn't actually show a description because it hadn't been added in Wikidata yet. > And, forgot to say, THIS IS SUPER USEFUL, thanks! Thanks!! 
Isaac added a comment. @Lydia_Pintscher that makes sense and thanks for reaching out. I'm not going to schedule the meeting right now because I don't want to use up your time if we don't end up prioritizing this work, but when we do, I'll reach out!
Isaac added a comment. > If we have a concrete example to look at I can try to figure that out :) Actually, I think I found the reason for most of the pages: https://en.wikipedia.org/wiki/Template:Authority_control It's generic because it pulls in any external identifiers, so it can't be defined in advance which ones will be transcluded. And while "noise" from this example could be reduced by somehow indicating in wbc_entity_usage whether the properties used are identifiers or statements, I recognize that that doesn't solve the larger challenge.
Isaac renamed this task from "What percentage of edits via Wikidata transclusion are missing on Recent Changes?" to "What proportion of a Wikipedia article's edit history might reasonably be changes via Wikidata transclusion?".
Isaac added a comment. Thanks for the additional details @Addshore ! Some context: this task isn't being worked on right now. I just created it as a potential future analysis because I had just become aware that Wikidata item properties were tracked specifically in wbc_entity_usage, and I think some good numbers on this would be valuable to track. Based on what you said, I think I should probably rename this task to focus on the edit history, as that's closer to what I'm actually interested in (I now see the confusion that the current title causes) -- i.e. edits that originate on Wikidata and actually change the content of an associated Wikipedia article. The Wikidata tracking in the Recent Changes feed still seems far too noisy for this purpose. Looking at English Wikipedia, for example, the challenge with the Recent Changes feed of Wikidata edits <https://en.wikipedia.org/wiki/Special:RecentChanges?hidebots=1&hidepageedits=1&hidenewpages=1&hidecategorization=1&hidelog=1&limit=50&days=30&urlversion=2> is that almost none of those changes (I couldn't actually find any in my quick checking) actually affected the content of the page. Sitelinks obviously matter but I'm not considering them at the moment. Almost all the property changes in that feed are surfaced because the associated Wikipedia article has a generic C property in the wbc_entity_usage table (as opposed to C.P indicating that specific properties are being transcluded). I don't actually understand why they have that C property listed in wbc_entity_usage. > The approach described in the description is likely to get you some fairly unreliable data. Could you provide some more details here?
[Wikidata-bugs] [Maniphest] [Commented On] T209655: Copy Wikidata dumps to HDFS
Isaac added a comment. > @JAllemandou Thank you - as ever! +1: these wikidata parquet (specifically item_page_link) dumps are super useful for us! TASK DETAIL https://phabricator.wikimedia.org/T209655
[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs
Isaac added a comment. Hey @JAllemandou - this is great! Thanks for catching that - looks all good to me now too. TASK DETAIL https://phabricator.wikimedia.org/T215616
[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs
Isaac added a comment. Hey @JAllemandou, some debugging: a number of items aren't showing up and I can't for the life of me figure out why. The few I've looked at are pretty normal articles (for example: https://de.wikipedia.org/wiki/Gregor_Grillemeier) and show up in the original parquet files (`/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190204`). But according to this analysis (T209891#4798717 <https://phabricator.wikimedia.org/T209891#4798717>) and ebernhardson's table (`SELECT count(page_id) from ebernhardson.cirrus2hive where wikiid = 'enwiki' and dump_date='20190121';`), there should be ~5.7 million English articles w/ associated wikidata items and I'm only seeing 916 thousand. I went through your query but could not find anything that would be causing this dropout, so I'm at a loss. Thoughts? Code in case I'm doing something wrong:

wikidataParquetPath = '/user/joal/wmf/data/wmf/wikidata/item_page_link/20190204'
spark.read.parquet(wikidataParquetPath).createOrReplaceTempView('wikidata')
count_per_db = sqlContext.sql('SELECT wiki_db, count(*) FROM wikidata GROUP BY wiki_db')

If you then sort the result, you get:

+--------+--------+
| wiki_db|count(1)|
+--------+--------+
|  zhwiki| 1245854|
|  jawiki| 1210483|
|  enwiki|  916393|
| cebwiki|  891045|
|  svwiki|  778952|
|  dewiki|  656622|
|  frwiki|  414492|
|  nlwiki|  414469|
|  ruwiki|  413733|
...

TASK DETAIL https://phabricator.wikimedia.org/T215616
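The per-wiki GROUP BY count above can be mirrored in plain Python as a sanity check. This is a sketch over a handful of hypothetical (wiki_db, page_id) link rows, not the real parquet data.

```python
from collections import Counter

# Hypothetical item_page_link-style rows: (wiki_db, page_id).
links = [
    ("enwiki", 1), ("enwiki", 2), ("dewiki", 3),
    ("enwiki", 4), ("frwiki", 5),
]

# Equivalent of: SELECT wiki_db, count(*) FROM wikidata GROUP BY wiki_db
count_per_db = Counter(wiki_db for wiki_db, _page_id in links)

# Sorted descending by count, like the table above.
ranking = count_per_db.most_common()
```

If a wiki's count here came out lower than an independent tally of its linked articles, that would point at rows being dropped upstream, which is the symptom described in the comment.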
[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs
Isaac added a comment. @diego: my interpretation is that right now, in the revision-history version, the same wikidb/page ID/title is associated with the same wikidata ID regardless of when the revision occurred. What is the use for that over a table that has just one entry per wikidb/page ID/title? I'm trying to understand so I don't end up misinterpreting the links.

TASK DETAIL https://phabricator.wikimedia.org/T215616
[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs
Isaac added a comment. Thank you @JAllemandou, this is awesome!!! It completely unblocks me (I have a bunch of page titles across all the Wikipedias and need to check whether a pair of them match the same wikidata item)!

TASK DETAIL https://phabricator.wikimedia.org/T215616
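The title-matching check described above can be sketched as a lookup built from item_page_link-style rows. The mapping below is hypothetical sample data for illustration; the real check would run over the full parquet table.

```python
# Hypothetical (item_id, wiki_db, page_title) rows from an
# item_page_link-style table.
item_page_link = [
    ("Q5879", "enwiki", "Johann_Wolfgang_von_Goethe"),
    ("Q5879", "dewiki", "Johann_Wolfgang_von_Goethe"),
    ("Q1339", "enwiki", "Johann_Sebastian_Bach"),
]

# (wiki_db, page_title) -> item_id lookup.
lookup = {(wiki, title): item for item, wiki, title in item_page_link}

def same_item(wiki_a, title_a, wiki_b, title_b):
    """True iff both titles resolve to the same Wikidata item."""
    item_a = lookup.get((wiki_a, title_a))
    item_b = lookup.get((wiki_b, title_b))
    return item_a is not None and item_a == item_b
```

At scale the same check is a self-join of the table on item_id, but the dict form shows the logic: two pages match exactly when they map to one shared item ID, and a missing page never matches anything.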
[Wikidata-bugs] [Maniphest] [Commented On] T215413: Image Classification Working Group
Isaac added a comment. If we go down that path of trying to identify which images are photographs, we should look into work by a former colleague of mine on detecting visualizations on Commons (in some ways, the inverse task): http://brenthecht.com/publications/www18_vizbywiki.pdf He (Allen Lin) might have some insight into easy wins or pitfalls in building a model like that.

TASK DETAIL https://phabricator.wikimedia.org/T215413