[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

Isaac Tue, 11 Apr 2023 12:51:18 -0700

Isaac added a comment.


  From discussion with Lydia/Diego:
  
  - The concept of `completeness` feels closer to what we want than `quality` 
-- i.e. it allows for more nuance in how many statements are associated with a 
given item. Assessing item completeness would otherwise require very extensive 
knowledge of a domain area to know how many statements should be associated 
with an item, so we came up with a few ideas for making it easier. I suggested 
providing both the completeness score and the quality score and asking the 
evaluator which was more appropriate, but I prefer Lydia's idea: provide just 
the completeness score and ask the evaluator whether they feel the actual 
score should be lower, the same, or higher.
  - Putting together a dataset like this would be fairly straightforward -- the 
main challenge is having a nice stratified dataset and one that provides 
information on top of the original quality-oriented dataset. For example, for 
highly extensive items, both models tend to agree that the item is A-class, so 
collecting a lot more annotations won't tell us much. It's only for the shorter 
items that we begin to see discrepancies, so that's where we should probably 
focus our efforts. Also, because the model is very specific to the 
instance-of/occupation properties, we should make sure to have a diversity of 
items by those properties. This is my main TODO.
  - I read through the paper 
<https://link.springer.com/chapter/10.1007/978-3-030-49461-2_11> describing the 
new proposed Wikidata Property Suggester approach. My understanding of the 
existing item-completeness/recommender systems:
    - Existing Wikidata Property Suggester: makes recommendations for 
properties to add based on statistics on co-occurrence of properties. Ignores 
values of these properties except for instance-of/subclass-of, where the 
statistics are based on the value. Recommendations are ranked by probability of 
co-occurrence.
    - Recoin: similar to the above, but uses only the instance-of property to 
determine missing properties, refined by occupation when the item is a human.
    - Proposed Wikidata Property Suggester: more advanced system for finding 
likely co-occurring properties based on more fine-grained association rules -- 
i.e. it doesn't just merge all the individual "if Property A -> Property B k% 
of the time" rules but instead does things like "if Property A and Property B 
and ... -> Property N k% of the time". Also takes into account 
instance-of/subclass-of property values like the existing suggester. This seems 
like a pretty reasonable enhancement and their approach is quite lightweight 
(~1.5GB RAM for holding the data structure).
  - I am following the Recoin approach in my model. However, if the new 
Property Suggester proves successful and provides the data the model needs (a 
list of likely missing properties + confidence scores), it would be very 
reasonable to swap it in for the Recoin model at a later point. That could also 
solve some of the problems that @diego was considering addressing via Wikidata 
embeddings (more nuanced recommendations of missing properties).
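
  To make the difference between the pairwise co-occurrence statistics and the 
finer-grained association rules concrete, here is a minimal Python sketch. It 
is not the code of either suggester: the toy items, the function names, and the 
choice to rank a missing property by its best single pairwise confidence are 
all assumptions made for illustration. Only the property IDs (P31 instance of, 
P21 sex or gender, P569 date of birth, P106 occupation, P625 coordinate 
location) are real.

```python
from collections import Counter
from itertools import combinations

# Toy data: each item is the set of property IDs it has statements for.
# The property IDs are real, but the items themselves are invented.
ITEMS = [
    {"P31", "P21", "P569", "P106"},
    {"P31", "P21", "P569"},
    {"P31", "P21", "P106"},
    {"P31", "P625"},
]


def pairwise_confidence(items):
    """Estimate P(B present | A present) for every ordered property pair."""
    single = Counter()
    pair = Counter()
    for props in items:
        single.update(props)
        for a, b in combinations(sorted(props), 2):
            pair[(a, b)] += 1
            pair[(b, a)] += 1
    return {(a, b): n / single[a] for (a, b), n in pair.items()}


def suggest(item_props, conf):
    """Rank missing properties by their best pairwise confidence.

    Taking the maximum over antecedents is an assumption of this sketch,
    not necessarily how the existing suggester merges its statistics.
    """
    scores = {}
    for (a, b), c in conf.items():
        if a in item_props and b not in item_props:
            scores[b] = max(scores.get(b, 0.0), c)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def rule_confidence(items, antecedent, consequent):
    """Confidence of the rule 'all properties in antecedent -> consequent'."""
    matching = [props for props in items if antecedent <= props]
    if not matching:
        return 0.0
    return sum(consequent in props for props in matching) / len(matching)


conf = pairwise_confidence(ITEMS)
ranked = suggest({"P31", "P21"}, conf)
multi = rule_confidence(ITEMS, {"P31", "P21", "P106"}, "P569")
```

  Here `ranked` orders the missing properties using pairwise statistics only, 
while `rule_confidence` conditions on several properties at once, which is the 
kind of multi-property rule the proposed suggester is described as using.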
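
  The stratified dataset described above could be drawn along the lines of the 
following Python sketch. It is only an illustration under assumptions: the 
record keys (`qid`, `instance_of`, `quality`, `completeness`), the idea that 
both scores are numbers in [0, 1], and the `min_gap` threshold for keeping only 
items where the two models disagree are all hypothetical and not taken from the 
actual model.

```python
import random
from collections import defaultdict


def stratified_sample(items, per_stratum, min_gap=0.0, seed=0):
    """Sample up to `per_stratum` items for each instance-of value.

    `items` is a list of dicts with the hypothetical keys 'qid',
    'instance_of', 'quality' and 'completeness'. Only items whose two
    scores differ by at least `min_gap` are eligible, so annotation
    effort goes to the items where the models disagree.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        if abs(item["quality"] - item["completeness"]) >= min_gap:
            strata[item["instance_of"]].append(item)

    sample = []
    for key in sorted(strata):  # sorted for a reproducible stratum order
        group = strata[key]
        k = min(per_stratum, len(group))
        sample.extend(rng.sample(group, k))
    return sample
```

  The same grouping could be keyed on occupation for humans, or on an 
(instance-of, occupation) pair, to get diversity along both properties.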

TASK DETAIL
  https://phabricator.wikimedia.org/T321224

To: Isaac
Cc: Michael, Lydia_Pintscher, diego, Miriam, Isaac, Astuthiodit_1, 
karapayneWMDE, Invadibot, Ywats0ns, maantietaja, ItamarWMDE, Akuckartz, 
Nandana, Abdeaitali, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, 
Avner, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Capt_Swing, Mbch331
