[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-08-03 Thread Isaac
Isaac added a comment. I'm going to be out the next several weeks so FYI likely won't hear updates until mid-September on this. Thanks for these additional details though! > Now there are several Properties that can represent such relations. The main ones we should probably

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-07-25 Thread Isaac
Isaac added a comment. > That's quite an interesting table! Would it be possible to get the actual Item IDs for the last two rows? It could be instructive to know which Items the model thinks are very incomplete but have excellent quality :) @Michael thanks for the questio

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-07-21 Thread Isaac
Isaac added a comment. Oooh and the job worked! High-level data on the overlap between the two scores: they are the same except that completeness only takes into account how many of the expected claims/refs/labels are present, while quality also adds the total number of claims to the features

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-07-21 Thread Isaac
Isaac added a comment. Updates: - Finally ported all the code from the API to work on the cluster. I don't know if it'll run to completion yet but I ran it on a subset and the results largely matched the API: https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia/-/b

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-06-30 Thread Isaac
Isaac added a comment. Updates: - Wrestling with re-adapting everything to the cluster but making good progress. One of the main challenges is that the wikidata item schema is different between cluster and API so lots of little errors that I'm having to discover and correct as I

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-06-23 Thread Isaac
Isaac added a comment. Updates: - Successfully generated the property data I need so now I have the necessary data to run the model in bulk on the cluster and can turn towards generating a dataset for sampling. Notebook: https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-06-16 Thread Isaac
Isaac added a comment. Updates: - Began process of regenerating property-frequency table on cluster given that we shouldn't depend on RECOIN for bulk computation even if it greatly simplifies the API prototype. Working out a few bugs but feel like I have the right approac

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-05-12 Thread Isaac
Isaac added a comment. No updates still with prep for wikiworkshop/hackathon but after next week, hoping to get back to this! TASK DETAIL https://phabricator.wikimedia.org/T321224 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Isaac Cc: Michael

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-04-11 Thread Isaac
Isaac added a comment. From discussion with Lydia/Diego: - The concept of `completeness` feels closer to what we want than `quality` -- i.e. allowing for more nuance in how many statements are associated with a given item. We came up with a few ideas for how to make assessing item

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-03-24 Thread Isaac
Isaac added a comment. Updated API to be slightly more robust to instance-of-only edge cases and provide the individual features. Output for https://wikidata-quality.wmcloud.org/api/item-scores?qid=Q67559155: { "item": "https://www.wikidata.org/wiki/Q67559155"
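
For readers who want to query the prototype themselves, here is a minimal sketch of building the request URL shown above. Only the endpoint and the `qid` parameter come from the message; the helper name `item_scores_url` and the Q-ID shape check are assumptions added for illustration, and the response fields are not parsed because the snippet is cut off before they appear.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the message above.
API_BASE = "https://wikidata-quality.wmcloud.org/api/item-scores"


def item_scores_url(qid: str) -> str:
    """Build the item-scores URL for one Wikidata item.

    Hypothetical helper: the name and the validation are not part of the API.
    """
    # A Wikidata item ID is the letter Q followed by digits, e.g. Q67559155.
    if not (qid.startswith("Q") and qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {qid!r}")
    return f"{API_BASE}?{urlencode({'qid': qid})}"


print(item_scores_url("Q67559155"))
```

The resulting URL can be fetched with any HTTP client; whether the service is still running is not something this listing confirms.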

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-03-17 Thread Isaac
Isaac added a comment. I still need to do some checks because I know e.g., this fails when the item lacks statements, but I put together an API for testing the model. It has two outputs: a quality class (E worst to A best) that uses the number of claims on the item as a feature (along with

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-03-10 Thread Isaac
Isaac added a comment. Weekly updates: - Discussed with Diego the challenge of whether our annotated data is really assessing what we want it to. I'll try to join the next meeting with Lydia to hear more and figure out our options. - Diego is also considering how embeddings might

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-03-03 Thread Isaac
Isaac added a comment. I slightly tweaked the model but also experimented with adding just a simple square-root of the number of existing claims to the model and found that this is essentially all that is needed to almost match ORES quality (which is near perfect) for predicting
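
The square-root feature mentioned above can be sketched as follows. This is an illustration only, not the model's actual code; the function name `claim_count_feature` is hypothetical.

```python
import math


def claim_count_feature(num_claims: int) -> float:
    """Square root of the number of existing claims on an item.

    Hypothetical helper illustrating the transform described above.
    """
    if num_claims < 0:
        raise ValueError("num_claims must be non-negative")
    return math.sqrt(num_claims)


# Each additional claim adds less than the one before it:
# going from 0 to 16 claims adds 4.0, going from 16 to 100 adds only 6.0.
for n in (0, 16, 100):
    print(n, claim_count_feature(n))
```

A square root grows quickly for small counts and slowly for large ones, so an item's first few claims move the feature more than later ones do.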

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-02-16 Thread Isaac
Isaac added a comment. Weekly update: - I cleaned up the results notebook <https://public.paws.wmcloud.org/User:Isaac_(WMF)/Annotation%20Gap/eval_wikidata_quality_model.ipynb#Results>. The original ORES model does better on the labeled data than my initial model. This isn't

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-02-10 Thread Isaac
Isaac added a comment. > Recoin I believe didn't exist at that point. It was also not integrated in the existing production systems. I don't think we ever did a proper analysis of what it's currently capable of and how good it is for judging Item quality. Thanks -- us

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-01-27 Thread Isaac
Isaac added a comment. I started a PAWS notebook where I will evaluate the proposed strategy (Recoin with the addition of reference/label rules) against the 2020 dataset (~4k items) of assessed Wikidata item qualities. This will allow me to relatively cheaply assess the method before trying

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-01-24 Thread Isaac
Isaac moved this task from FY2022-23-Research-October-December to FY2022-23-Research-January-March on the Research board. Isaac edited projects, added Research (FY2022-23-Research-January-March); removed Research (FY2022-23-Research-October-December).

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-01-12 Thread Isaac
Isaac added a subscriber: Lydia_Pintscher. Isaac added a comment. @Lydia_Pintscher I was reminded recently of Recoin <https://www.wikidata.org/wiki/Wikidata:Recoin> (and the closely related PropertySuggester <https://www.mediawiki.org/wiki/Extension:PropertySuggester>) and

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2022-12-22 Thread Isaac
Isaac added a comment. Weekly updates: - I focused on the references component of the model this week. I built heavily on Amaral, Gabriel, Alessandro Piscopo, Lucie-Aimée Kaffee, Odinaldo Rodrigues, and Elena Simperl. "Assessing the quality of sources in Wikidata across languag

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2022-12-16 Thread Isaac
Isaac added a comment. Able to start thinking about this again and a few thoughts: - Machine-in-the-loop: when we built quality models for the Wikipedia language communities, it was with the idea that the models could potentially support the existing editor processes for assigning

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2022-12-02 Thread Isaac
Isaac added a comment. Update: past few weeks have been busy so I haven't had a chance to look into this but I'm hoping to get more time in December to focus on it.

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2022-11-04 Thread Isaac
Isaac added a comment. Weekly update: - Summarizing some past research shared / further examinations of the existing ORES model shared by LP: - We have to be careful to adjust expectations for a given claim depending on its property type (distribution of property types on Wikidata

[Wikidata-bugs] [Maniphest] T249654: Categorize different types of Wikidata re-use within Wikimedia projects

2020-08-28 Thread Isaac
Isaac closed this task as "Resolved".

[Wikidata-bugs] [Maniphest] T249654: Categorize different types of Wikidata re-use within Wikimedia projects

2020-08-21 Thread Isaac
Isaac updated the task description.

[Wikidata-bugs] [Maniphest] T249654: Categorize different types of Wikidata re-use within Wikimedia projects

2020-08-21 Thread Isaac
Isaac added a comment. Weekly update: - cleaned up the meta page a little: https://meta.wikimedia.org/wiki/Research:External_Reuse_of_Wikimedia_Content/Wikidata_Transclusion - this task is essentially done but I'm going to leave the task open at least another week to allo

[Wikidata-bugs] [Maniphest] T249654: Categorize different types of Wikidata re-use within Wikimedia projects

2020-08-13 Thread Isaac
Isaac added a comment. @GoranSMilovanovic thanks! I'm pretty open on next steps. This work was done in part to help guide interpretation of potential WMF metrics around measuring transclusion but I would love to see some improvements made to the way we monitor transclusion if possibl

[Wikidata-bugs] [Maniphest] T249654: Categorize different types of Wikidata re-use within Wikimedia projects

2020-08-06 Thread Isaac
Isaac added a comment. > Thank you for this analysis - really useful! Thanks! Glad to hear :) Additionally, I made some notes here about how these findings may inform patrolling of Wikidata transclusion (T246709#6367012 <https://phabricator.wikimedia.org/T246709#6367012>

[Wikidata-bugs] [Maniphest] T246709: What proportion of a Wikipedia article's edit history might reasonably be changes via Wikidata transclusion?

2020-08-06 Thread Isaac
Isaac added a comment. The results reported in T249654#6352573 <https://phabricator.wikimedia.org/T249654#6352573> have some potential insight into how we think about supporting patrolling of Wikidata transclusion within Wikipedia articles so I wanted to record some of my initial th

[Wikidata-bugs] [Maniphest] T249654: Categorize different types of Wikidata re-use within Wikimedia projects

2020-07-31 Thread Isaac
Isaac added a comment. > This is "overall articles for all projects", correct? It's actually just for English Wikipedia. The number from the WMDE dashboard <https://wmdeanalytics.wmflabs.org/WD_percentUsageDashboard/> for all Wikipedia projects is 31.99% (i.e.

[Wikidata-bugs] [Maniphest] [Commented On] T246709: What proportion of a Wikipedia article's edit history might reasonably be changes via Wikidata transclusion?

2020-03-03 Thread Isaac
Isaac added a comment. @Lydia_Pintscher that makes sense and thanks for reaching out. I'm not going to schedule the meeting right now because I don't want to use up your time if we don't end up prioritizing this work, but when we do, I'll reach out!

[Wikidata-bugs] [Maniphest] [Commented On] T246709: What proportion of a Wikipedia article's edit history might reasonably be changes via Wikidata transclusion?

2020-03-02 Thread Isaac
Isaac added a comment. > If we have a concrete example to look at I can try to figure that out :) Actually, I think I found the reason for most of the pages: https://en.wikipedia.org/wiki/Template:Authority_control It's generic because it pulls any external identifiers so

[Wikidata-bugs] [Maniphest] [Retitled] T246709: What proportion of a Wikipedia article's edit history might reasonably be changes via Wikidata transclusion?

2020-03-02 Thread Isaac
Isaac renamed this task from "What percentage of edits via Wikidata transclusion are missing on Recent Changes?" to "What proportion of a Wikipedia article's edit history might reasonably be changes via Wikidata transclusion?".

[Wikidata-bugs] [Maniphest] [Commented On] T246709: What percentage of edits via Wikidata transclusion are missing on Recent Changes?

2020-03-02 Thread Isaac
Isaac added a comment. Thanks for the additional details @Addshore ! Some context: this task isn't being worked right now. I just created it as a potential future analysis because I had just become aware that Wikidata item properties were tracked specifically in wbc_entity_usag

[Wikidata-bugs] [Maniphest] [Commented On] T209655: Copy Wikidata dumps to HDFS

2020-01-14 Thread Isaac
Isaac added a comment. > @JAllemandou Thank you - as ever! +1: these wikidata parquet (specifically item_page_link) dumps are super useful for us!

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-26 Thread Isaac
Isaac added a comment. Hey @JAllemandou - this is great! thanks for catching that - looks all good to me now too.

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-25 Thread Isaac
Isaac added a comment. Hey @JAllemandou, some debugging: a number of items aren't showing up and I can't for the life of me figure out why. The few I've looked at are pretty normal articles (for example: https://de.wikipedia.org/wiki/Gregor_Grillemeier) and show up in the origina

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-19 Thread Isaac
Isaac added a comment. @diego: my interpretation is that right now in the revision history version, the same wikidb/page ID/title is associated with the same wikidata ID regardless of when the revision occurred. what is the use for that over a table that has just one entry per wikidb/page ID/title

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-19 Thread Isaac
Isaac added a comment. thank you @JAllemandou this is awesome!!! completely unblocks me (i have a bunch of page titles across all the wikipedias and need to check whether a pair of them match the same wikidata item)!
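
The check described above (do two page titles on different wikis resolve to the same Wikidata item?) might look roughly like this once the mapping table is loaded. This is a sketch under assumptions: the in-memory dict, the `same_item` helper, and the sample rows (including the item IDs) are made up for illustration and are not taken from the real table.

```python
from typing import Dict, Tuple

# Key: (wiki database name, page title). Value: Wikidata item ID.
PageKey = Tuple[str, str]


def same_item(mapping: Dict[PageKey, str], a: PageKey, b: PageKey) -> bool:
    """Return True only if both pages are mapped and share one item ID."""
    item_a = mapping.get(a)
    item_b = mapping.get(b)
    return item_a is not None and item_a == item_b


# Made-up sample rows; the item IDs are placeholders, not real lookups.
sample = {
    ("enwiki", "Example A"): "Q100",
    ("dewiki", "Beispiel A"): "Q100",
    ("frwiki", "Exemple B"): "Q200",
}

print(same_item(sample, ("enwiki", "Example A"), ("dewiki", "Beispiel A")))
print(same_item(sample, ("enwiki", "Example A"), ("frwiki", "Exemple B")))
```

Pages missing from the mapping are treated as non-matching rather than raising an error, which seems like the safer default when some items fail to show up in the table.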

[Wikidata-bugs] [Maniphest] [Commented On] T215413: Image Classification Working Group

2019-02-07 Thread Isaac
Isaac added a comment. If we go down that pathway of trying to identify what images are photographs, we should look into work by a former colleague of mine on detecting visualizations on Commons (in some ways, the inverse task): http://brenthecht.com/publications/www18_vizbywiki.pdf He (Allen Lin