Isaac added a comment.
I'm going to be out for the next several weeks, so FYI you likely won't hear
updates on this until mid-September. Thanks for these additional details though!
> Now there are several Properties that can represent such relations. The
main ones we should probably
Isaac added a comment.
> That's quite an interesting table! Would it be possible to get the actual
Item IDs for the last two rows? It could be instructive to know which Items the
model thinks are very incomplete but have excellent quality :)
@Michael thanks for the question
Isaac added a comment.
Oooh, and the job worked! High-level data on the overlap between the two scores:
they are the same except that completeness only takes into account how many of
the expected claims/refs/labels are present, while quality also adds the total
number of claims to the features.
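For illustration, a minimal sketch of how the two feature sets relate as described above (the feature names are placeholders, not the actual model's inputs):

```python
# Hedged sketch: the two scores share the same features except that the
# quality score additionally uses the total number of claims on the item.
# All feature names below are illustrative placeholders.
def completeness_features(item):
    return {
        "expected_claims_present": item["expected_claims_present"],
        "expected_refs_present": item["expected_refs_present"],
        "expected_labels_present": item["expected_labels_present"],
    }

def quality_features(item):
    feats = completeness_features(item)
    feats["num_claims"] = item["num_claims"]  # the only added feature
    return feats
```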
Isaac added a comment.
Updates:
- Finally ported all the code from the API to work on the cluster. I don't
know if it'll run to completion yet, but I ran it on a subset and the results
largely matched the API:
https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia/-/b
Isaac added a comment.
Updates:
- Wrestling with re-adapting everything to the cluster but making good
progress. One of the main challenges is that the Wikidata item schema is
different between the cluster and the API, so there are lots of little errors
that I'm having to discover and correct as I go.
Isaac added a comment.
Updates:
- Successfully generated the property data I need so now I have the necessary
data to run the model in bulk on the cluster and can turn towards generating a
dataset for sampling. Notebook:
https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia
Isaac added a comment.
Updates:
- Began the process of regenerating the property-frequency table on the cluster,
given that we shouldn't depend on Recoin for bulk computation even if it greatly
simplifies the API prototype. Working out a few bugs but I feel like I have the
right approach
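As a rough illustration of the kind of computation involved, a hedged sketch only; the table name, column names, and schema are assumptions, not the actual pipeline:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hedged sketch: count how often each property appears across items,
# Recoin-style. The source table and its nested claim schema are
# illustrative placeholders for the cluster's Wikidata snapshot.
claims = spark.table("wmf.wikidata_entity").select(
    "id", F.explode("claims").alias("claim")
)
prop_freq = (
    claims.groupBy(F.col("claim.mainSnak.property").alias("property"))
    .count()
    .orderBy(F.desc("count"))
)
prop_freq.show(20)
```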
Isaac added a comment.
Still no updates, with prep for WikiWorkshop/hackathon, but after next week I'm
hoping to get back to this!
Isaac added a comment.
From discussion with Lydia/Diego:
- The concept of `completeness` feels closer to what we want than `quality`
-- i.e. allowing for more nuance in how many statements are associated with a
given item. We came up with a few ideas for how to make assessing item
Isaac added a comment.
Updated the API to be slightly more robust to instance-of-only edge cases and
to provide the individual features. Output for
https://wikidata-quality.wmcloud.org/api/item-scores?qid=Q67559155:
{
"item": "https://www.wikidata.org/wiki/Q67559155&quo
Isaac added a comment.
I still need to do some checks because I know, e.g., that this fails when the
item lacks statements, but I put together an API for testing the model. It has two
outputs: a quality class (E worst to A best) that uses the number of claims on
the item as a feature (along with
Isaac added a comment.
Weekly updates:
- Discussed with Diego the challenge of whether our annotated data is really
assessing what we want it to. I'll try to join the next meeting with Lydia to
hear more and figure out our options.
- Diego is also considering how embeddings might
Isaac added a comment.
I slightly tweaked the model but also experimented with adding just a simple
square root of the number of existing claims as a feature and found that that
is essentially all that is needed to almost match ORES quality (which is
near perfect) for predicting
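A minimal sketch of the kind of feature addition the comment describes; the other feature names and the `item` structure are illustrative placeholders, not the actual model inputs:

```python
import numpy as np

# Hedged sketch: augment an existing feature vector with the square root of
# the number of claims on the item, as described above. The base features
# listed here are hypothetical stand-ins.
def build_features(item):
    base_features = [
        item["completed_claims_fraction"],  # hypothetical existing feature
        item["completed_labels_fraction"],  # hypothetical existing feature
    ]
    base_features.append(np.sqrt(item["num_claims"]))  # the added feature
    return np.asarray(base_features)
```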
Isaac added a comment.
Weekly update:
- I cleaned up the results notebook
<https://public.paws.wmcloud.org/User:Isaac_(WMF)/Annotation%20Gap/eval_wikidata_quality_model.ipynb#Results>.
The original ORES model does better on the labeled data than my initial model.
This isn
Isaac added a comment.
> Recoin I believe didn't exist at that point. It was also not integrated in
the existing production systems. I don't think we ever did a proper analysis of
what it's currently capable of and how good it is for judging Item quality.
Thanks -- us
Isaac added a comment.
I started a PAWS notebook where I will evaluate the proposed strategy (Recoin
with the addition of reference/label rules) against the 2020 dataset (~4k items)
of assessed Wikidata item qualities. This will allow me to relatively cheaply
assess the method before trying
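A minimal sketch of that kind of evaluation, assuming the labeled items live in a CSV with a gold quality class and the model emits a predicted class per item (the file name, column names, and predictor stub are placeholders, not the actual notebook's contents):

```python
import pandas as pd
from sklearn.metrics import classification_report

def predict_quality_class(qid: str) -> str:
    """Placeholder for the Recoin-based scoring described above; returns one
    of the quality classes 'A'..'E'. The real logic lives in the notebook."""
    return "C"

# Hedged sketch: compare predictions against the ~4k labeled items from the
# 2020 assessment dataset (file and columns are assumed names).
labels = pd.read_csv("wikidata_item_quality_labels_2020.csv")  # qid, gold_class
labels["pred_class"] = labels["qid"].apply(predict_quality_class)
print(classification_report(labels["gold_class"], labels["pred_class"]))
```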
Isaac moved this task from FY2022-23-Research-October-December to
FY2022-23-Research-January-March on the Research board.
Isaac edited projects, added Research (FY2022-23-Research-January-March);
removed Research (FY2022-23-Research-October-December).
Isaac added a subscriber: Lydia_Pintscher.
Isaac added a comment.
@Lydia_Pintscher I was reminded recently of Recoin
<https://www.wikidata.org/wiki/Wikidata:Recoin> (and the closely related
PropertySuggester <https://www.mediawiki.org/wiki/Extension:PropertySuggester>)
and
Isaac added a comment.
Weekly updates:
- I focused on the references component of the model this week. I built
heavily on Amaral, Gabriel, Alessandro Piscopo, Lucie-Aimée Kaffee, Odinaldo
Rodrigues, and Elena Simperl. "Assessing the quality of sources in Wikidata
across languages
Isaac added a comment.
Able to start thinking about this again and a few thoughts:
- Machine-in-the-loop: when we built quality models for the Wikipedia
language communities, it was with the idea that the models could potentially
support the existing editor processes for assigning
Isaac added a comment.
Update: past few weeks have been busy so I haven't had a chance to look into
this but I'm hoping to get more time in December to focus on it.
Isaac added a comment.
Weekly update:
- Summarizing some past research shared / further examinations of the
existing ORES model shared by LP:
- We have to be careful to adjust expectations for a given claim depending
on its property type (distribution of property types on Wikidata
Isaac closed this task as "Resolved".
Isaac updated the task description.
Isaac added a comment.
Weekly update:
- cleaned up the meta page a little:
https://meta.wikimedia.org/wiki/Research:External_Reuse_of_Wikimedia_Content/Wikidata_Transclusion
- this task is essentially done but I'm going to leave the task open at least
another week to allow
Isaac added a comment.
@GoranSMilovanovic thanks! I'm pretty open on next steps. This work was done
in part to help guide interpretation of potential WMF metrics around measuring
transclusion but I would love to see some improvements made to the way we
monitor transclusion if possible
Isaac added a comment.
> Thank you for this analysis - really useful!
Thanks! Glad to hear :)
Additionally, I made some notes here about how these findings may inform
patrolling of Wikidata transclusion (T246709#6367012
<https://phabricator.wikimedia.org/T246709#6367012>
Isaac added a comment.
The results reported in T249654#6352573
<https://phabricator.wikimedia.org/T249654#6352573> have some potential insight
into how we think about supporting patrolling of Wikidata transclusion within
Wikipedia articles, so I wanted to record some of my initial thoughts
Isaac added a comment.
> This is "overall articles for all projects", correct?
It's actually just for English Wikipedia. The number from the WMDE dashboard
<https://wmdeanalytics.wmflabs.org/WD_percentUsageDashboard/> for all Wikipedia
projects is 31.99% (i.e.
Isaac added a comment.
@Lydia_Pintscher that makes sense and thanks for reaching out. I'm not going
to schedule the meeting right now because I don't want to use up your time if
we don't end up prioritizing this work, but when we do, I'll reach out!
Isaac added a comment.
> If we have a concrete example to look at I can try to figure that out :)
Actually, I think I found the reason for most of the pages:
https://en.wikipedia.org/wiki/Template:Authority_control
It's generic because it pulls any external identifiers so
Isaac renamed this task from "What percentage of edits via Wikidata
transclusion are missing on Recent Changes?" to "What proportion of a Wikipedia
article's edit history might reasonably be changes via Wikidata transclusion?".
Isaac added a comment.
Thanks for the additional details @Addshore!
Some context: this task isn't being worked on right now. I just created it as a
potential future analysis because I had just become aware that Wikidata item
properties were tracked specifically in wbc_entity_usage
Isaac added a comment.
> @JAllemandou Thank you - as ever!
+1: these wikidata parquet (specifically item_page_link) dumps are super
useful for us!
Isaac added a comment.
Hey @JAllemandou - this is great! thanks for catching that - looks all good
to me now too.
Isaac added a comment.
Hey @JAllemandou, some debugging: a number of items aren't showing up and I
can't for the life of me figure out why. The few I've looked at are pretty normal
articles (for example: https://de.wikipedia.org/wiki/Gregor_Grillemeier) and
show up in the original
Isaac added a comment.
@diego: my interpretation is that right now, in the revision-history version, the same wikidb/page ID/title is associated with the same Wikidata ID regardless of when the revision occurred. What is the use for that over a table that has just one entry per wikidb/page ID/title?
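For illustration, a hedged sketch of what the deduplicated view being asked about might look like, assuming a Spark table with the columns named in the comment (the table and column names are guesses, not the actual schema):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hedged sketch: collapse the revision-history-level table to one row per
# (wiki_db, page_id, page_title), keeping the associated Wikidata item ID.
# Table and column names are illustrative placeholders.
page_item = (
    spark.table("wmf.wikidata_item_page_link")
    .select("wiki_db", "page_id", "page_title", "item_id")
    .dropDuplicates(["wiki_db", "page_id", "page_title"])
)
page_item.show(5)
```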
Isaac added a comment.
Thank you @JAllemandou, this is awesome!!! Completely unblocks me (I have a bunch of page titles across all the Wikipedias and need to check whether a pair of them match the same Wikidata item)!
Isaac added a comment.
If we go down that pathway of trying to identify what images are photographs, we should look into work by a former colleague of mine on detecting visualizations on Commons (in some ways, the inverse task): http://brenthecht.com/publications/www18_vizbywiki.pdf
He (Allen Lin