[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-08-03 Thread Isaac
Isaac added a comment.


  I'm going to be out for the next several weeks, so FYI you likely won't hear updates on this until mid-September. Thanks for these additional details though!
  
  > Now there are several Properties that can represent such relations. The 
main ones we should probably focus on are instance of, subclass of and part of 
as explained on https://www.wikidata.org/wiki/Help:Basic_membership_properties.
  
  Everything is currently based on instance-of values, but it looks like I need to also allow `subclass of` and `part of`. The tricky thing there is that I assume most specific `subclass of` and `part of` values are pretty rare -- e.g., there are only so many items with `subclass of` `physicist` -- so it's hard to learn the expectations for `subclass of physicist`, and my best bet is probably a more generic set of expected properties for any item that has a `subclass of` property regardless of its value (and the same for `part of`). I'll have to see how consistent these properties are though, because if they're highly specific to the value of `subclass of`, the model won't be able to do anything useful with them. I'm hopeful that this small change will fix these various outliers and get us to a place where we can reasonably test and verify that the model is doing what we expect.
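
  To make that concrete, here's a rough sketch of the fallback I have in mind for choosing an item's expected-property set (the profiles and data structures below are made up for illustration, not the actual model code):

# hypothetical per-value profile for instance-of (P31); e.g., human -> gender, date of birth, occupation
EXPECTED_BY_INSTANCE_OF = {'Q5': {'P21', 'P569', 'P106'}}
# generic profiles for items that only carry subclass of (P279) or part of (P361);
# the property sets here are invented for illustration
GENERIC_PROFILES = {'P279': {'P31', 'P279', 'P18'}, 'P361': {'P31', 'P361'}}

def expected_properties(item_claims):
    """Pick an expected-property set: per-value profiles for instance-of (P31),
    otherwise a generic profile keyed on the membership property itself."""
    expected = set()
    for value in item_claims.get('P31', []):
        expected |= EXPECTED_BY_INSTANCE_OF.get(value, set())
    if expected:
        return expected
    for membership_prop in ('P279', 'P361'):  # subclass of, part of
        if membership_prop in item_claims:
            return GENERIC_PROFILES[membership_prop]
    return set()

# an item that is only `subclass of` physicist (Q169470) falls back to the generic profile
print(expected_properties({'P279': ['Q169470']}))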



[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-07-25 Thread Isaac
Isaac added a comment.


  > That's quite an interesting table! Would it be possible to get the actual 
Item IDs for the last two rows? It could be instructive to know which Items the 
model thinks are very incomplete but have excellent quality :)
  
  @Michael thanks for the questions! Some context: I think the completeness model is better suited for evaluating items (it's much more nuanced than the quality model, which largely just takes into consideration the number of statements an item has). This analysis hopefully will do two things: 1) help us find some places where the completeness model doesn't do great so we can tweak it, and 2) build a sample of items to give to Wikidata experts to ensure that the completeness model is in fact capturing their expectations better than the quality model.
  
  Looking into the extreme ends of the data, most of the 241 items that were low completeness and high quality are items that have many statements but lack an instance-of property, which I use for categorizing an item to determine which properties it should have. I assume that it is okay for items to lack an instance-of if they're subclasses of another item? Perhaps I can make a special case for items that lack an instance-of but do have a subclass-of property? Though in checking a few examples, I don't know if there is a particularly consistent set of expectations around which properties should exist for these items. Example: Q22698 (park) <https://www.wikidata.org/wiki/Q22698>. The other set are items like Q7473516 (Tokyo) <https://www.wikidata.org/wiki/Q7473516>, which have a bunch of statements but are lacking references, and also have a bunch of instance-ofs, so they're missing some expected statements too.
  
Data -- instead of the labels, I'm outputting the raw scores which range 
from 0 (very bad) to 1 (very good) for both the individual features and the 
overall completeness/quality scores. Number of statements is what it sounds 
like.

+----------------------------------------+------------+-----------+------------+--------------+-------------------+------------------+
|item                                    |claims_score|refs_score |labels_score|num_statements|completeness_score |quality_score     |
+----------------------------------------+------------+-----------+------------+--------------+-------------------+------------------+
|https://www.wikidata.org/wiki/Q907112   |0.37053642  |0.0948047  |0.65625     |102.0         |0.334656685590744  |1.0               |
|https://www.wikidata.org/wiki/Q34754    |0.36810225  |0.15446919 |0.60655737  |80.0          |0.3423793315887451 |0.9698982238769531|
|https://www.wikidata.org/wiki/Q23427    |0.36427906  |0.20430791 |0.597       |75.0          |0.35162511467933655|0.9493570327758789|
|https://www.wikidata.org/wiki/Q170174   |0.3615961   |0.16795586 |0.653       |62.0          |0.34746548533439636|0.8789964914321899|
|https://www.wikidata.org/wiki/Q43287    |0.3493883   |0.16088052 |0.639       |70.0          |0.33629995584487915|0.9208924174308777|
|https://www.wikidata.org/wiki/Q7473516  |0.34005976  |0.13500817 |0.734375    |59.0          |0.33547067642211914|0.8670973181724548|
|https://www.wikidata.org/wiki/Q28179    |0.33763435  |0.12792718 |0.6805556   |58.0          |0.325609028339386  |0.8516228199005127|
|https://www.wikidata.org/wiki/Q12280    |0.33676738  |0.21568704 |0.6069182   |85.0          |0.3385910391807556 |1.0               |
|https://www.wikidata.org/wiki/Q40362    |0.33630428  |0.13121563 |0.60952383  |70.0          |0.31699275970458984|0.9111775159835815|
|https://www.wikidata.org/wiki/Q81931    |0.33004636  |0.26839557 |0.6395349   |85.0          |0.3518647849559784 |1.0               |
|https://www.wikidata.org/wiki/Q7318     |0.32431     |0.124200575|0.6560284   |61.0          |0.31338077783584595|0.8646677136421204|
|https://www.wikidata.org/wiki/Q133356   |0.32018384  |0.105077215|0.6369048   |62.0          |0.30359283089637756|0.8645642399787903|
|https://www.wikidata.org/wiki/Q39473    |0.30660433  |0.13555892 |0.6276      |66.0          |0.30183497071266174|0.890287458896637 |
|https://www.wikidata.org/wiki/Q4948     |0.30622554  |0.16706112 |0.59638554  |64.0          |0.3058503270149231 |0.878718376159668 |
|https://www.wikidata.org/wiki/Q35666    |0.30606884  |0.30240446 |0.56299216  |76.0          |0.33634650707244873|0.9609817862510681|
|https://www.wikidata.org/wiki/Q5684     |0.2750341   |0.24169385 |0.6298077   |70.0          |0.3096027374267578 |0.9274186491966248|
|https://www.wikidata.org/wiki/Q170468   |0.27139774  |0.14692228 |0.6081081   |67.0          |0.2804390490055084 |0.8926790356636047|
|https://www.wikidata.org/wiki/Q180573   |0.26850355  |0.12669751 |0.6474359   |58.0          |0.2782377600669861 |0.8423773050308228|
|https://www.wikidat
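
  The disagreement cases above could be pulled out with something like this sketch (pandas assumed; the file name and thresholds are made up for illustration):

import pandas as pd

scores = pd.read_csv('item_scores.tsv', sep='\t')  # hypothetical export of the table above
# items the completeness model scores poorly but the quality model scores very highly
outliers = scores[(scores['completeness_score'] < 0.4) & (scores['quality_score'] > 0.95)]
print(outliers[['item', 'num_statements', 'completeness_score', 'quality_score']]
      .sort_values('quality_score', ascending=False)
      .head(20))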

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-07-21 Thread Isaac
Isaac added a comment.


  Oooh and the job worked! High-level data on the overlap between the two scores: they use the same features except that completeness only takes into account how many of the expected claims/refs/labels are present, while quality adds the total number of claims to the features too:
  
+------------------+-------------+---------+
|completeness_label|quality_label|num_items|
+------------------+-------------+---------+
|D                 |D            |29955491 |
|A                 |C            |28315614 |
|A                 |B            |14986978 |
|D                 |C            |11287166 |
|E                 |D            |6428229  |
|E                 |E            |4929743  |
|A                 |D            |3697974  |
|D                 |E            |1760575  |
|D                 |B            |1361759  |
|D                 |A            |207834   |
|E                 |C            |55665    |
|A                 |A            |45423    |
|E                 |B            |2087     |
|E                 |A            |241      |
|A                 |E            |6        |
+------------------+-------------+---------+



[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-07-21 Thread Isaac
Isaac added a comment.


  Updates:
  
  - Finally ported all the code from the API to work on the cluster. I don't know if it'll run to completion yet, but I ran it on a subset and the results largely matched the API: https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia/-/blob/master/annotation-gap/wikidata-completeness.ipynb
- Notably, I got rid of the statsmodels ordinal-logistic-regression dependency, which was painful, and now just take the parameters/thresholds from the fitted model and do the math myself (a quick sketch of the idea follows after this list).
  - Next step will be running this fully, or on a sample of the data, and then choosing a sample of items to provide to raters to compare the scores and decide whether the quality or the completeness model seems to best capture the concept of "this Wikidata item is in good shape".
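
  A minimal sketch of that manual scoring step (the coefficients and thresholds below are placeholders; the real values come from the fitted statsmodels model):

import numpy as np

# placeholder parameters exported from the fitted ordinal logistic regression
BETAS = np.array([2.1, 1.4, 0.8])           # assumed feature order: label_s, claim_s, ref_s
CUTPOINTS = np.array([0.5, 1.2, 2.0, 2.9])  # boundaries between E/D, D/C, C/B, B/A
CLASSES = ['E', 'D', 'C', 'B', 'A']

def predict_class(features):
    """Compute the latent score and return the class whose interval contains it."""
    latent = float(np.dot(features, BETAS))
    return CLASSES[int(np.sum(latent > CUTPOINTS))]

print(predict_class([0.66, 0.90, 0.91]))  # a fairly complete item -> a high class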



[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-06-30 Thread Isaac
Isaac added a comment.


  Updates:
  
  - Wrestling with re-adapting everything to the cluster but making good progress. One of the main challenges is that the Wikidata item schema is different between the cluster and the API, so there are lots of little errors that I'm having to discover and correct as I make that adaptation to the source data.



[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-06-23 Thread Isaac
Isaac added a comment.


  Updates:
  
  - Successfully generated the property data I need so now I have the necessary 
data to run the model in bulk on the cluster and can turn towards generating a 
dataset for sampling. Notebook: 
https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia/-/blob/master/annotation-gap/generate_wikidata_propertyfreq_data.ipynb



[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-06-16 Thread Isaac
Isaac added a comment.


  Updates:
  
  - Began the process of regenerating the property-frequency table on the cluster, given that we shouldn't depend on Recoin for bulk computation even if it greatly simplifies the API prototype. Working out a few bugs but I feel like I have the right approach and that it's relatively simple.



[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-05-12 Thread Isaac
Isaac added a comment.


  Still no updates with prep for wikiworkshop/hackathon, but after next week I'm hoping to get back to this!



[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-04-11 Thread Isaac
Isaac added a comment.


  From discussion with Lydia/Diego:
  
  - The concept of `completeness` feels closer to what we want than `quality` -- i.e. allowing for more nuance in how many statements are associated with a given item. We came up with a few ideas for how to make assessing item completeness easier (because otherwise it would require very extensive knowledge of a domain area to know how many statements should be associated with an item). I suggested providing both the completeness score and the quality score and asking the evaluator which was more appropriate, but I like Lydia's idea better, which was to just provide the completeness score and ask the evaluator whether they felt the actual score was lower, the same, or higher.
  - Putting together a dataset like this would be fairly straightforward -- the 
main challenge is having a nice stratified dataset and one that provides 
information on top of the original quality-oriented dataset. For example, for 
highly-extensive items, both models tend to agree that the item is A-class so 
collecting a lot more annotations won't tell us much. It's only for the shorter 
items where we begin to see discrepancies and so that's where we should 
probably focus our efforts. Plus because the model is very specific to the 
instance-of/occupation properties, we should make sure to have a diversity of 
items by those properties. This is my main TODO.
  - I read through the paper 
<https://link.springer.com/chapter/10.1007/978-3-030-49461-2_11> describing the 
new proposed Wikidata Property Suggester approach. My understanding of the 
existing item-completeness/recommender systems:
- Existing Wikidata Property Suggester: make recommendations for properties 
to add based on statistics on co-occurrence of properties. Ignores values of 
these properties except for instance-of/subclass-of where the statistics are 
based on the value. Recommendations are ranked by probability of co-occurrence.
- Recoin: similar to above but only uses instance-of property for 
determining missing properties and adds in refinement of which occupation the 
item has if it's a human.
- Proposed Wikidata Property Suggester: a more advanced system for finding likely co-occurring properties based on more fine-grained association rules -- i.e. it doesn't just merge all the individual "if Property A -> Property B k% of the time" rules but instead does things like "if Property A and Property B and ... -> Property N k% of the time". Also takes into account instance-of/subclass-of property values like the existing suggester. This seems like a pretty reasonable enhancement and their approach is quite lightweight (~1.5GB RAM for holding the data structure).
  - I am following the Recoin approach in my model, though if the new Property Suggester proves successful and provides the data needed to incorporate into the model (a list of likely missing properties + confidence scores), it would be very reasonable to swap it in for the Recoin approach at a later point, which would also solve some of the problems that @diego was considering addressing via Wikidata embeddings (more nuanced recommendations of missing properties).



[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-03-24 Thread Isaac
Isaac added a comment.


  Updated API to be slightly more robust to instance-of-only edge cases and 
provide the individual features. Output for 
https://wikidata-quality.wmcloud.org/api/item-scores?qid=Q67559155:
  
{
  "item": "https://www.wikidata.org/wiki/Q67559155",
  "features": {
    "ref-completeness": 0.9055531797461024,
    "claim-completeness": 0.903502532415779,
    "label-desc-completeness": 1.0,
    "num-claims": 11
  },
  "predicted-completeness": "A",
  "predicted-quality": "C"
}
  
  Details:
  
  - `ref-completeness`: what proportion of expected references does the item 
have? References that are internal to Wikimedia are only given half-credit 
while external links / identifiers are given full credit. Based on what 
proportion of claims for a given property typically have references on 
Wikidata. Also takes into account missing statements.
  - `claim-completeness`: what proportion of the expected claims does the item have? Data taken from Recoin <https://www.wikidata.org/wiki/Wikidata:Recoin>, where less common properties for a given instance-of are weighted less (see the sketch after this list).
  - `label-desc-completeness`: what proportion of expected labels/descriptions are present? Right now the expected labels/descriptions are English plus any language for which the item has a sitelink.
  - `num-claims`: actually the total number of properties the item has, so it's a misnomer and something I'll fix at some point (I don't give more credit for, e.g., having 3 authors instead of 1 author for a scientific paper).
  - `predicted-completeness`: E (worst) to A (best) (see the guidelines <https://www.wikidata.org/wiki/Wikidata:Item_quality>); uses just the proportional `*-completeness` features.
  - `predicted-quality`: same classes but also includes the more generic `num-claims` feature.
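
  To illustrate the weighted claim-completeness idea, a minimal sketch (the property weights below are made up; the real ones come from the Recoin-style frequency data):

# hypothetical expected-property weights for one instance-of class (human, Q5)
EXPECTED_PROPS = {
    'Q5': {'P21': 0.98, 'P569': 0.95, 'P106': 0.90, 'P19': 0.60, 'P18': 0.35},
}

def claim_completeness(instance_of, item_props):
    """Weighted share of the expected properties that the item actually has."""
    expected = EXPECTED_PROPS.get(instance_of, {})
    if not expected:
        return None  # no expectations available for this class
    present = sum(weight for prop, weight in expected.items() if prop in item_props)
    return present / sum(expected.values())

# an item with gender, date of birth, and occupation but no place of birth or image
print(claim_completeness('Q5', {'P21', 'P569', 'P106'}))  # ~0.75 of the weighted expectation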
  
  Regarding T332021 <https://phabricator.wikimedia.org/T332021>, I'll have to 
think about how to count that for the label-desc score. Probably no change for 
descriptions but for labels, perhaps accept it in place of English but still 
expect language-specific labels for any languages that have a sitelink? Either 
way, label/descriptions are not a major feature so it won't greatly affect the 
model.



[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-03-17 Thread Isaac
Isaac added a comment.


  I still need to do some checks because I know, e.g., that this fails when the item lacks statements, but I put together an API for testing the model. It has two outputs: a quality class (E worst to A best) that uses the number of claims on the item as a feature (along with labels/refs/claims completeness) and corresponds very closely to the ORES model outputs and the annotated data, and a completeness class (same set of labels) that does not include the number of claims as a feature and so is more a measure of how complete an item is (a la the Recoin approach).
  
  Example: https://wikidata-quality.wmcloud.org/api/item-scores?qid=Q67559155
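
  A quick hedged example of calling the test endpoint from Python (it's a prototype on wmcloud, so it may not stay available; field names follow the example output shown above):

import requests

resp = requests.get(
    'https://wikidata-quality.wmcloud.org/api/item-scores',
    params={'qid': 'Q67559155'},
    headers={'User-Agent': 'wikidata-quality-test (example script)'},
    timeout=30,
)
resp.raise_for_status()
scores = resp.json()
print(scores['predicted-quality'], scores['predicted-completeness'])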



[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-03-10 Thread Isaac
Isaac added a comment.


  Weekly updates:
  
  - Discussed with Diego the challenge of whether our annotated data is really 
assessing what we want it to. I'll try to join the next meeting with Lydia to 
hear more and figure out our options.
  - Diego is also considering how embeddings might help with better missing-property / out-of-date-property / quality predictions for Wikidata subgraphs where we have a lot more data and where the sorts of properties you might expect vary at a finer-grained level than just instance-of/occupation -- for example, cases where country of citizenship or age might further mediate what claims you'd expect. This could also be useful for fine-grained similarity, e.g., to identify similar Wikidata items to use as examples or to also improve.



[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-03-03 Thread Isaac
Isaac added a comment.


  I slightly tweaked the model but also experimented with adding just a simple square root of the number of existing claims as a feature and found that that is essentially all that is needed to almost match ORES (which is near perfect) at predicting item quality. That said, I think this is mainly an issue with the assessment data as opposed to Wikidata quality really just being about the number of statements. For example, the dataset has many Wikidata items that are for disambiguation pages and they're almost all rated E-class (lowest) because their only property is their instance-of. I'd argue though that that's perfectly acceptable for almost all disambiguation pages and these items are nearly complete even with just that one property (you can see the frequency of other properties that occur for these pages but they're pretty low: https://recoin.toolforge.org/getbyclassid.php?subject=Q4167410&n=200). So while the number of claims is a useful feature for matching human perception of quality, I think we'd actually want to leave it out to get closer to the concept of "to what degree is an item missing major information", where most disambiguation pages would do just fine but human items that have many more statements (and also a much higher expectation) wouldn't do as well.
  
  Notebook: 
https://public.paws.wmcloud.org/User:Isaac_(WMF)/Annotation%20Gap/v2_eval_wikidata_quality_model.ipynb
  Quick summary:
  
38.7% correct (62.6% within 1 class) using features ['label_s'].
56.7% correct (77.0% within 1 class) using features ['claim_s'].
44.8% correct (72.7% within 1 class) using features ['ref_s'].
77.3% correct (98.1% within 1 class) using features ['sqrt_num_claims'].
55.0% correct (75.3% within 1 class) using features ['label_s', 'claim_s'].
50.2% correct (74.5% within 1 class) using features ['label_s', 'ref_s'].
76.5% correct (98.4% within 1 class) using features ['label_s', 
'sqrt_num_claims'].
54.2% correct (76.6% within 1 class) using features ['label_s', 'claim_s', 
'ref_s'].
75.1% correct (98.3% within 1 class) using features ['label_s', 'claim_s', 
'sqrt_num_claims'].
79.4% correct (97.7% within 1 class) using features ['label_s', 'ref_s', 
'sqrt_num_claims'].
55.0% correct (78.4% within 1 class) using features ['claim_s', 'ref_s'].
75.3% correct (98.0% within 1 class) using features ['claim_s', 
'sqrt_num_claims'].
78.8% correct (98.3% within 1 class) using features ['claim_s', 'ref_s', 
'sqrt_num_claims'].
79.4% correct (98.7% within 1 class) using features ['ref_s', 
'sqrt_num_claims'].
78.3% correct (97.9% within 1 class) using features ['label_s', 'claim_s', 
'ref_s', 'sqrt_num_claims']

ORES is at (remembering it's trained on 2x more data including what I'm 
evaluating it on here):
87.1% correct and 98.3% within 1 class
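
For clarity, the two numbers reported per line amount to something like this small sketch (an assumed helper, not code from the notebook):

CLASS_TO_IDX = {c: i for i, c in enumerate(['E', 'D', 'C', 'B', 'A'])}

def ordinal_accuracy(y_true, y_pred):
    """Exact accuracy plus 'within 1 class' accuracy for ordinal E-A labels."""
    exact = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    within1 = sum(abs(CLASS_TO_IDX[t] - CLASS_TO_IDX[p]) <= 1
                  for t, p in zip(y_true, y_pred)) / len(y_true)
    return exact, within1

print(ordinal_accuracy(['A', 'B', 'E'], ['A', 'C', 'C']))  # (0.33..., 0.66...)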



[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-02-16 Thread Isaac
Isaac added a comment.


  Weekly update:
  
  - I cleaned up the results notebook 
<https://public.paws.wmcloud.org/User:Isaac_(WMF)/Annotation%20Gap/eval_wikidata_quality_model.ipynb#Results>.
 The original ORES model does better on the labeled data than my initial model. 
This isn't a big surprise -- it was trained directly on them and uses many more 
features. A few takeaways:
- One salient thing to take from comparing feature lists with the ORES model is boosting the importance of having an image when that's a common property for similar items.
- The real perceived benefit of this new model will be its simplicity and flexibility. If we had updated test data, I think the new model would perform much better comparatively because it shouldn't go stale the way the ORES model does: I'm not hard-coding lots of rules but allowing the model to adapt and learn from the current state of Wikidata.
- The ordinal logistic regression approach that I used might also not be working well. I never really planned to keep it even though it's a good theoretical match for the data, because I think a simpler classification or linear regression model with cut-offs would be just as reasonable. I also only trained it on about 200 items so that I'd have plenty of test data, so there's certainly plenty of room to scale that up.
- My model includes no features regarding the actual number of statements. 
They are implicitly included in the completeness proportions (e.g., what 
proportion of expected claims exist) but I suspect humans in labeling items pay 
much more attention to the sheer quantity of statements regardless of what's 
actually expected for an item of a given type. Not sure if this is a drawback 
or not but I like that it theoretically allows for an item to be high quality 
even if it only has a few statements.
  - Other big next step will be considering how to scale up the model so it 
could potentially run on LiftWing if that's desired. It has a few semi-large 
data dependencies and that might pose a challenge.



[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-02-10 Thread Isaac
Isaac added a comment.


  > Recoin I believe didn't exist at that point. It was also not integrated in 
the existing production systems. I don't think we ever did a proper analysis of 
what it's currently capable of and how good it is for judging Item quality.
  
  Thanks -- useful context. I'll see about evaluating it then and report back. 
I've been working on a prototype that essentially uses Recoin + additional 
rules for labels / references to generate a score. I'll then compare it against 
the labeled data from the original ORES campaign. You can see a super raw 
prototype here (scores at the very bottom of the notebook) but I'd wait a week 
or so until I can generate more interesting figures and actually fine-tune it: 
https://public.paws.wmcloud.org/User:Isaac_(WMF)/Annotation%20Gap/eval_wikidata_quality_model.ipynb



[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-01-27 Thread Isaac
Isaac added a comment.


  I started a PAWS notebook where I will evaluate the proposed strategy (Recoin with the addition of reference/label rules) against the 2020 dataset (~4k items) of assessed Wikidata item quality. This will allow me to relatively cheaply assess the method before trying to scale it up.
  
  Notebook: 
https://public.paws.wmcloud.org/User:Isaac_(WMF)/Annotation%20Gap/eval_wikidata_quality_model.ipynb



[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-01-24 Thread Isaac
Isaac moved this task from FY2022-23-Research-October-December to 
FY2022-23-Research-January-March on the Research board.
Isaac edited projects, added Research (FY2022-23-Research-January-March); 
removed Research (FY2022-23-Research-October-December).



[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2023-01-12 Thread Isaac
Isaac added a subscriber: Lydia_Pintscher.
Isaac added a comment.


  @Lydia_Pintscher I was reminded recently of Recoin 
<https://www.wikidata.org/wiki/Wikidata:Recoin> (and the closely related 
PropertySuggester <https://www.mediawiki.org/wiki/Extension:PropertySuggester>) 
and that got me wondering: is there a reason that the ORES model was used 
instead of Recoin? Or maybe more specifically, is there any reason not to use 
Recoin for assessing Wikidata item quality? What are its drawbacks?
  
  Looking through it, my impression was that it's quite good and that my 
approach likely would have been very similar. I do see a few places we could 
augment it:
  
  - Also assessing references in a similar way (based on how often a property 
is referenced on other items) to identify claims where references are missing 
or could be improved (e.g., imported from wikipedia)
  - Also assessing labels/descriptions based on which language sitelinks exist for the item -- e.g., if there's a Japanese Wikipedia article, the item should also have a Japanese label/description (sketched below)
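
  A hedged sketch of that label/description idea (simplified item structure; real Wikidata JSON nests labels and sitelinks differently):

def label_completeness(item):
    """Share of expected labels present: English plus every language with a sitelink."""
    # crude suffix handling: 'jawiki' -> 'ja'; ignores non-Wikipedia sitelinks
    expected_langs = {'en'} | {s[:-4] for s in item.get('sitelinks', {}) if s.endswith('wiki')}
    present_langs = set(item.get('labels', {}))
    return len(expected_langs & present_langs) / len(expected_langs)

item = {'labels': {'en': 'example', 'ja': '例'},
        'sitelinks': {'enwiki': '...', 'jawiki': '...', 'frwiki': '...'}}
print(label_completeness(item))  # 2 of the 3 expected languages -> ~0.67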
  
  And then I know you asked about Properties / Lexemes -- presumably this same 
strategy could be adopted for them if it's indeed working well for items!



[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2022-12-22 Thread Isaac
Isaac added a comment.


  Weekly updates:
  
  - I focused on the references component of the model this week. I built 
heavily on Amaral, Gabriel, Alessandro Piscopo, Lucie-Aimée Kaffee, Odinaldo 
Rodrigues, and Elena Simperl. "Assessing the quality of sources in Wikidata 
across languages: a hybrid approach." Journal of Data and Information Quality 
(JDIQ) 13, no. 4 (2021): 1-35. <https://arxiv.org/pdf/2109.09405.pdf>
  - I wrote a Python function (code below) that takes the references for a claim and maps them to high-level categories that tell us about the quality of the referencing -- e.g., has an external URL associated with it vs. referring to an internal Wikidata item or an import from another Wikimedia project. I can imagine weak and strong recommendations based on this -- e.g., high priority would be adding missing references, lower priority might be updating "imported from Wikimedia project" to an external URL, and very low priority might be adding a second reference.
  - Using that function, I can generate basic descriptive stats on reference 
distributions on Wikidata (table below) and split by property 
(top-100-most-common properties below). From this data, you can see that we 
might be able to automatically infer which properties definitely need 
references, which ones probably should have references, and which ones probably 
don't by just setting some basic heuristics. One challenge will be whether we 
use the current state of Wikidata (which is heavily bot-influenced so for 
certain properties, reflects the choice of a few people) or try to build a more 
nuanced dataset based on edit history of which properties have references when 
editors add them.
  
# Code for categorizing references for a claim per a simple taxonomy that by proxy
# tells us something about authority/accessibility/usefulness of the reference.
# Types of references from least -> best, mapped to a rank: so if a claim has two
# references and one is Internal-Stated and one is External-Direct, we keep External-Direct.
REF_ORDER = {r: i for i, r in enumerate(
    ['Internal-Inferred', 'Internal-Stated', 'Internal-Wikimedia',
     'External-Identifier', 'External-Direct'])}

# All Wikidata properties that are external IDs -- used for detecting when one is
# used as part of a reference.
# TODO: Maybe update to a SPARQL query for external-identifier properties ONLY with
# URL formatter properties? (maybe that's essentially the same thing?)
# https://quarry.wmcloud.org/query/69919
EXTERNAL_ID_PROPERTIES = set()
with open('quarry-69919-wikidata-external-ids-run692643.tsv', 'r') as fin:
    for line in fin:
        EXTERNAL_ID_PROPERTIES.add(f'P{line.strip()}')

def getReferenceType(references):
    """Map references for a claim to different categories.

    Heavily inspired by: https://arxiv.org/pdf/2109.09405.pdf
    Also: https://www.wikidata.org/wiki/Help:Sources
    """
    if references is None:
        ref_count = 'unreferenced'
        best_ref_type = None
    else:
        ref_count = 'single' if len(references) == 1 else 'multiple'
        best_ref_types = []
        for ref in references:
            # reference URL OR official website OR archive URL OR URL OR external data available at
            if ('P854' in ref['snaksOrder'] or 'P856' in ref['snaksOrder']
                    or 'P1065' in ref['snaksOrder'] or 'P953' in ref['snaksOrder']
                    or 'P2699' in ref['snaksOrder'] or 'P1325' in ref['snaksOrder']):
                best_ref_types.append('External-Direct')
                break
            # any external-identifier property used as a reference
            elif [p for p in ref['snaksOrder'] if p in EXTERNAL_ID_PROPERTIES]:
                best_ref_types.append('External-Identifier')
            # Wikimedia import URL OR imported from Wikimedia project
            elif 'P4656' in ref['snaksOrder'] or 'P143' in ref['snaksOrder']:
                best_ref_types.append('Internal-Wikimedia')
            # stated in
            elif 'P248' in ref['snaksOrder']:
                best_ref_types.append('Internal-Stated')
            # inferred from Wikidata item OR based on heuristic OR based on
            elif ('P3452' in ref['snaksOrder'] or 'P887' in ref['snaksOrder']
                    or 'P144' in ref['snaksOrder']):
                best_ref_types.append('Internal-Inferred')
            # title OR published in -- hard to interpret without more info but probably links to Wikidata item
            elif 'P
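
A small companion sketch (not part of the function above) of how the weak/strong recommendation idea could use these categories, assuming REF_ORDER maps each category to its rank as defined above:

def best_reference_category(best_ref_types):
    # strongest category seen across a claim's references, per REF_ORDER
    return max(best_ref_types, key=REF_ORDER.get) if best_ref_types else None

def reference_recommendation(ref_count, best_ref_type):
    if ref_count == 'unreferenced':
        return 'high priority: add a missing reference'
    if best_ref_type.startswith('Internal'):
        return 'lower priority: upgrade the Wikimedia-internal reference to an external source'
    if ref_count == 'single':
        return 'very low priority: consider adding a second reference'
    return 'ok: externally referenced'

# e.g., reference_recommendation('single', best_reference_category(['Internal-Stated', 'External-Identifier']))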

[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2022-12-16 Thread Isaac
Isaac added a comment.


  Able to start thinking about this again and a few thoughts:
  
  - Machine-in-the-loop: when we built quality models for the Wikipedia 
language communities, it was with the idea that the models could potentially 
support the existing editor processes for assigning article quality scores -- 
e.g., https://en.wikipedia.org/wiki/Wikipedia:Content_assessment. This 
generally aligns with our machine-in-the-loop practice of only building models 
that clearly could support and receive feedback from existing community 
processes. For Wikidata, while there are reasonable guidelines 
<https://www.wikidata.org/wiki/Wikidata:Item_quality> for item quality, the 
only community-generated data was a one-off labeling campaign from 2020 via 
Wiki labels <https://meta.wikimedia.org/wiki/Wiki_labels/en>. This presents a 
major challenge: how do we improve on the existing ORES model to make it more 
maintainable / effective without a clear feedback loop that can be used to 
validate/update the model? One possible approach is to instead treat this as a 
task-identification model -- i.e. instead of seeking to model quality directly 
and therefore allowing vague features like the total # of references, we could 
design a model that seeks to explicitly build a list of missing/to-be-improved 
properties/aliases/descriptions/references. This list of changes could then 
always be converted into a quality score -- e.g., by computing a simple ratio 
of existing properties to missing properties or something like that -- but that 
would be secondary to the model. The community process that can provide 
feedback for this style of model then is just the regular editing process 
(albeit quite weakly because an edit doesn't tell you what else is missing). 
Eventually, it could feed into an actual interface similar to the Growth team's 
structured tasks 
<https://www.mediawiki.org/wiki/Growth/Personalized_first_day/Structured_tasks> 
that would provide even more direct feedback, but in the meantime this still 
feels much more machine-in-the-loop than a direct quality model.
  - Reducing data drift: alongside this shift in design from quality -> task 
identification, we can also make the model more sustainable by doing less 
hard-coding of outliers (like asteroids 
<https://github.com/wikimedia/articlequality/blob/master/articlequality/feature_lists/wikidatawiki_data/items_lists.py>)
 and try to redesign the model to adapt to the existing structure of Wikidata 
when it is trained. For example, taking more the approach previously taken for 
external identifiers / media 
<https://github.com/wikimedia/articlequality/blob/master/articlequality/feature_lists/wikidatawiki_data/property_datatypes.py>
 where the relevant data structures that inform the model are easy to 
auto-generate and thus could be updated with each model training. This could be 
extended to e.g., lists of properties that commonly have references and lists 
of properties that commonly appear for a given instance-of.
- Then the model would take an item as input and perhaps go something like (roughly sketched in code after this list):
  - Extract its instance-of values and sitelinks
  - Sitelinks would be used to help determine which aliases/descriptions 
should exist
  - Instance-ofs would be used to identify which properties are expected
  - For each of those expected properties, it would either be rated as 
missing, incomplete (missing reference etc.), or complete
  - And then all of this information could be compiled as specific tasks
  - And for the quality score, the list of tasks could be compared against 
the existing data to come to some general score.
- The challenge then still is in the smart compiling of expected properties 
for a given instance-of, but I feel much better about the structure of this 
model because it's more transparent and anyone who is familiar with Wikidata 
could easily inspect the list of expected properties for a given instance-of 
and tweak it.
- I'm now working on extracting the list of existing properties for each 
instance-of to see if most have a clear set of common properties
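
  Very roughly, the per-item flow described above could look like the following sketch (simplified item structure; the expected-property and expected-language inputs are assumed to come from the auto-generated data described above):

def instance_of_values(item):
    """Values of P31 (instance of) from a Wikidata-style item dict."""
    return {c['mainsnak']['datavalue']['value']['id']
            for c in item.get('claims', {}).get('P31', [])
            if c.get('mainsnak', {}).get('snaktype') == 'value'}

def assess_item(item, expected_props_by_class, expected_langs):
    """Compile a task list (missing/incomplete pieces) and a simple derived score."""
    tasks = []
    expected_props = set()
    for cls in instance_of_values(item):
        expected_props |= expected_props_by_class.get(cls, set())
    for prop in expected_props:
        claims = item.get('claims', {}).get(prop, [])
        if not claims:
            tasks.append(('add-statement', prop))
        elif not any(c.get('references') for c in claims):
            tasks.append(('add-reference', prop))
    for lang in expected_langs:
        if lang not in item.get('labels', {}):
            tasks.append(('add-label', lang))
    # simple derived score: share of expectations already satisfied
    n_expected = len(expected_props) + len(expected_langs)
    score = 1 - len(tasks) / n_expected if n_expected else None
    return tasks, score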



[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2022-12-02 Thread Isaac
Isaac added a comment.


  Update: the past few weeks have been busy so I haven't had a chance to look into this, but I'm hoping to get more time in December to focus on it.



[Wikidata-bugs] [Maniphest] T321224: Wikidata Item Quality Model

2022-11-04 Thread Isaac
Isaac added a comment.


  Weekly update:
  
  - Summarizing some past research shared / further examinations of the 
existing ORES model shared by LP:
- We have to be careful to adjust expectations for a given claim depending 
on its property type (distribution of property types on Wikidata 
<https://quarry.wmcloud.org/query/68563>) -- e.g., no references for 
`external-id` properties. Current model uses a static list for this 
<https://github.com/wikimedia/articlequality/blob/master/articlequality/feature_lists/wikidatawiki_data/property_datatypes.py>
 but we might want to re-evaluate.
- Even though number of sitelinks might correlate positively with quality, 
it's a feature we should avoid as it's really a proxy for popularity and not 
item quality
- Wikidata is constantly shifting in big ways and out-of-date data / rules 
can lead to models handling particular instance-ofs poorly. We should do our 
best to make aspects of the model unsupervised or not dependent on a fixed set 
of data so it can adapt easily.
- The current model is actually pretty good, so maybe this is less about iterating on it significantly and more about redesigning it for the new LiftWing paradigm and making it less susceptible to data drift.
  - Something I've been mulling over is how to ensure the model is actionable 
in a way that aligns with community goals and points to specific steps a 
contributor could take to raise quality.
- For instance, adding/improving references is quite actionable and 
important. For the verifiability component then, it's worthwhile to ensure that 
the model handles this well -- i.e. has a good sense of which statements do and 
do not need references and differentiates between the different types of 
references (external vs. Wikipedia).
- If we're less concerned about making items super extensive but do want to "require" a core set of basic properties (similar to Schemas or inteGraality <https://wikitech.wikimedia.org/wiki/Tool:InteGraality>), we might try to identify that core set of properties for each instance-of and rely less on raw counts of statements in determining scores (see the sketch after this list).
- What about consistency -- is there some way to capture how well an item 
matches related ones? And if so, should an item be penalized for being "unique"?
  - LP also asked us to consider how to extend this to Lexemes and Properties. 
Will have to think through that and whether we can reuse some of the resulting 
model for those item types or if they require fully separate approaches.
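
  A sketch of how that per-instance-of "core set" could be derived from property-frequency data (column names and shares below are made up for illustration):

import pandas as pd

# hypothetical frequency table: share of items of each instance-of class that carry a property
freq = pd.DataFrame({
    'instance_of': ['Q5', 'Q5', 'Q5', 'Q11173', 'Q11173'],
    'property':    ['P21', 'P569', 'P18', 'P231', 'P274'],
    'share':       [0.98, 0.95, 0.35, 0.99, 0.97],
})

CORE_THRESHOLD = 0.9  # tunable: how common a property must be to count as "core"
core_props = (freq[freq['share'] >= CORE_THRESHOLD]
              .groupby('instance_of')['property'].apply(set)
              .to_dict())
print(core_props)  # e.g., {'Q11173': {'P231', 'P274'}, 'Q5': {'P21', 'P569'}}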



[Wikidata-bugs] [Maniphest] T249654: Categorize different types of Wikidata re-use within Wikimedia projects

2020-08-28 Thread Isaac
Isaac closed this task as "Resolved".



[Wikidata-bugs] [Maniphest] T249654: Categorize different types of Wikidata re-use within Wikimedia projects

2020-08-21 Thread Isaac
Isaac updated the task description.



[Wikidata-bugs] [Maniphest] T249654: Categorize different types of Wikidata re-use within Wikimedia projects

2020-08-21 Thread Isaac
Isaac added a comment.


  Weekly update:
  
  - cleaned up the meta page a little: 
https://meta.wikimedia.org/wiki/Research:External_Reuse_of_Wikimedia_Content/Wikidata_Transclusion
  - this task is essentially done but I'm going to leave the task open at least 
another week to allow for continued discussion
  - further research steps in this space would be:
- repeating the analysis in a language like Japanese which shows very 
little transclusion and a language like Catalan that presumably has much more.
- automating the infobox portion of this analysis for enwiki
- moving to the question raised in T246709 
<https://phabricator.wikimedia.org/T246709> which is essentially applying the 
same taxonomy (high-, medium-, low-, no-importance) to Wikidata changes that 
show up in RecentChanges feeds to see how much "noise" appears in them and the 
best ways to provide better filters there.



[Wikidata-bugs] [Maniphest] T249654: Categorize different types of Wikidata re-use within Wikimedia projects

2020-08-13 Thread Isaac
Isaac added a comment.


  @GoranSMilovanovic thanks! I'm pretty open on next steps. This work was done 
in part to help guide interpretation of potential WMF metrics around measuring 
transclusion but I would love to see some improvements made to the way we 
monitor transclusion if possible too. You'll have to let me know what you see 
as feasible / reasonable changes though and what I can do to help make them 
happen. In T246709#6367012 <https://phabricator.wikimedia.org/T246709#6367012> 
I noted that there are two potential improvements I could see made based on my 
very limited knowledge of how lua / these tables work:
  
  - Distinguishing between standard statements and identifiers in Lua calls. If this were then reflected in wbc_entity_usage, it would be much easier to distinguish between transclusion that is part of linked open data and transclusion of facts like birthday, etc. It would also substantially reduce noise in Recent Changes because, at least in English Wikipedia, the very common metadata templates like Authority Control <https://en.wikipedia.org/wiki/Template:Authority_control> and Taxonbar <https://en.wikipedia.org/wiki/Template:Taxonbar> trigger a general C aspect, and so changes to any part of the Wikidata item show up in Recent Changes even when they have no impact on the article. In theory, a filter could then be added to Recent Changes to change how changes to identifiers show up in the feed.
  - I'm not sure if it's possible to distinguish between transclusion and tracking in the wbc_entity_usage table -- e.g., a parameter that can be passed with Lua calls that indicates that the property is only being used for tracking. This might just be a hacky change that long-term isn't useful, but tracking categories generate a lot of the entries in the wbc_entity_usage table and are quite different in impact from transclusion.



[Wikidata-bugs] [Maniphest] T249654: Categorize different types of Wikidata re-use within Wikimedia projects

2020-08-06 Thread Isaac
Isaac added a comment.


  > Thank you for this analysis - really useful!
  
  Thanks! Glad to hear :)
  
  Additionally, I made some notes here about how these findings may inform patrolling of Wikidata transclusion (T246709#6367012 <https://phabricator.wikimedia.org/T246709#6367012>) and am working on hopefully writing this up for the Wikidata Workshop.



[Wikidata-bugs] [Maniphest] T246709: What proportion of a Wikipedia article's edit history might reasonably be changes via Wikidata transclusion?

2020-08-06 Thread Isaac
Isaac added a comment.


  The results reported in T249654#6352573 
<https://phabricator.wikimedia.org/T249654#6352573> have some potential insight 
into how we think about supporting patrolling of Wikidata transclusion within 
Wikipedia articles so I wanted to record some of my initial thoughts here. We 
would want to talk with patrollers before actually thinking about implementing 
any of these and unfortunately I'm not actually working on this aspect of the 
project at the moment. However: the recent changes feed for a given article 
likely has many more Wikidata-related changes than are actually pertinent to an 
article from a patrolling standpoint. Some thoughts on reducing this noise:
  
  - Many entries to wbc_entity_usage are from transclusion that only generates 
tracking categories (e.g., Category:Coordinates on Wikidata 
<https://en.wikipedia.org/wiki/Category:Coordinates_on_Wikidata>) so arguably 
there should be a way to mark events on Recent Changes caused by these as 
tracking-only so patrollers could easily ignore them.
  - Many entries to wbc_entity_usage are from metadata templates like Authority 
Control and Taxonbar that are very valuable from a linked-data perspective but 
less from a reader's perspective and have a very low potential for harmful 
vandalism. Because of the way both of these templates are written, they also 
trigger a general "statements" aspect usage, so any changes to statements on 
the Wikidata item would trigger an event on recent changes. This adds a bunch 
of noise to the Recent Changes feed from Wikidata where these templates are 
used. Additionally, in reality, changes to Wikidata identifiers that impact 
Authority Control and Taxonbar have a very low likelihood of being problematic 
from a reader's perspective because the external links that are generated via 
these templates go to well-curated repositories of information so the reader 
should quickly realize the link is incorrect and probably won't end up viewing 
offensive material. Ideally these templates would be rewritten to only trigger 
the specific properties they transclude, but in practice I could see that being 
difficult, inefficient, or causing the wbc_entity_usage table to become far too 
large to be practical (as each usage of Authority Control would trigger close 
to 100 rows, 1 for each property that can be transcluded 
<https://en.wikipedia.org/wiki/Template:Authority_control#Wikidata>). Instead, 
maybe wbc_entity_usage could be expanded to distinguish between general 
statements (C.S?) and identifiers (C.I?)? This would make filtering out changes 
to identifiers far easier and metadata templates then could still be recorded 
simply without causing every change to date of birth, occupation, etc. to also 
trigger a change. Unfortunately, I suspect this would require making 
non-trivial changes to the Lua modules and then convincing template coders to 
adapt the code.
  - Some entries in wbc_entity_usage come from generating external links that 
could more clearly cause harm if vandalized and probably do warrant focus from 
patrollers. For instance, Wikidata templates that generate links to Commons 
categories or external links to IMDb etc. could more readily be abused to link 
to offensive material. Thankfully, given the specific nature of these 
templates, they are generally recorded with their specific property and so 
don't generate noise for patrollers. That said, a not-insignificant share of 
their usage (on enwiki) is only for tracking categories, so any change that 
distinguishes actual transclusion from tracking categories would also reduce 
noise here.
  - Finally, infobox transclusion probably has the greatest potential for harm 
(e.g., falsifying someone's age or where they were born). This seems to be 
tracked pretty well for most infoboxes (each specific property gets its own 
row, as do the labels for each item that was actually transcluded), so I think 
it's more about reducing the noise from the above so that patrollers can more 
easily see these changes.
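
  Roughly, how that aspect usage breaks down today could be checked with a 
query like the following (untested sketch; the connection details are 
placeholders and the aspect handling reflects my understanding of the 
wbc_entity_usage schema):

  # Untested sketch: count how statement usage is recorded in wbc_entity_usage,
  # split into generic "all statements" rows vs. property-specific rows.
  # Placeholder connection details; column names per the MediaWiki schema.
  import pymysql

  query = """
  SELECT
    CASE
      WHEN eu_aspect = 'C' THEN 'all statements (C)'
      WHEN eu_aspect LIKE 'C.%' THEN 'specific property (C.Pxxx)'
      ELSE 'other aspect'
    END AS usage_type,
    COUNT(*) AS n
  FROM wbc_entity_usage
  GROUP BY usage_type
  """

  conn = pymysql.connect(host='REPLICA_HOST', user='USER',
                         password='PASSWORD', database='enwiki')  # placeholders
  with conn.cursor() as cur:
      cur.execute(query)
      for usage_type, n in cur.fetchall():
          print(usage_type, n)
  conn.close()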

TASK DETAIL
  https://phabricator.wikimedia.org/T246709

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Isaac
Cc: hoo, Ladsgroup, Lydia_Pintscher, Addshore, Capt_Swing, Isaac, Akuckartz, 
darthmon_wmde, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, 
_jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] T249654: Categorize different types of Wikidata re-use within Wikimedia projects

2020-07-31 Thread Isaac
Isaac added a comment.


  > This is "overall articles for all projects", correct?
  
  It's actually just for English Wikipedia. The number from the WMDE dashboard 
<https://wmdeanalytics.wmflabs.org/WD_percentUsageDashboard/> for all Wikipedia 
projects is 31.99% (i.e. the complement of the 68.01% number provided under "% 
of Articles that use Wikidata" in the smaller table that aggregates each 
project family). It varies a lot by wiki too -- vecwiki seems to have almost 
every article with some form of Wikidata transclusion, whereas 62% of articles 
on Japanese Wikipedia don't have a single Wikidata-based template. This data 
was only recently added there (see T257962 
<https://phabricator.wikimedia.org/T257962>).
  
  > How is this calculated?
  > Do you have your selects to group by "importance" using wbc_entity_usage?
  
  Wikidata description usage isn't tracked in wbc_entity_usage as far as I can 
tell, so it can't be queried in any straightforward way. The way I reached the 
54% number is that I checked each article in my sample to see whether the 
description was from Wikidata, using the gadget mentioned here 
<https://en.wikipedia.org/wiki/Wikipedia:Short_description#Making_it_visible_in_the_page>.
 On enwiki at least, Wikidata is the default unless a short description is 
provided on the page, which supposedly is tracked by this category 
<https://en.wikipedia.org/wiki/Category:Articles_with_short_description> (this 
is how I verified the number -- the category has 2.1M pages in it, so about 1/3 
of articles overwrite the Wikidata description). That said, maybe 10-20% of 
articles that used Wikidata didn't actually show a description because it 
hadn't been added in Wikidata yet.
  
  > And, forgot to say, THIS IS SUPER USEFUL, thanks!
  
  Thanks!!

TASK DETAIL
  https://phabricator.wikimedia.org/T249654

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Isaac
Cc: Akuckartz, calbon, Addshore, Lydia_Pintscher, Nuria, MGerlach, 
GoranSMilovanovic, Isaac, Liuxinyu970226, darthmon_wmde, Nandana, Abdeaitali, 
Lahi, Gq86, QZanden, LawExplorer, Avner, _jensen, rosalieper, Scott_WUaS, 
Wikidata-bugs, aude, Capt_Swing, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T246709: What proportion of a Wikipedia article's edit history might reasonably be changes via Wikidata transclusion?

2020-03-03 Thread Isaac
Isaac added a comment.


  @Lydia_Pintscher that makes sense and thanks for reaching out. I'm not going 
to schedule the meeting right now because I don't want to use up your time if 
we don't end up prioritizing this work, but when we do, I'll reach out!

TASK DETAIL
  https://phabricator.wikimedia.org/T246709

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Isaac
Cc: hoo, Ladsgroup, Lydia_Pintscher, Addshore, Capt_Swing, Isaac, 
darthmon_wmde, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, 
_jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T246709: What proportion of a Wikipedia article's edit history might reasonably be changes via Wikidata transclusion?

2020-03-02 Thread Isaac
Isaac added a comment.


  > If we have a concrete example to look at I can try to figure that out :)
  
  Actually, I think I found the reason for most of the pages: 
https://en.wikipedia.org/wiki/Template:Authority_control
  It's generic because it pulls any external identifiers, so it can't be 
determined in advance which ones will be transcluded. And while the "noise" 
from this example could be reduced by somehow indicating in wbc_entity_usage 
whether the properties used are identifiers or statements, I recognize that 
that doesn't solve the larger challenge.
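
  One building block for that identifier-vs-statement distinction would be the 
list of external-identifier properties, which can be pulled from the query 
service. A rough, untested sketch:

  # Untested sketch: fetch all properties with datatype "external identifier"
  # from the Wikidata Query Service, so aspect values like C.P214 could be
  # classified as identifier usage rather than general statement usage.
  import requests

  sparql = """
  SELECT ?property WHERE { ?property wikibase:propertyType wikibase:ExternalId . }
  """
  resp = requests.get('https://query.wikidata.org/sparql',
                      params={'query': sparql, 'format': 'json'})
  resp.raise_for_status()
  identifier_pids = {
      row['property']['value'].rsplit('/', 1)[-1]  # e.g. 'P214' (VIAF ID)
      for row in resp.json()['results']['bindings']
  }
  print(len(identifier_pids), sorted(identifier_pids)[:10])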

TASK DETAIL
  https://phabricator.wikimedia.org/T246709

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Isaac
Cc: hoo, Ladsgroup, Lydia_Pintscher, Addshore, Capt_Swing, Isaac, 
darthmon_wmde, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, 
_jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Retitled] T246709: What proportion of a Wikipedia article's edit history might reasonably be changes via Wikidata transclusion?

2020-03-02 Thread Isaac
Isaac renamed this task from "What percentage of edits via Wikidata 
transclusion are missing on Recent Changes?" to "What proportion of a Wikipedia 
article's edit history might reasonably be changes via Wikidata transclusion?".

TASK DETAIL
  https://phabricator.wikimedia.org/T246709

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Isaac
Cc: hoo, Ladsgroup, Lydia_Pintscher, Addshore, Capt_Swing, Isaac, 
darthmon_wmde, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, 
_jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T246709: What percentage of edits via Wikidata transclusion are missing on Recent Changes?

2020-03-02 Thread Isaac
Isaac added a comment.


  Thanks for the additional details, @Addshore!
  
  Some context: this task isn't being worked on right now. I just created it 
as a potential future analysis because I had become aware that Wikidata item 
properties are tracked specifically in wbc_entity_usage and I think some good 
numbers on this would be valuable to have.
  
  Based on what you said, I think I should rename this task to focus on the 
edit history, as that's closer to what I'm actually interested in (I now see 
the confusion that the current title causes) -- i.e. edits that originate on 
Wikidata and actually change the content of an associated Wikipedia article. 
The Wikidata tracking in the Recent Changes feed still seems far too noisy for 
this purpose. Looking at English Wikipedia for example, the challenge with the 
Recent Changes feed of Wikidata edits 
<https://en.wikipedia.org/wiki/Special:RecentChanges?hidebots=1&hidepageedits=1&hidenewpages=1&hidecategorization=1&hidelog=1&limit=50&days=30&urlversion=2>
 is that almost none of those changes (I couldn't actually find any in my quick 
checking) actually affected the content of the page. Sitelinks obviously matter 
but I'm not considering them at the moment. Almost all the property changes in 
that feed are surfaced because the associated Wikipedia article has a generic C 
row in the wbc_entity_usage table (as opposed to C.Pxxx rows indicating that 
specific properties are being transcluded). I don't actually understand why 
those articles have that generic C row in wbc_entity_usage.
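
  For a single article, a query along these lines (untested sketch; it assumes 
an open DB-API connection `conn` to the enwiki replica, and the column names 
reflect my understanding of the wbc_entity_usage and page schemas) shows 
whether it has a bare C row or only property-specific rows:

  # Untested sketch: list the usage aspects recorded for one article.
  # 'Douglas_Adams' is just a hypothetical example title.
  query = """
  SELECT eu_entity_id, eu_aspect
  FROM wbc_entity_usage
  JOIN page ON eu_page_id = page_id
  WHERE page_namespace = 0
    AND page_title = %s
  ORDER BY eu_aspect
  """
  with conn.cursor() as cur:
      cur.execute(query, ('Douglas_Adams',))
      for entity_id, aspect in cur.fetchall():
          print(entity_id, aspect)  # e.g. a bare 'C' row vs 'C.P569'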
  
  > The approach described in the description is likely to get you some fairly 
unreliable data.
  
  Could you provide some more details here?

TASK DETAIL
  https://phabricator.wikimedia.org/T246709

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Isaac
Cc: hoo, Ladsgroup, Lydia_Pintscher, Addshore, Capt_Swing, Isaac, 
darthmon_wmde, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, 
_jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T209655: Copy Wikidata dumps to HDFS

2020-01-14 Thread Isaac
Isaac added a comment.


  > @JAllemandou Thank you - as ever!
  
  +1: these Wikidata parquet dumps (specifically item_page_link) are super 
useful for us!

TASK DETAIL
  https://phabricator.wikimedia.org/T209655

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Isaac
Cc: Isaac, Groceryheist, MGerlach, WMDE-leszek, abian, leila, Ottomata, Nuria, 
GoranSMilovanovic, Addshore, JAllemandou, bmansurov, 4748kitoko, darthmon_wmde, 
Nandana, Akovalyov, Lahi, Gq86, QZanden, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, terrrydactyl, Wikidata-bugs, aude, Capt_Swing, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-26 Thread Isaac
Isaac added a comment.


  Hey @JAllemandou - this is great! Thanks for catching that - looks all good 
to me now too.

TASK DETAIL
  https://phabricator.wikimedia.org/T215616

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Isaac
Cc: Marostegui, Isaac, Tbayer, jcrespo, EBernhardson, Halfak, Nuria, 
JAllemandou, diego, Nandana, Akovalyov, Banyek, Rayssa-, Lahi, Gq86, 
GoranSMilovanovic, QZanden, LawExplorer, Avner, _jensen, Wikidata-bugs, aude, 
Capt_Swing, Dinoguy1000, Mbch331, Jay8g, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-25 Thread Isaac
Isaac added a comment.


  Hey @JAllemandou, some debugging: a number of items aren't showing up and I 
can't for the life of me figure out why. The few I've looked at are pretty 
normal articles (for example: https://de.wikipedia.org/wiki/Gregor_Grillemeier) 
and show up in the original parquet files 
(`/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190204`).
  
  But according to this analysis (T209891#4798717 
<https://phabricator.wikimedia.org/T209891#4798717>) and ebernhardson's table 
(`SELECT count(page_id) from ebernhardson.cirrus2hive where wikiid = 'enwiki' 
and dump_date='20190121';`), there should be ~5.7 million English articles with 
associated Wikidata items, and I'm only seeing 916 thousand. I went through 
your query but could not find anything that would be causing this dropout, so 
I'm at a loss. Thoughts?
  
  Code in case I'm doing something wrong:
  
  # Read the item_page_link parquet dump and count page links per wiki
  wikidataParquetPath = '/user/joal/wmf/data/wmf/wikidata/item_page_link/20190204'
  spark.read.parquet(wikidataParquetPath).createOrReplaceTempView('wikidata')
  count_per_db = spark.sql('SELECT wiki_db, count(*) FROM wikidata GROUP BY wiki_db')
  
  If you then sort the output, you get:
  
+----------+--------+
|   wiki_db|count(1)|
+----------+--------+
|    zhwiki| 1245854|
|    jawiki| 1210483|
|    enwiki|  916393|
|   cebwiki|  891045|
|    svwiki|  778952|
|    dewiki|  656622|
|    frwiki|  414492|
|    nlwiki|  414469|
|    ruwiki|  413733|
...
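
  For reference, the comparison against ebernhardson's table looks roughly 
like this (untested sketch; the table and column names are taken from the 
query quoted above):

  # Untested sketch: compare enwiki counts between the parquet dump and
  # ebernhardson's cirrus2hive table (reusing the 'wikidata' view from above).
  parquet_count = spark.sql(
      "SELECT COUNT(*) AS n FROM wikidata WHERE wiki_db = 'enwiki'"
  ).collect()[0]['n']
  cirrus_count = spark.sql(
      "SELECT COUNT(page_id) AS n FROM ebernhardson.cirrus2hive "
      "WHERE wikiid = 'enwiki' AND dump_date = '20190121'"
  ).collect()[0]['n']
  print(parquet_count, cirrus_count, parquet_count / cirrus_count)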

TASK DETAIL
  https://phabricator.wikimedia.org/T215616

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Isaac
Cc: Marostegui, Isaac, Tbayer, jcrespo, EBernhardson, Halfak, Nuria, 
JAllemandou, diego, Nandana, Akovalyov, Banyek, Rayssa-, Lahi, Gq86, 
GoranSMilovanovic, QZanden, LawExplorer, Avner, _jensen, Wikidata-bugs, aude, 
Capt_Swing, Dinoguy1000, Mbch331, Jay8g, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-19 Thread Isaac
Isaac added a comment.


  @diego: my interpretation is that right now, in the revision history version, 
the same wikidb/page ID/title is associated with the same Wikidata ID 
regardless of when the revision occurred. What is the use for that over a table 
that has just one entry per wikidb/page ID/title? I'm trying to understand so I 
don't end up making a mistake about my interpretation of the links.

TASK DETAIL
  https://phabricator.wikimedia.org/T215616

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Isaac
Cc: Isaac, Tbayer, jcrespo, EBernhardson, Halfak, Nuria, JAllemandou, diego, 
Nandana, Akovalyov, Banyek, AndyTan, Rayssa-, Lahi, Gq86, GoranSMilovanovic, 
QZanden, Marostegui, LawExplorer, Avner, Minhnv-2809, _jensen, Luke081515, 
Wikidata-bugs, aude, Capt_Swing, Dinoguy1000, Mbch331, Jay8g, Krenair, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-19 Thread Isaac
Isaac added a comment.


  Thank you @JAllemandou, this is awesome!!! Completely unblocks me (I have a 
bunch of page titles across all the Wikipedias and need to check whether a pair 
of them match the same Wikidata item)!

TASK DETAIL
  https://phabricator.wikimedia.org/T215616

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Isaac
Cc: Isaac, Tbayer, jcrespo, EBernhardson, Halfak, Nuria, JAllemandou, diego, 
Nandana, Akovalyov, Banyek, AndyTan, Rayssa-, Lahi, Gq86, GoranSMilovanovic, 
QZanden, Marostegui, LawExplorer, Avner, Minhnv-2809, _jensen, Luke081515, 
Wikidata-bugs, aude, Capt_Swing, Dinoguy1000, Mbch331, Jay8g, Krenair, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T215413: Image Classification Working Group

2019-02-07 Thread Isaac
Isaac added a comment.


  If we go down the pathway of trying to identify which images are 
photographs, we should look into work by a former colleague of mine on 
detecting visualizations on Commons (in some ways, the inverse task): 
http://brenthecht.com/publications/www18_vizbywiki.pdf
  
  He (Allen Lin) might have some insight into easy wins or pitfalls in 
building a model like that.

TASK DETAIL
  https://phabricator.wikimedia.org/T215413

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Miriam, Isaac
Cc: Mholloway, PDrouin-WMF, Krenair, d.astrikov, JoeWalsh, Nirzar, dcausse, 
fgiunchedi, JAllemandou, leila, Capt_Swing, mpopov, Nuria, DarTar, Halfak, 
Gilles, EBernhardson, dr0ptp4kt, Harej, MusikAnimal, Abit, elukey, diego, 
Cparle, Ramsey-WMF, Miriam, Isaac, Nandana, JKSTNK, Akovalyov, Lahi, Gq86, 
E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, QZanden, EBjune, 
Tramullas, Acer, V4switch, LawExplorer, Avner, Silverfish, _jensen, 
Susannaanas, Wong128hk, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, 
Ricordisamoa, Wesalius, Lydia_Pintscher, Fabrice_Florin, Raymond, 
Steinsplitter, Matanya, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs