I finally found the time to play extensively with Mix’n’match and it’s by far 
one of the most promising models I’ve come across for Wikidata growth. A short 
conversation with Magnus on Twitter got me thinking about how best to preserve 
the output of costly human curation.[1]

I spent most of my time manually auditing automatically matched entries from 
the Dizionario Biografico degli Italiani (DBI) [2]. These are long, 
unstructured biographical entries and it takes quite a lot of effort to 
determine whether the two individuals referenced by Wikidata and DBI actually are 
the same person. This is a great example of a task that’s still pretty hard for 
a machine to perform, no matter how sophisticated the algorithm.

My favorite example? Mix’n’match suggested a match between Giulio Baldigara 
(Q1010811 <https://www.wikidata.org/wiki/Q1010811>) and Giulio Baldigara (DBI 
<http://www.treccani.it/enciclopedia/giulio-baldigara_(Dizionario_Biografico)/>)
which looked totally legitimate: these two individuals are both Italian 
architects from the 16th century with the same name, they were both born around 
the same time in the same city, and they were both active in Hungary at the same 
time. A strong indication that they are the same person, right? It turns out they 
are brothers, and the full name of the person referenced in Wikidata is Giulio 
Cesare Baldigara (the least well known in a family of architects). I unmatched the 
suggestion and flagged the DBI entry as not existing in Wikidata.

My concern at the moment is this: the output of a labor-intensive review of a 
potential match is currently stored as a volatile flag in a tool hosted on 
Labs, but is invisible in Wikidata. Should something happen to Mix’n’match (god 
forbid), the result of my work would be lost. Which got me thinking:

- shouldn’t an item for a manually unmatched entry be created directly on 
Wikidata? (After all, DBI is all about notable individuals who would easily 
pass Wikidata’s notability threshold for biographies.)
- shouldn’t the relation between Giulio (Cesare) Baldigara (Q1010811 
<https://www.wikidata.org/wiki/Q1010811>) and the newly created item for Giulio 
Baldigara be explicitly represented via a “not the same as” property, to 
prevent future humans or machines from accidentally merging the two items 
based on some kind of heuristic?
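
To make the second point a bit more concrete, here is a rough sketch in Python 
of what I have in mind. Everything in it is an assumption for illustration: 
the property ID is a placeholder, "Q_NEW" stands in for an item that does not 
exist yet, and the Statement class and both functions are made up rather than 
taken from Mix’n’match or any Wikidata tooling.

```python
from dataclasses import dataclass
from typing import List

# Assumption: placeholder ID for a "not the same as" property.
NOT_SAME_AS = "P_NOT_SAME_AS"


@dataclass(frozen=True)
class Statement:
    subject: str   # item the statement is attached to
    property: str  # property ID
    value: str     # item the statement points at


def record_unmatch(existing_item: str, new_item: str) -> List[Statement]:
    """Return statements in both directions saying the two items are distinct."""
    return [
        Statement(existing_item, NOT_SAME_AS, new_item),
        Statement(new_item, NOT_SAME_AS, existing_item),
    ]


def may_merge(a: str, b: str, statements: List[Statement]) -> bool:
    """Refuse any pair that a human reviewer has already marked as distinct."""
    return not any(
        s.property == NOT_SAME_AS and {s.subject, s.value} == {a, b}
        for s in statements
    )


# Q1010811 is the existing item; "Q_NEW" is a stand-in for the new one.
statements = record_unmatch("Q1010811", "Q_NEW")
print(may_merge("Q1010811", "Q_NEW", statements))
```

The idea is simply that the human decision lives next to the data as an 
ordinary statement, so any matching heuristic can check it before proposing 
the same pair again.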

Thoughts welcome,

Dario

[1] https://twitter.com/ReaderMeter/status/667214565621432320
[2] https://tools.wmflabs.org/mix-n-match/?mode=catalog&catalog=55&offset=0&show_noq=0&show_autoq=1&show_userq=0&show_na=0


_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
