To address the first point: the auto-matches are just simple label matches.
Removing an automatch in mix'n'match only says that the entry and the
suggested item are not the same person (or thing); the entry is then moved
back to the "unmatched" pool.

This does /not/ mean there isn't a match on Wikidata! You only say that by
setting the entry to "not on Wikidata". I do occasionally batch-create
items for those entries, usually once all entries are processed. That has
its own issues: an item may have been created in the meantime, in which
case I create a duplicate.

A solution could be to change the "not on Wikidata" button (or link) to a
"create new item" button. The new item would have a label, a description
(maybe), a statement with the catalog ID (if there is an associated
Wikidata property!), and "instance of:human" if the entry is internally
marked as "person", but nothing else.
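
As a rough sketch of what such a minimal item could contain, the following
builds the item data as a plain dictionary. The layout is modeled on the
JSON accepted by the Wikidata API's wbeditentity action; P31 ("instance of")
and Q5 ("human") are real Wikidata IDs, while the function names, the
parameters, and the property "P0000" and ID "example-id" in the usage lines
are placeholders made up for illustration. Nothing here talks to Wikidata,
and this is not how mix'n'match itself is implemented.

```python
from typing import Optional


def _statement(prop: str, datavalue: dict) -> dict:
    """Wrap a datavalue in a normal-rank statement for the given property."""
    return {
        "mainsnak": {"snaktype": "value", "property": prop,
                     "datavalue": datavalue},
        "type": "statement",
        "rank": "normal",
    }


def build_new_item(label: str,
                   description: Optional[str],
                   catalog_property: Optional[str],
                   catalog_id: str,
                   is_person: bool,
                   lang: str = "en") -> dict:
    """Build minimal item data: label, optional description, optional
    catalog ID statement, and "instance of: human" for persons."""
    data = {"labels": {lang: {"language": lang, "value": label}},
            "claims": []}
    if description:
        data["descriptions"] = {lang: {"language": lang,
                                       "value": description}}
    if catalog_property:
        # Only possible if the catalog has an associated Wikidata property.
        data["claims"].append(_statement(
            catalog_property, {"value": catalog_id, "type": "string"}))
    if is_person:
        # P31 = instance of, Q5 = human
        data["claims"].append(_statement(
            "P31", {"value": {"entity-type": "item", "numeric-id": 5},
                    "type": "wikibase-entityid"}))
    return data


# Usage with placeholder values ("P0000" and "example-id" are not real).
item = build_new_item("Giulio Baldigara", None, "P0000", "example-id", True)
print([c["mainsnak"]["property"] for c in item["claims"]])
```

For a catalog without a Wikidata property, passing None as the property
leaves only the label and, for persons, the "instance of" statement.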

Would that be welcomed by "mix'n'matchers", and Wikidata people? I think it
would make sense, for catalogs with a Wikidata property at least.


As for the second point, I think in most cases the mere existence of a new,
better-fitting item (or at least one equally fitting at first glance) will
prevent false assignments. Sure, there are some cases, like the one given
as an example, which would profit from a P1889 "different from" statement.
We have run into that problem with the "merge game" I'm running, where
people do a lot of false merges because the items seem identical at first
glance.

However, I don't think this is prevalent enough to warrant special
treatment in mix'n'match itself. For the few cases where it would help,
Wikidata can always be edited manually. Besides, where would we draw the
line? "John Smith" returns hundreds of search results; that would translate
into tens of thousands of "different from" statements.
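
To make that estimate concrete: marking every pair among n same-named items
as different takes n(n-1)/2 pairs, or twice that if the statement is added
in both directions. A small sketch, where the figure of 200 items is only a
stand-in for "hundreds" and the function name is made up:

```python
from math import comb


def different_from_statements(n_items: int,
                              both_directions: bool = True) -> int:
    """Count the statements needed to mark all n_items pairwise different."""
    pairs = comb(n_items, 2)  # n * (n - 1) / 2 unordered pairs
    return 2 * pairs if both_directions else pairs


print(different_from_statements(200, both_directions=False))  # 19900
print(different_from_statements(200))                         # 39800
```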

I think once an item for the brother in your "Giulio Baldigara" example is
created, and both show up in search results, that alone will be enough to
serve as a "different from" warning in most settings.
Mix'n'match automatch, for example, will only match entries where the exact
label is unique in labels and aliases; two items with a "Giulio Baldigara"
label or alias would not automatch an entry with that name.
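
The uniqueness rule described above can be sketched roughly as follows. This
is an illustration and not the actual mix'n'match code: the function name
and item layout are made up, exact case-sensitive comparison is an
assumption, the alias on Q1010811 is assumed for the example, and "Q_NEW" is
a placeholder for an item that does not exist yet.

```python
from typing import Iterable, Optional


def automatch(entry_name: str, items: Iterable[dict]) -> Optional[str]:
    """Return the QID of the only item whose label or aliases contain
    entry_name exactly; return None if no item or several items qualify."""
    hits = [item["qid"] for item in items
            if entry_name == item["label"] or entry_name in item["aliases"]]
    return hits[0] if len(hits) == 1 else None


items = [
    {"qid": "Q1010811", "label": "Giulio Cesare Baldigara",
     "aliases": ["Giulio Baldigara"]},
]
# Only one item carries the name, so the entry is (wrongly) automatched.
print(automatch("Giulio Baldigara", items))  # Q1010811

# Hypothetical new item for the brother.
items.append({"qid": "Q_NEW", "label": "Giulio Baldigara", "aliases": []})
# The name is no longer unique, so no automatch is made.
print(automatch("Giulio Baldigara", items))  # None
```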


On Sat, Nov 21, 2015 at 5:35 PM Dario Taraborelli <
dtarabore...@wikimedia.org> wrote:

> I finally found the time to play extensively with Mix’n’match and it’s by
> far one of the most promising models I’ve come across for Wikidata growth.
> A short conversation with Magnus on Twitter got me thinking on how to best
> preserve the output of costly human curation.[1]
>
> I spent most of my time manually auditing automatically matched entries
> from the Dizionario Biografico degli Italiani [2]. These entries are long,
> unstructured biographical entries and it takes quite a lot of effort to
> understand if the two individuals referenced by Wikidata and DBI actually
> are the same person. This is a great example of a task that’s still pretty
> hard for a machine to perform, no matter how sophisticated the algorithm.
>
> My favorite example? Mix’n’match suggested a match between *Giulio
> Baldigara *(Q1010811 <https://www.wikidata.org/wiki/Q1010811>) and *Giulio
> Baldigara* (DBI
> <http://www.treccani.it/enciclopedia/giulio-baldigara_(Dizionario_Biografico)/>)
> which looked totally legitimate: these two individuals are both Italian
> architects from the 16th century with the same name, they were both born
> around the same years in the same city, they were both active in Hungary at
> the same time: strong indication that they are the same person, right? It
> turns out they are brothers and the full name of the person referenced in
> Wikidata is *Giulio Cesare Baldigara* (the least known in a family of
> architects). I unmatched the suggestion and flagged the DBI entry as non
> existing in Wikidata.
>
> My question at the moment is: the output of a labor-intensive review of a
> potential match is currently stored as a volatile flag in a tool hosted on
> labs, but is invisible in Wikidata. Should something happen to Mix’n’match
> (god forbid) the result of my work would get lost. Which got me thinking:
>
> - shouldn’t a manually unmatched item be created directly on Wikidata
> (after all DBI is all about notable individuals who would easily pass
> Wikidata’s notability threshold for biographies)
> - shouldn’t the relation between *Giulio (Cesare) Baldigara* (Q1010811
> <https://www.wikidata.org/wiki/Q1010811>) and the newly created item for
> *Giulio Baldigara* be explicitly represented via a *not the same as*
> property, to prevent future humans or machines from accidentally remerging
> the two items based on some kind of heuristics
>
> Thoughts welcome,
>
> Dario
>
> [1] https://twitter.com/ReaderMeter/status/667214565621432320
> [2]
> https://tools.wmflabs.org/mix-n-match/?mode=catalog&catalog=55&offset=0&show_noq=0&show_autoq=1&show_userq=0&show_na=0
>
>
> _______________________________________________
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>