Jheald added a comment.

I'd say don't give up too easily. This is probably as good an approach as any. If the issues are structural, bots will fall prey to them in just the same way, just more slowly and more haphazardly.

But it will probably need a fair number of iterations to get close to being "right" -- this isn't something that's going to be got right first time, not even nearly.

As regards (2), I am not convinced that you are really going to see that serious a blow-up.

Here are variants of your poodle and your wasp queries, tinyurl.com/yc464xdc and tinyurl.com/yajn9wou, using eg

wd:Q38904 wdt:P31?/(wdt:P279|(p:P31/pq:P642)|wdt:P171)* ?item

as their path statement, using p:P31/pq:P642 as a workaround to include the (horrible) "common name ... of" link, and allowing each step to traverse either via that or up a subclass link or up a taxon link. It still only returns 72 and 43 items respectively, so that's not much worse (ie bigger) than you were generating.
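To make the shape of that path statement concrete, here is a minimal Python sketch of the same traversal over a toy in-memory graph. Everything in it is an assumption for illustration: the item names, the edge tables and the `reachable` helper are made up and are not real Wikidata data or any query-service API. It only shows what "an optional instance-of hop, then any number of steps via subclass, common-name-of or parent taxon" expands to.

```python
from collections import deque

# Toy data only: hypothetical items and edges, not real Wikidata content.
INSTANCE_OF = {"poodle": ["dog breed"]}  # stands in for wdt:P31
STEP_EDGES = {
    "subclass_of": {"poodle": ["dog"], "dog": ["pet"]},         # wdt:P279
    "common_name_of": {"dog": ["Canis lupus familiaris"]},      # p:P31/pq:P642
    "parent_taxon": {                                           # wdt:P171
        "Canis lupus familiaris": ["Canis lupus"],
        "Canis lupus": ["Canis"],
    },
}


def reachable(start):
    """Items matched by: start P31?/(P279|P31-of|P171)* ?item, on the toy graph."""
    # "P31?" means zero or one instance-of hop, so the start item itself counts.
    frontier = deque([start, *INSTANCE_OF.get(start, [])])
    seen = set(frontier)
    # "(...)*" means any number of steps, each via any one of the three edge kinds.
    while frontier:
        node = frontier.popleft()
        for edges in STEP_EDGES.values():
            for nxt in edges.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return seen


print(sorted(reachable("poodle")))
```

The result set grows with the number of distinct items reachable, not with the number of paths to them, which is one reason the alternation need not blow up as badly as one might fear.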

One of the specific key problems in this case, adding confusion to the list, is the specification of "dog" or "canis familiaris" (via "taxon") as a "name", leading to a whole slew of abstract items. This frankly is a nonsense, and the community needs to be told to get its act in order -- this modelling is having serious consequences for item interpretation. Across the rest of Wikidata, items represent things, not names of things. Yes, a taxon in many respects is a name, but we're using it to refer to a thing, and that needs to be the clear priority for the items. The fact that a particular taxon item also has the quality of representing a name needs to be represented in a different way, not by making "taxon" subclass of "name". I suspect it may take quite harsh pressure to actually impose this, but I think this is the kind of area where it might be quite useful for the tech and community liaison team to strongly suggest to the community that the current modelling is causing significant difficulties.

As for "instance of" + "common name" + "of", that's a nonsense, and the sooner we have a new specific property to express that relation specifically, the better.

But ... even fixing the "taxon" subclass of "name" issue is not going to solve the problem of ending up with weird stuff from the top of the chain. A couple of months ago I wrote a query to pull out some items that were descended from both "physical object" and "abstract object": tinyurl.com/ya4spc62 (discussion). Our ontology at Wikidata is a mess, in so many places, and will likely take years to slowly resolve (if ever).

Besides, in many of these cases, you probably want to have items in the left-hand column discoverable for searches in the right-hand ("abstract") column -- eg you probably do want examples of African masks to come up in a search for "African art", even if Wikidata considers the former a concrete thing and the latter an abstract thing. So if you follow the subclass tree further up, you will get to all those bonkers very broad abstract concepts, which African masks are definitely not examples of.

But the good news (contra your conclusion #1) is that the items you will want to exclude via your stop-list are likely to be the same ones every time, and you can probably define them by saying "everything in the subclass (P279) tree above this item", for quite a short list of items. One could even write that into the query fairly easily with a MINUS clause, giving something like this tinyurl.com/y833nt2f, though that might not be the most efficient way to do it for production use.
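As a rough illustration of that stop-list idea outside SPARQL, here is a small Python sketch. The subclass table, the item names and the helpers `ancestors` and `apply_stop_list` are all hypothetical, and I am assuming the stop item itself is excluded along with everything above it (which is what a `wdt:P279*` path inside MINUS would do).

```python
# Toy subclass (P279) table: hypothetical items, not real Wikidata content.
SUBCLASS_OF = {
    "African mask": ["mask"],
    "mask": ["artefact"],
    "artefact": ["object"],
    "object": ["entity"],
    "taxon": ["name"],
    "name": ["abstract object"],
    "abstract object": ["entity"],
}


def ancestors(item, subclass_of):
    """Return item plus everything above it in the subclass tree (like wdt:P279*)."""
    seen = {item}
    stack = [item]
    while stack:
        node = stack.pop()
        for parent in subclass_of.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def apply_stop_list(candidates, stop_roots, subclass_of):
    """Drop every candidate at or above any stop root (the MINUS step)."""
    blocked = set()
    for root in stop_roots:
        blocked |= ancestors(root, subclass_of)
    return set(candidates) - blocked


candidates = {"mask", "artefact", "object", "name", "abstract object", "entity"}
print(sorted(apply_stop_list(candidates, ["name"], SUBCLASS_OF)))
```

With a single stop root, "name", the toy run keeps "artefact", "mask" and "object" and drops "name", "abstract object" and "entity". For production use the blocked set could presumably be computed once and cached, rather than recomputed per query.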

And, yes, you will probably need a whitelist too -- for example, adding "woman" as a search term for every human that is female (if you're okay with that including depicted females that happen to be children). Also, it seems you're going to need to add "human" as a search term for "female". (btw I have no idea why Infovarius made this edit. It would seem to be something well worth bringing up at Wikidata's "Project Chat" discussion page, including a ping to Infovarius to see whether he would explain his reasoning).

One thing I would suggest, though, is setting up a Wikidata WikiProject page for the project, advertising it on Project Chat, and then discussing or reviewing your thoughts for particular parts of the subject tree there. Yes, involving the community in discussion will add a huge time overhead; but with luck you may get people coming forward who really know their way around particular parts of the project tree and may make some excellent suggestions -- or who may realise that some particular bits of ontology are causing real problems that would benefit from a root-and-branch rethink.


TASK DETAIL
https://phabricator.wikimedia.org/T199119

To: Cparle, Jheald
Cc: Jheald, Aklapper, Ramsey-WMF, Abit, Cparle, Lahi, PDrouin-WMF, Gq86, E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, QZanden, Tramullas, Acer, V4switch, LawExplorer, Susannaanas, Wong128hk, Aschroet, Jane023, Wikidata-bugs, Base, matthiasmullie, aude, Ricordisamoa, Lydia_Pintscher, Fabrice_Florin, Raymond, Steinsplitter, Matanya, Mbch331