Re: [Wikidata] Wikidata considered unable to support hierarchical search in Structured Data for Commons

Markus Kroetzsch Fri, 19 Oct 2018 02:41:04 -0700

Hi James,

On 19/10/2018 01:09, James Heald wrote:

On 18/10/2018 22:33, Markus Kroetzsch wrote:
And, on another note, there is also a huge misunderstanding exposed in the discussion on the search-related tracker item [1]: Cparle there speaks about "traversing the subclass hierarchy" but is actually looking at *super*classes of, e.g., "Clarinet", which he mostly finds irrelevant to users who care about clarinets. But surely that's the wrong direction! You have to look for *sub*classes to find special cases of what you are looking for. Looking downwards will often lead to much saner ontologies than when turning your head towards the dizzy heights of upper ontology. Yes, the few of us looking for instances of "logical consequence" will still get clarinets, but those who look for instances of clarinet will merely see instances of alto clarinet, piccolo clarinet, basset horn, Saxonette, and so on [2]. So instead of trying to suggest to Commons editors meaningful "upper concepts", one could simply enable the use of lower concepts in search. It does not work in all cases yet, but it does in many.
Not really.
Cparle wants to make sure that people searching for "clarinet" also get shown images of "piccolo clarinet" etc.
To make this possible, where an image has been tagged "basset horn" he is therefore looking to add "clarinet" as an additional keyword, so that if somebody types "clarinet" into the search box, one of the images retrieved by ElasticSearch will be the basset horn one.
I imagine there are pluses and minuses both ways, whether you try to make sure one search returns more hits, or try to run multiple searches each returning fewer hits.
Your suggestion of the latter approach may not involve so much pre-investigation of the top of the tree, which may be terms that people are less likely to search for; but on the other hand, the actual searching may be less efficient than a single indexed search.

True, but with the Wikidata Query Service we already have infrastructure that completes millions of search requests of this kind (involving path queries), so that seems doable for Commons as well. WDQS already has Wikimedia API bindings that allow it to use Lucene-based results in addition, if needed (though this would only make sense if the search should use some content that for some reason cannot be imported into a query service as graph data, mostly free-text search over longer texts).

I think the approach of completing tags towards the upper classes is not a good idea in general, since it creates extra work for editors that requires a million times the resources needed in the other approach: if the subclass hierarchy is wrong, you only need to fix it once to improve search for all existing Commons content; if you rely on manual extra tags, you'd have to add them to every file on Commons and keep them up-to-date with changes in the concepts -- an enormous, redundant effort that will invariably lead to a very non-uniform search experience across otherwise similar media. This seems like a huge waste of editors' time even if it worked (i.e., if we lived in a world where the superclasses of a class were easy to understand and closely related to the topic that an editor is working on -- which will never happen for Wikidata or Commons, since both cover such a breadth of topics that their upper ontology necessarily has to be very general even if modelled in a clean and fully correct way).
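
To make the contrast concrete, here is a minimal sketch of the "look downwards at query time" idea. It is only an illustration under stated assumptions: the subclass pairs, the file names, and the helper names (subclasses_of, search) are all hypothetical toy data, not real Wikidata statements or any actual WDQS or Commons API.

```python
from collections import defaultdict

# Hypothetical toy data: (subclass, superclass) pairs, not real Wikidata statements.
SUBCLASS_OF = [
    ("alto clarinet", "clarinet"),
    ("piccolo clarinet", "clarinet"),
    ("basset horn", "clarinet"),
    ("clarinet", "woodwind instrument"),
]

# Hypothetical files, each carrying only its single most specific tag.
FILE_TAGS = {
    "File:A.jpg": "basset horn",
    "File:B.jpg": "clarinet",
    "File:C.jpg": "oboe",
}


def subclasses_of(concept, edges):
    """Return the concept plus all of its direct and indirect subclasses."""
    children = defaultdict(set)
    for sub, sup in edges:
        children[sup].add(sub)
    seen = {concept}
    stack = [concept]
    while stack:
        current = stack.pop()
        for child in children[current]:
            if child not in seen:  # guards against cycles in the hierarchy
                seen.add(child)
                stack.append(child)
    return seen


def search(concept, edges, file_tags):
    """Find files tagged with the concept or any of its subclasses."""
    wanted = subclasses_of(concept, edges)
    return sorted(name for name, tag in file_tags.items() if tag in wanted)


# Searching downwards from "clarinet" also finds the basset horn file.
clarinet_hits = search("clarinet", SUBCLASS_OF, FILE_TAGS)

# One fix to the hierarchy (adding the missing oboe link) improves search
# for every file tagged "oboe", without touching any file.
fixed_edges = SUBCLASS_OF + [("oboe", "woodwind instrument")]
woodwind_before = search("woodwind instrument", SUBCLASS_OF, FILE_TAGS)
woodwind_after = search("woodwind instrument", fixed_edges, FILE_TAGS)
```

In this toy setup no file ever needs an extra "clarinet" or "woodwind instrument" tag; the single added pair in fixed_edges plays the role of the one-time hierarchy fix.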


Cheers,

Markus

There are still problems (such as the biological taxonomy being modelled as a hierarchy of names rather than animal classes, placing dog far away from mammal), but it is still always much easier to come up with a sane organisation for the *sub*classes of a concrete class.
For what it's worth, there's currently quite a lively discussion on Project Chat about issues with the current modelling of biological taxonomies, https://www.wikidata.org/wiki/Wikidata:Project_chat#Taxonomy:_concept_centric_vs_name_centric
People on this thread might like to comment on some of the less fortunate elements of current practice, and the appropriateness of some of the thoughts that have been suggested.
But the taxo project has become such a walled garden, answerable only to itself, that people with comments may need to be quite forceful to get their message through, if we are to deal, e.g., with some of the difficulties Cparle describes in the ticket at
  https://phabricator.wikimedia.org/T199119

   -- James.



_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
