Hi Stas and Hay,

On 28-07-18 02:12, Stas Malyshev wrote:
> Hi!
>
>> I could definitely see a usecase for 1) and maybe for 2). For example,
>> let's say i remember that one movie that Rutger Hauer played in, just
>> searching for 'movie rutger hauer' gives back nothing:
>>
>> https://www.wikidata.org/w/index.php?search=movie+rutger+hauer
>>
>> While Wikipedia gives back quite a nice list of options:
>>
>> https://en.wikipedia.org/w/index.php?search=movie+rutger+hauer
>
> Well, this is not going to change with the work we're discussing. The
> reason you don't get anything from Wikidata is because "movie" and
> "rutger hauer" are labels from different documents and ElasticSearch
> does not do joins. We only index each document in itself, and possibly
> some additional data, but indexing labels from other documents is now
> beyond what we're doing. We could certainly discuss it but that would be
> separate (and much bigger) discussion.

Changing the subject line because I would like to start this separate and bigger discussion.

Query and search are quite similar, but also very different (a quick search turns up nice articles like https://everypageispageone.com/2011/07/13/search-vs-query/ ). Currently our query service is very strong and complete, but Wikidata search is very poor. Let's take Blade Runner as an example:
* https://www.wikidata.org/wiki/Q184843 is what a human sees
* http://www.wikidata.org/entity/Q184843.json is our internal JSON structure
* http://www.wikidata.org/entity/Q184843.rdf is the source for the query engine
* https://www.wikidata.org/w/index.php?title=Q184843&action=cirrusdump is what's indexed in the search engine

In my ideal world, everything I see as a human gets indexed into the search engine, preferably in a per-language index. For Dutch, for example, that would be something like a text_nl field containing the label, description, aliases, statements and references. So index *everything*, and never put a Q-number or P-number in there (an extra incentive for people to add labels in their language). Probably everything should also be duplicated in the text field as a fallback. Such an index would contain the words needed to match "movie Rutger Hauer", and it would contain the cast members ("rolverdeling: Harrison Ford" etc.). Yes, this will significantly increase the index size, but it will make it much easier to actually find things.
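
To make the idea a bit more concrete, here is a rough sketch in Python of how the content of such a text_nl field could be built from our JSON structure. This is only an illustration and not how CirrusSearch or Wikibase work today. The function names, the label lookup table and the sample data are all made up for the example (the two actor item IDs are placeholders, not real IDs), and qualifiers and references are left out to keep it short.

```python
# Hypothetical sketch: flatten one entity into per-language search text.
# Assumes the usual Wikibase JSON layout (labels, descriptions, aliases, claims).


def label_for(entity_id, labels, lang):
    """Return the label of a Q/P id in `lang`, or None if none is known."""
    return labels.get(entity_id, {}).get(lang)


def build_text_field(entity, labels, lang):
    """Build the text for a hypothetical text_<lang> field, without Q/P ids."""
    parts = []

    # Label and description of the item itself.
    for section in ("labels", "descriptions"):
        term = entity.get(section, {}).get(lang)
        if term:
            parts.append(term["value"])

    # Aliases are a list of terms per language.
    parts.extend(a["value"] for a in entity.get("aliases", {}).get(lang, []))

    # Statements become "property label: value label" lines.
    for pid, statements in entity.get("claims", {}).items():
        prop_label = label_for(pid, labels, lang)
        if prop_label is None:
            continue  # no label in this language, so nothing readable to index
        for statement in statements:
            snak = statement.get("mainsnak", {})
            if snak.get("snaktype") != "value":
                continue  # skip "somevalue" / "novalue"
            datavalue = snak["datavalue"]
            if datavalue["type"] == "wikibase-entityid":
                value = label_for(datavalue["value"]["id"], labels, lang)
            elif datavalue["type"] == "string":
                value = datavalue["value"]
            else:
                value = None  # dates, quantities etc. are ignored in this sketch
            if value:
                parts.append(f"{prop_label}: {value}")

    return "\n".join(parts)


def item_snak(pid, qid):
    """Helper for the sample data: a statement pointing at another item."""
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": pid,
            "datavalue": {"type": "wikibase-entityid", "value": {"id": qid}},
        }
    }


# Hand-written sample data. Q900001 and Q900002 are placeholder ids.
SAMPLE_ENTITY = {
    "id": "Q184843",
    "labels": {"nl": {"language": "nl", "value": "Blade Runner"}},
    "descriptions": {"nl": {"language": "nl", "value": "film"}},
    "aliases": {},
    "claims": {"P161": [item_snak("P161", "Q900001"), item_snak("P161", "Q900002")]},
}
SAMPLE_LABELS = {
    "P161": {"nl": "rolverdeling"},
    "Q900001": {"nl": "Harrison Ford"},
    "Q900002": {"nl": "Rutger Hauer"},
}

if __name__ == "__main__":
    print(build_text_field(SAMPLE_ENTITY, SAMPLE_LABELS, "nl"))
```

The point of the sketch is that the labels of linked items and properties get copied into the document at index time, so the search engine does not need to do a join at query time. The price is the larger index, plus the need to reindex an item when the label of something it links to changes.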

As for implementation: we already have the logic to serialize our JSON to the RDF format. Maybe we could also add a serialization format that is easy for search engines to ingest? I noticed Google having a hard time indexing some of our items, see for example https://www.google.com/search?q=The+Feast+of+the+Seagods+site%3Awikidata.org&ie=utf-8&oe=utf-8 . DuckDuckGo seems to be doing a better job: https://duckduckgo.com/?q=The+Feast+of+the+Seagods+site%3Awikidata.org&t=h_&ia=web . Making our items easier to index, and not only for our own search, would be a nice added benefit.

How feasible is this? Do we already have one or more tasks for this on Phabricator? Phabricator has gotten a bit unclear when it comes to Wikidata search, I think because people have different understandings of what the goal of each task is. It might be worthwhile to spend some time on structuring that.

Maarten

_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
