[Wikidata] Indexing everything (was Re: Indexing all item properties in ElasticSearch)

Maarten Dammers Sat, 04 Aug 2018 03:29:46 -0700

Hi Stas and Hay,


On 28-07-18 02:12, Stas Malyshev wrote:

Hi!

I could definitely see a usecase for 1) and maybe for 2). For example,
let's say i remember that one movie that Rutger Hauer played in, just
searching for 'movie rutger hauer' gives back nothing:

https://www.wikidata.org/w/index.php?search=movie+rutger+hauer

While Wikipedia gives back quite a nice list of options:

https://en.wikipedia.org/w/index.php?search=movie+rutger+hauer

Well, this is not going to change with the work we're discussing. The
reason you don't get anything from Wikidata is because "movie" and
"rutger hauer" are labels from different documents and ElasticSearch
does not do joins. We only index each document in itself, and possibly
some additional data, but indexing labels from other documents is now
beyond what we're doing. We could certainly discuss it but that would be
separate (and much bigger) discussion.

Changing the topic because I would like to start this separate and bigger discussion. Query and search are quite similar, but also very different (if you search you'll run into nice articles like https://everypageispageone.com/2011/07/13/search-vs-query/ ). Currently our query service is a very strong and complete service, but Wikidata search is very poor. Let's take Blade Runner.

* https://www.wikidata.org/wiki/Q184843 is what a human sees
* http://www.wikidata.org/entity/Q184843.json our internal JSON structure
* http://www.wikidata.org/entity/Q184843.rdf source for the query engine

* https://www.wikidata.org/w/index.php?title=Q184843&action=cirrusdump what's indexed in the search engine

In my ideal world, everything I see as a human gets indexed into the search engine, preferably in a per-language index. For example, for Dutch, something like a text_nl field with the label, description, aliases, statements and references in there. So index *everything* and never see a Qnumber or Pnumber in there (extra incentive for people to add labels in their language). Probably also everything duplicated in the text field to fall back to. In this index you would have the "movie Rutger Hauer", you would have the cast members ("rolverdeling: Harrison Ford" etc.). Yes, this will give a significant increase in index size, but it will make it much easier to actually find things.
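To make the idea a bit more concrete, here is a rough Python sketch of how one item could be flattened into such a text_nl string. It is only an illustration under assumptions, not how CirrusSearch or Wikibase actually work: the function name build_language_text and the labels lookup table are made up for this example, the sample item is hand-written in the shape of our entity JSON rather than fetched, and only main statement values are handled (qualifiers and references are left out).

```python
def build_language_text(entity, lang, labels):
    """Flatten one entity into plain text for a per-language search field.

    `labels` is a hypothetical lookup {entity_id: {lang: label}} that a real
    indexer would have to fill from the labels of other items and properties.
    """
    parts = []

    # Label and description in the requested language, if present.
    for section in ("labels", "descriptions"):
        term = entity.get(section, {}).get(lang)
        if term:
            parts.append(term["value"])

    # Aliases are a list of terms per language.
    for alias in entity.get("aliases", {}).get(lang, []):
        parts.append(alias["value"])

    # Statements: emit "property label: value label", never a raw P/Q id.
    for prop_id, statements in entity.get("claims", {}).items():
        prop_label = labels.get(prop_id, {}).get(lang)
        if prop_label is None:
            continue  # no label in this language, so leave the statement out
        for statement in statements:
            snak = statement["mainsnak"]
            if snak.get("snaktype") != "value":
                continue  # "somevalue" / "novalue" snaks carry no datavalue
            value = snak["datavalue"]["value"]
            if isinstance(value, dict) and "id" in value:
                value_label = labels.get(value["id"], {}).get(lang)
            elif isinstance(value, str):
                value_label = value
            else:
                value_label = None  # other datatypes are ignored in this sketch
            if value_label:
                parts.append(f"{prop_label}: {value_label}")

    return "\n".join(parts)


def item_snak(prop_id, item_id):
    """Build a minimal item-valued statement in the entity JSON shape."""
    return {"mainsnak": {
        "snaktype": "value",
        "property": prop_id,
        "datavalue": {"type": "wikibase-entityid",
                      "value": {"entity-type": "item", "id": item_id}},
    }}


# Hand-written sample data for illustration only (not fetched from Wikidata).
entity = {
    "id": "Q184843",
    "labels": {"nl": {"language": "nl", "value": "Blade Runner"}},
    "descriptions": {"nl": {"language": "nl", "value": "film uit 1982"}},
    "aliases": {},
    "claims": {"P161": [item_snak("P161", "Q81328"),
                        item_snak("P161", "Q999999999")]},  # no Dutch label
}
labels = {"P161": {"nl": "rolverdeling"}, "Q81328": {"nl": "Harrison Ford"}}

text_nl = build_language_text(entity, "nl", labels)
print(text_nl)
```

Statements whose property or value has no label in the language are dropped here instead of falling back to the Qnumber or Pnumber; a real implementation would more likely fall back to another language.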

As for implementation: We already have the logic to serialize our JSON to the RDF format. Maybe also add a serialization format for this that is easy to ingest by search engines? I noticed Google having a hard time indexing some of our items, see for example https://www.google.com/search?q=The+Feast+of+the+Seagods+site%3Awikidata.org&ie=utf-8&oe=utf-8 . DuckDuckGo seems to be doing a better job: https://duckduckgo.com/?q=The+Feast+of+the+Seagods+site%3Awikidata.org&t=h_&ia=web . Making it easier to index not only for our own search would be a nice added benefit.

How feasible is this? Do we already have one or multiple tasks for this on Phabricator? Phabricator has gotten a bit unclear when it comes to Wikidata search, I think because people misunderstand what the goal of each task is. It might be worthwhile to spend some time on structuring that.


Maarten

_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
