Hi Stas and Hay,
On 28-07-18 02:12, Stas Malyshev wrote:
Hi!
I could definitely see a use case for 1) and maybe for 2). For example,
let's say I remember that one movie that Rutger Hauer played in; just
searching for 'movie rutger hauer' gives back nothing:
https://www.wikidata.org/w/index.php?search=movie+rutger+hauer
While Wikipedia gives back quite a nice list of options:
https://en.wikipedia.org/w/index.php?search=movie+rutger+hauer
Well, this is not going to change with the work we're discussing. The
reason you don't get anything from Wikidata is that "movie" and
"rutger hauer" are labels from different documents and Elasticsearch
does not do joins. We only index each document by itself, and possibly
some additional data, but indexing labels from other documents is
currently beyond what we're doing. We could certainly discuss it, but
that would be a separate (and much bigger) discussion.
Changing the topic because I would like to start this separate and
bigger discussion. Query and search are quite similar, but also very
different (if you search you'll run into nice articles like
https://everypageispageone.com/2011/07/13/search-vs-query/ ). Currently
our query service is a very strong and complete service, but Wikidata
search is very poor. Let's take Blade Runner as an example:
* https://www.wikidata.org/wiki/Q184843 is what a human sees
* http://www.wikidata.org/entity/Q184843.json our internal JSON structure
* http://www.wikidata.org/entity/Q184843.rdf source for the query engine
* https://www.wikidata.org/w/index.php?title=Q184843&action=cirrusdump
what's indexed in the search engine
In my ideal world, everything I see as a human gets indexed into the
search engine, preferably in a per-language index. For example, for
Dutch, something like a text_nl field with the label, description,
aliases, statements and references in there. So index *everything* and
never see a Qnumber or Pnumber in there (extra incentive for people to
add labels in their language). Probably also everything duplicated in
the text field to fall back to. In this index you would have the "movie
Rutger Hauer", you would have the cast members ("rolverdeling: Harrison
Ford" etc.). Yes, this will give a significant increase in index size,
but it will make it much easier to actually find things.
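To make the idea a bit more concrete, here is a rough Python sketch of how
one item could be flattened into such a text_nl field. Everything in it is
an assumption for illustration: the entity is a hand-typed, heavily
simplified stand-in for the real JSON, the label table is sample data (ids
other than Q184843 may not match the live site), and flatten() is a
hypothetical helper, not existing Wikibase or CirrusSearch code.

```python
# Hand-typed sample data, NOT fetched from Wikidata. The real entity JSON
# nests statement values much deeper; this only keeps property -> item ids.
ENTITY = {
    "id": "Q184843",
    "labels": {"nl": "Blade Runner", "en": "Blade Runner"},
    "descriptions": {"nl": "film uit 1982", "en": "1982 film"},
    "aliases": {"nl": [], "en": []},
    "claims": {"P31": ["Q11424"], "P161": ["Q81328", "Q213574"]},
}

# Hypothetical label lookup for the properties and items referenced above.
LABELS = {
    "P31": {"nl": "is een", "en": "instance of"},
    "P161": {"nl": "rolverdeling", "en": "cast member"},
    "Q11424": {"nl": "film", "en": "film"},
    "Q81328": {"en": "Harrison Ford"},  # no Dutch label: falls back to English
    "Q213574": {"nl": "Rutger Hauer", "en": "Rutger Hauer"},
}


def flatten(entity, labels, lang, fallback="en"):
    """Build one search document with a per-language full-text field."""

    def label(entity_id):
        names = labels.get(entity_id, {})
        return names.get(lang) or names.get(fallback)

    parts = []
    for key in ("labels", "descriptions"):
        value = entity.get(key, {}).get(lang)
        if value:
            parts.append(value)
    parts.extend(entity.get("aliases", {}).get(lang, []))

    for prop, values in entity.get("claims", {}).items():
        prop_label = label(prop)
        for value in values:
            value_label = label(value)
            # Skip anything without a label, so no raw Q/P ids end up
            # in the indexed text.
            if prop_label and value_label:
                parts.append(f"{prop_label}: {value_label}")

    return {"id": entity["id"], f"text_{lang}": "\n".join(parts)}


doc = flatten(ENTITY, LABELS, "nl")
print(doc["text_nl"])
```

With a document like this, a query such as "film rutger hauer" can match
within a single document, because the labels of the linked items are
copied in at index time instead of being joined at query time.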
As for implementation: We already have the logic to serialize our JSON
to the RDF format. Maybe also add a serialization format for this that
is easy to ingest by search engines? I noticed Google having a hard time
indexing some of our items, see for example
https://www.google.com/search?q=The+Feast+of+the+Seagods+site%3Awikidata.org&ie=utf-8&oe=utf-8
DuckDuckGo seems to be doing a better job:
https://duckduckgo.com/?q=The+Feast+of+the+Seagods+site%3Awikidata.org&t=h_&ia=web
Making it easier to index, not only for our own search, would be a nice
added benefit.
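As one possible shape for such a format, the sketch below writes flattened
documents as newline-delimited JSON in the layout the Elasticsearch bulk
API accepts (an action line followed by the document source). The index
name "wikidata_fulltext", the to_bulk_ndjson() helper and the sample
document are made up for illustration; this is not an existing export.

```python
import json


def to_bulk_ndjson(docs, index_name):
    """Serialize documents as bulk-API NDJSON: action line, then source."""
    lines = []
    for doc in docs:
        # The id goes into the action line; the rest is the document body.
        source = {key: value for key, value in doc.items() if key != "id"}
        action = {"index": {"_index": index_name, "_id": doc["id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk API expects the payload to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical flattened document, typed by hand for this example.
DOCS = [{"id": "Q184843", "text_nl": "Blade Runner\nrolverdeling: Rutger Hauer"}]
print(to_bulk_ndjson(DOCS, "wikidata_fulltext"), end="")
```

A line-based format like this is easy to stream and to split into batches,
which is the main reason to pick it here over one large JSON array.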
How feasible is this? Do we already have one or multiple tasks for this
on Phabricator? Phabricator has gotten a bit unclear when it comes to
Wikidata search, I think because of misunderstandings between people
about what the goal of each task is. It might be worthwhile spending
some time on structuring that.
Maarten
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata