Hi,

I think we already index way more than P31 and P279. For instance, we have 102,301,706 (approximate) distinct values in the term lexicon for statement_keywords. Sadly I can't extract the list of unique PIDs used (we'd have to enable field_data on statement_keywords.property). The top 1000 is here:
https://docs.google.com/spreadsheets/d/1E58W_t_o6vTNUAx_TG3ifW6-eZE4KJ2VGEaBX_74YkY/edit?usp=sharing

I think this is because we index statements not only by PID but also by data type. So I think the increase is smaller than what you anticipate.

What I'd try to avoid in general is indexing terms that occur in only one doc, since they are pretty useless. I think we should investigate what kind of data we may have here, and at least for statement_keywords I would not index data that contains random text (esp. natural language), since such values are prone to be unique and impossible to search.
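To make the "terms that occur in only one doc" point concrete, here is a minimal Python sketch of how one might count document frequency per term and pick out the singletons. It does not query the real index: the sample documents, the `PID=value` term format shown, and the function names are assumptions made up for illustration.

```python
from collections import Counter


def doc_frequencies(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for terms in docs:
        df.update(set(terms))  # set() so a term counts once per document
    return df


def singleton_terms(docs):
    """Return the terms that occur in exactly one document, sorted."""
    return sorted(t for t, n in doc_frequencies(docs).items() if n == 1)


# Hypothetical sample data (not taken from the real index): each inner list
# stands for the statement_keywords values of one document.
docs = [
    ["P31=Q5", "P106=Q82955"],
    ["P31=Q5", "P1476=Some unique article title"],
    ["P31=Q13442814", "P1476=Another unique title"],
]

for term in singleton_terms(docs):
    print(term)
```

In this toy data the free-text P1476 values are all singletons, which is the pattern described above: natural-language values tend to be unique per document, so they enlarge the term lexicon without being useful to search on.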
On Thu, Jul 26, 2018 at 11:48 PM Stas Malyshev <smalys...@wikimedia.org> wrote:

> Hi!
>
> Today we are indexing in ElasticSearch almost all string properties
> (except a few) and select item properties (P31 and P279). We've been
> asked to extend this set and index more item properties
> (https://phabricator.wikimedia.org/T199884). We did not do it from the
> start because we did not want to add too much data to the index at once,
> and wanted to see how the index behaves. To evaluate what this change
> would mean, some statistics:
>
> All usage of item properties in statements is about 231 million uses
> (according to sqid tool database). Of those, about 50M uses are
> "instance of" which we are already indexing. Another 98M uses belong to
> two properties - published in (P1433) and cites (P2860). Leaving about
> 86M for the rest of the properties.
>
> So, if we index all the item properties except P2860 and P1433, we'll be
> a little more than doubling the amount of data we're storing for this
> field, which seems OK. But if we index those too, we'll be essentially
> quadrupling it - which may be OK too, but is bigger jump and one that
> may potentially cause some issues.
>
> So, we have two questions:
> 1. Do we want to enable indexing for all item properties? Note that if
> you just want to find items with certain statement values, Wikidata
> Query Service matches this use case best. It's only in combination with
> actual fulltext search where on-wiki search is better.
>
> 2. Do we need to index P2860 and P1433 at all, and if so, would it be ok
> if we omit indexing for now?
>
> Would be glad to hear thoughts on the matter.
>
> Thanks,
> --
> Stas Malyshev
> smalys...@wikimedia.org
>
> _______________________________________________
> discovery-private mailing list
> discovery-priv...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/discovery-private
>
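For reference, the "doubling" and "quadrupling" estimates in the quoted message can be reproduced with a few lines of Python. This is only a rough sketch using the approximate figures as quoted (in millions of uses). It takes the 50M "instance of" uses as the whole baseline, which is a simplification: P279 is also already indexed, but its count is not given in the message. The variable and function names are made up for illustration.

```python
# Approximate figures in millions of statement uses, as quoted in the message.
INSTANCE_OF = 50   # P31, already indexed (used here as the baseline)
BIG_TWO = 98       # P1433 (published in) + P2860 (cites)
REST = 86          # all remaining item properties


def growth_factor(added, baseline=INSTANCE_OF):
    """Ratio of the data size after indexing `added` more uses to the baseline."""
    return (baseline + added) / baseline


without_big_two = growth_factor(REST)
with_big_two = growth_factor(REST + BIG_TWO)

print(f"without P1433/P2860: {without_big_two:.2f}x")
print(f"with P1433/P2860:    {with_big_two:.2f}x")
```

With these inputs the factors come out at roughly 2.7x and 4.7x relative to the baseline, in line with "a little more than doubling" and "essentially quadrupling".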
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata