[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store
Smalyshev added a comment.

> In this case, one would not be able to distinguish this from the case where
> two statements with two qualifiers each had been given originally

It is possible to distinguish them, since claim IDs are also recorded for bookkeeping: the parts of a split claim share the same ID, while different claims have different IDs. I'm still not sure why this distinction is important, though.

> My point was that an attacker could craft a single statement that makes you
> index millions of statements.

It is easy to introduce limits if this is of any concern. Since nothing in our current data expands by a large factor, limiting the expansion factor to, say, 50 or so would not affect the system and would prevent such problems.

TASK DETAIL
https://phabricator.wikimedia.org/T86278

REPLY HANDLER ACTIONS
Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign .

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev
Cc: Aklapper, Smalyshev, Lydia_Pintscher, Multichill, Magnus, daniel, JeroenDeDauw, JanZerebecki, aude, mkroetzsch, Denny, Sjoerddebruin, Tobi_WMDE_SW, jkroll, Wikidata-bugs, GWicke, Manybubbles

___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store
Smalyshev added a comment.

**This changes the structure, and the original structure is no longer represented and can no longer be faithfully recovered.**

This is not correct: the original structure can be recovered, though I see no reason why you would want to do so. Can you name one?

**a mere 40 qualifiers (20 properties, each with two values) on one forged statement, I could create a million statements in your index -- a possible DOS attack vector**

A scan of the database shows that no entries generate more than 15 qualifier splits (at least I couldn't find any a month ago). If it ever becomes a problem, we could easily institute limits, but I think vandalism is better handled at other levels than by changing our data model to avoid vandals.

I see no case where 20 duplicate qualifiers would legitimately be required. A duplicate qualifier is usually the wrong way to represent a claim, since it essentially says that the same event happened in two places, at two times, etc. That usually means there should be two separate claims, since an event happening twice is two different instances of the event. So I would argue that even most existing duplicates look like data errors.
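As a rough illustration of the recovery argument, the sketch below regroups split rows by claim ID. The row layout, the claim IDs and the `regroup` helper are all hypothetical and are not taken from the actual Titan import code; the sketch only shows that a shared ID is enough to tell one split claim apart from two separately entered claims. It does not settle whether every detail of the original, such as qualifier order, would survive.

```python
# Hypothetical flattened rows after splitting a claim with duplicate qualifiers:
# (claim_id, property, value, {qualifier_property: qualifier_value})
split_rows = [
    ("claim-a", "population", 100, {"point in time": "2013"}),
    ("claim-a", "population", 100, {"point in time": "2014"}),
    ("claim-b", "population", 200, {"point in time": "2014"}),
]


def regroup(rows):
    """Merge rows that share a claim ID back into one claim each."""
    claims = {}
    for claim_id, prop, value, qualifiers in rows:
        claim = claims.setdefault(
            claim_id, {"property": prop, "value": value, "qualifiers": {}}
        )
        for q_prop, q_value in qualifiers.items():
            values = claim["qualifiers"].setdefault(q_prop, [])
            if q_value not in values:
                values.append(q_value)
    return claims


original = regroup(split_rows)
for claim_id, claim in original.items():
    print(claim_id, claim["value"], claim["qualifiers"])
```

Here `claim-a` comes back as a single claim with two values for the same qualifier, while `claim-b` remains a separate claim.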
[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store
mkroetzsch added a comment.

@JanZerebecki I understand what you are saying about what "indexing" means here. Makes sense to me. What you are saying about my example query sounds as if you are planning to implement query execution manually. I hope this is not the case and you can just give the query to Titan to get it answered for you.

You mentioned splitting certain statements with duplicate qualifiers. This changes the structure, and the original structure is no longer represented and can no longer be faithfully recovered. I don't know if this is an issue with the current data (which duplicate qualifiers are actually used?) but it is an issue in general. It also means that with a mere 40 qualifiers (20 properties, each with two values) on one forged statement, I could create a million statements in your index -- a possible DOS attack vector?

An alternative technique for handling such cases is to use a special encoding for overflowing values. This is done successfully in some database implementations, but it may mean additional checks during query answering, and depending on how low in the query answering process you can do these checks, they may or may not add significant cost (hence it's a pity that Titan does not support this efficiently out of the box).
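The arithmetic behind the "million statements" figure: if a statement is split into one copy per combination of qualifier values, 20 properties with two values each give 2^20 = 1,048,576 copies. The sketch below computes that factor and refuses to split past a cap. The function names and the dictionary layout are assumptions made for illustration, and the default cap of 50 is simply the figure floated elsewhere in this thread, not an implemented setting.

```python
from itertools import product
from math import prod


def expansion_factor(qualifiers):
    """Number of statements produced by splitting on every value combination."""
    return prod(len(values) for values in qualifiers.values())


def split_statement(qualifiers, limit=50):
    """Return one qualifier dict per combination, refusing oversized expansions."""
    factor = expansion_factor(qualifiers)
    if factor > limit:
        raise ValueError(f"expansion factor {factor} exceeds limit {limit}")
    props = list(qualifiers)
    combos = product(*(qualifiers[p] for p in props))
    return [dict(zip(props, combo)) for combo in combos]


# The forged statement from the comment: 20 properties, two values each.
forged = {f"prop{i}": ["a", "b"] for i in range(20)}
print(expansion_factor(forged))  # 1048576

# A small, legitimate-looking case splits into two statements.
print(split_statement({"point in time": ["2013", "2014"], "place": ["X"]}))
```

Checking the factor before building the product means a forged statement is rejected without materializing any of its combinations.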
[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store
mkroetzsch added a comment.

@Smalyshev My point is merely that sitelinks and labels //can// be handled like statements. Since statements must be supported anyway, it would be sensible to reuse the data structures and query expressions defined for them. I don't think that confusion is likely, since the query language will not use colloquial names as my examples do. Properties of Wikidata will always be referred to by their Pid, whereas something like "has badge" would not have an id of this form. So it's not like having a reserved label "has badge" that competes with Wikidata property labels.

Structurally, however, statements and sitelinks can all be represented in the same data structure as far as querying is concerned. Maybe you would like it better if you viewed it as a separate, independent data structure "qualified triple" that we would use to represent statements, sitelinks, and labels? There is no conceptual mix-up between the high-level terms here, just taking advantage of similar structure at a low level.

If you look at common query languages like SQL, SPARQL, Cypher, etc., you can see that they are always based on a relatively small set of structural primitives that do not have a domain-specific meaning. You can always build UIs that use domain-specific terms like "sitelink" and that make them appear separate, but for implementers and API users it is very useful if some things can be unified.
[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store
Smalyshev added a comment.

@mkroetzsch the issue is that "//point in time//" is an official property in Wikidata, which can be looked up and associated with property //P585//. However, "//has badge//" is not a property and as such cannot be expressed in these terms. Of course, we can make a language with some logic, either database-backed or just whitelist-based, that would identify "//has badge//" as a special case and create a special translation for it. That would complicate the engine, but it's not the main issue. I wonder whether that is really what users would expect. After all, we display links and badges in a different form from claims in the main Wikidata UI. Wouldn't the users of that UI expect different handling of sitelinks and badges in the query API too?

**any query that would work over statements would also be expected to work for sitelinks**

I'm not sure we want to make that claim, because I don't think it is true. The data model is different for sitelinks as opposed to claims: claims have values, qualifiers and references, while sitelinks have badges. In fact, the two have very little in common except that they are both attached to items. So I am concerned that making that claim would be rather confusing.
[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store
mkroetzsch added a comment.

@Smalyshev My suggestion was just about the surface appearance, not about the inner workings. I am saying that the following two phrases have the same structure:

- "Find things with a *sitelink* that *has badge* *featured*".
- "Find things with a *population* that has *point in time* *2014*".

If you look at it like this, you use "badge" like a (special) qualifier property and "featured" like a value. This does not mean that the query answering will be any less efficient than with another syntax. The query engine would easily parse the queries in any case, without ambiguity, and know that sitelinks are (possibly) stored in a different way internally. Same for labels.

The reason I was suggesting this unification here was that it also somehow answers the question "what do we mean by 'indexing' this data?": any query that would work over statements would also be expected to work for sitelinks, even if different structures are used internally.
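To make the structural parallel concrete, here is a minimal sketch in which one record type carries both a statement and a sitelink, and one lookup function answers both example phrases. The `QualifiedTriple` class, the `find` helper and the item IDs are hypothetical and invented for this illustration. The sketch concerns only the surface structure and says nothing about how either kind of data would be stored or indexed internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualifiedTriple:
    subject: str             # item the triple is attached to
    predicate: str           # e.g. "population" or the special "sitelink"
    value: str
    qualifiers: tuple = ()   # (key, value) pairs, e.g. ("has badge", "featured")


def find(triples, predicate, qualifier):
    """Subjects having a triple with this predicate and this qualifier pair."""
    return sorted(
        {t.subject for t in triples
         if t.predicate == predicate and qualifier in t.qualifiers}
    )


data = [
    QualifiedTriple("item-A", "sitelink", "enwiki:Page A", (("has badge", "featured"),)),
    QualifiedTriple("item-A", "population", "1000", (("point in time", "2014"),)),
    QualifiedTriple("item-B", "population", "2000", (("point in time", "2013"),)),
]

# "Find things with a sitelink that has badge featured"
print(find(data, "sitelink", ("has badge", "featured")))
# "Find things with a population that has point in time 2014"
print(find(data, "population", ("point in time", "2014")))
```

Both calls go through the same code path; only the predicate and the qualifier pair differ.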
[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store
JanZerebecki added a comment.

I understood "indexing everything" to mean that we at least support answering a query for a thing by exact value, and traversing from it to the things it is connected to, without iterating over anything else. (Technically, for Titan, this is a composite index over one key; see http://s3.thinkaurelius.com/docs/titan/current/indexes.html#_composite_index .) That is, we support finding the badge "Featured Article", traversing from it to the sitelink it is connected to, and from there to the item that sitelink is connected to. So this does not imply any fancy indices (such as indices over functions), nor even full-text, prefix or range indices. Does it sound OK to have this as a baseline?

If there are no other indices, that means, for example, that we can answer "Find people whose parents are married to each other." only by looking at every human that is married (which so far is indexed by exact value for each of the lookups on its own, i.e. there is no combined index for married humans) and then traversing to their children for whom both parents are in the previous set of humans (which is AFAIK currently not helped by any index).

A question about qualifiers: currently, if there are two qualifiers with the same property on the same statement, the statement is split on import into Titan as if it were two statements. Is this acceptable, or should we correct it now?
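The evaluation strategy described for "Find people whose parents are married to each other." can be sketched with plain dictionaries standing in for exact-match lookups. This is not Titan or Gremlin code. The names, the edge lists and the function are made up for illustration, and the sketch assumes each relation is available only as a per-key lookup, with no combined index.

```python
from collections import defaultdict

# Hypothetical data: undirected spouse pairs and (child, parent) edges.
spouse_edges = [("alice", "bob"), ("carol", "dave")]
parent_edges = [
    ("erin", "alice"), ("erin", "bob"),     # both parents married to each other
    ("frank", "alice"), ("frank", "dave"),  # parents not married to each other
]

spouses = defaultdict(set)    # person -> spouses (exact-match lookup)
children = defaultdict(set)   # parent -> children
parents = defaultdict(set)    # child -> parents

for a, b in spouse_edges:
    spouses[a].add(b)
    spouses[b].add(a)
for child, parent in parent_edges:
    children[parent].add(child)
    parents[child].add(parent)


def children_of_married_parents():
    """Scan every married person, then traverse to their children."""
    result = set()
    for person, partners in spouses.items():
        for child in children.get(person, ()):
            # Keep the child if another parent is married to this person.
            if parents[child] & partners:
                result.add(child)
    return sorted(result)


print(children_of_married_parents())  # ['erin']
```

The outer loop visits every married person, which is the full scan the comment warns about; the exact-match lookups only make each individual step cheap.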
[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store
mkroetzsch added a comment.

In https://phabricator.wikimedia.org/T86278#969184, @Multichill wrote:

> I would like to turn it around. We should support indexing everything: ...
> The fact that we're not creative enough to make up queries for everything
> doesn't mean it isn't useful.

I have to disagree with this approach to designing a (query) system. Of course supporting "everything" would be nice, but indexing always needs to be viewed in the context of a query language. Depending on the query features you support, "indexing" may mean completely different things. The requirement that "everything should be indexed" is fuzzy and vague.

For example, Wikidata Query supports regular path queries (in Wikidata Query, Kleene-star recursion is accessed with an operator called "TREE"). When you say that references should be indexed, do you mean that it should be possible to navigate through references within regular path expressions? How about qualifiers? I think Wikidata Query only supports TREE in the main statements. On the other hand, Wikidata Query does not support any kind of cyclic queries ("Find people whose parents are married to each other.") even though the relevant data is indexed.

The point is that you need adequate index structures for each query feature. You may have "indexed everything" and still not be able to do the queries you want. Moreover, I am creative enough to come up with queries that would be very hard to implement, yet this does not mean that we should do it. Both "everything" and "everything we are creative enough for" are poor design principles.

So how to move forward? The usual approach to designing a practical system is to collect use cases and requirements (example queries) and then make a clear decision about what should be supported and what shouldn't. One can always revise this decision later, but it's still much better to make it explicit than to just go along and see what we get.

In all of this, it must be understood that supporting "everything" is an impossible task. The question is merely where to draw the line(s).
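For readers unfamiliar with the term, a Kleene-star path query asks for everything reachable from a start node by following one kind of edge zero or more times. The sketch below shows that idea as a breadth-first traversal over a made-up edge map. It does not reproduce how Wikidata Query implements TREE; the node names and the `reachable` function are assumptions for illustration only.

```python
from collections import deque

# Hypothetical edges for a single property; note the cycle between B and D.
edges = {"A": ["B", "C"], "B": ["D"], "C": [], "D": ["B"]}


def reachable(start, edges):
    """Nodes reachable from start by following edges zero or more times."""
    seen = {start}            # zero steps: the start node itself
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:   # guards against cycles
                seen.add(nxt)
                queue.append(nxt)
    return seen


print(sorted(reachable("A", edges)))  # ['A', 'B', 'C', 'D']
```

Supporting the same traversal through qualifiers or references would need those edges to be reachable in the same way, which is the kind of per-feature index decision the comment is asking for.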
[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store
Multichill added a comment.

I would like to turn it around. We should support indexing everything:

- Labels
- Aliases
- Descriptions
- Sitelinks
  - Including badges
- Statements
  - Including ranks etc
  - Including qualifiers
  - Including references

(probably forgot something)

We should have a good reason to not index something. The fact that we're not creative enough to make up queries for everything doesn't mean it isn't useful.
[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store
Lydia_Pintscher added a comment.

Didn't you have qualifiers in the proposal already? Anyway. They should be in ;-)

Some examples:

- References: "Give me all statements referenced to Nature" or "Give me all cities in Germany over 2 million inhabitants based on the national statistics office"
- Aliases: The same as labels
- Sitelinks: "Give me all people that have an article on Malayalam Wikipedia"
- Badges: "Give me all people that have an excellent article on English Wikipedia"
[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store
Smalyshev added a comment.

Only things explicitly mentioned in the wiki are covered now. So, qualifiers are covered, but references, sitelinks, badges and aliases are not. I'd like to hear more about what lookups we would do against sitelinks, aliases and references. Do we have any existing examples?
[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store
Lydia_Pintscher added a comment.

Yeah they are.
[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store
Smalyshev added a comment.

We would like some feedback from the Wikidata team here.