Re: [Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint
My basic worries with exposing powerful query languages like SPARQL publicly are that a) there is a large attack surface in the query processing backend, and b) a client can request very expensive operations on the server without performing much work itself. Timeouts can limit the damage, but if they are set reasonably low (<1 min) they will also eliminate some of the supposed power of SPARQL, especially if the data set grows at the rate we all hope for. When reaching the timeout, the client needs to switch to iterative processing and paging. How well does BlazeGraph support paging of complex SPARQL queries without re-calculating the entire result set?

One of the things I like about the MQL design is that they are careful about identifying a couple of main hierarchies (typeOf, geographical containment, taxonomies, ...) that they can efficiently flatten into denormalized plain index lookups. These are very fast and easy to page. From what I have seen so far, they also seem to directly cover most use cases that people have come up with so far.

While perhaps too limiting in the longer term, I think such a limited 80/20 design would be a better starting point for a high-volume public API with strong availability and response time guarantees. The efficient subset of the API could then be enriched with more expensive end points over time, but those would explicitly not have the same performance guarantees as the core API. Those expensive queries could be executed on a separate cluster / set of machines to avoid interference with the core API.

Another aspect that I think warrants serious attention for an API is the complexity and reliability of constructing queries programmatically. As witnessed by the many issues around seemingly simple languages like SQL, building up query strings from user-supplied values is easy to get wrong.
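To illustrate the query-string problem, here is a minimal sketch in Python. The JSON query shape (`label`, `value`, `language`, `limit`) is purely hypothetical and does not correspond to any existing Wikidata API; the point is only that a value pasted into query text can change the query, while a value carried as structured data cannot.

```python
import json


def naive_query(label):
    # Unsafe: the user-supplied value is pasted straight into the query text.
    return 'SELECT ?item WHERE { ?item rdfs:label "%s"@en } LIMIT 10' % label


def structured_query(label):
    # Hypothetical JSON query shape: the value travels as data, and
    # json.dumps takes care of all quoting and escaping.
    return json.dumps({"label": {"value": label, "language": "en"}, "limit": 10})


# A hostile "label" that closes the string literal, adds a triple pattern,
# and uses "#" (the SPARQL comment marker) to hide the rest of the line,
# including the LIMIT clause.
hostile = 'x"@en . ?item ?p ?o } #'

print(naive_query(hostile))
print(structured_query(hostile))
```

In the first output the LIMIT ends up behind the comment marker, so the server would see an unbounded query. In the second, decoding the JSON returns the hostile string unchanged as an ordinary value.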
It is always possible to build friendly query languages on top of a JSON API, but it would IMHO be a waste of developer time to repeatedly have to deal with encoding issues and bugs in each client. This doesn't rule out SPARQL (it has a JSON encoding), but I think it's a significant disadvantage of using a custom string syntax like WDQ in the API.

Gabriel

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
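The idea of flattening a hierarchy into plain index lookups, as described in the message above, can be sketched as follows. This is not MQL's actual implementation; the toy containment data, the function names and the cursor-based paging scheme are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical toy hierarchy: child -> direct parents (geographical containment).
PARENTS = {
    "Berlin": ["Germany"],
    "Hamburg": ["Germany"],
    "Germany": ["Europe"],
    "France": ["Europe"],
    "Paris": ["France"],
}


def build_flat_index(parents):
    """Precompute, for every ancestor, the sorted list of all its descendants."""
    index = defaultdict(set)
    for node in parents:
        stack = list(parents[node])
        seen = set()
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue  # guards against cycles and repeated paths
            seen.add(ancestor)
            index[ancestor].add(node)
            stack.extend(parents.get(ancestor, []))
    return {ancestor: sorted(desc) for ancestor, desc in index.items()}


def page(index, ancestor, after=None, size=2):
    """Return up to `size` descendants that sort after the cursor `after`."""
    items = index.get(ancestor, [])
    start = 0 if after is None else bisect_right(items, after)
    return items[start:start + size]


INDEX = build_flat_index(PARENTS)
print(page(INDEX, "Europe"))                  # first page
print(page(INDEX, "Europe", after="France"))  # next page, resumed from a cursor
```

The traversal cost is paid once at index time. Each page is then a binary search plus a slice, so resuming from a cursor does not require recomputing the earlier part of the result.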
Re: [Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint
On 11.03.2015 11:26, Daniel Kinzler wrote:
> On 11.03.2015 at 10:43, Markus Krötzsch wrote:
>> I was referring to the investigations that have led to this spreadsheet:
>>
>> https://docs.google.com/a/wikimedia.org/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit#gid=0
>
> That's the backend evaluation spreadsheet. I'm not arguing against BlazeGraph as
> a backend at all. I'm questioning the outcome of the public query language
> evaluation as shown in this sheet:
>
> https://docs.google.com/a/wikimedia.de/spreadsheets/d/16bbifhuoAiO7bRQ2-0mYU5FJ9ILczC-u9oCJsPdn9IU/edit#gid=0
>
> Have a look at the weights, and at the comments, especially Gabriel's.

Right, but the overall conclusion still was to use SPARQL there, and this made further discussion of particular scores irrelevant. As it is, the sheet wildly mis-estimates the relative prominence of SPARQL and WDQ (e.g., "documentation" and "support from people"). Search for "SPARQL" on Amazon to get a rough idea. There are a number of free and commercial products implementing it. I have been teaching SPARQL to computer science students for at least 5 years, and I know many other people who do. The DBpedia community is using it on Wikipedia-based data. If you have a SPARQL-related question, ask at public-sparql-...@w3.org; there is usually good support there.

This is really comparing apples and oranges, and it would not do justice to Magnus's work to put him up against an established technology standard. WDQ is great for what it does, but if we go "official" we should move towards what people outside of the Wikidata cosmos are using. After all, this is the main target group for a public query endpoint.

Markus
Re: [Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint
On 11.03.2015 at 10:43, Markus Krötzsch wrote:
> I was referring to the investigations that have led to this spreadsheet:
>
> https://docs.google.com/a/wikimedia.org/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit#gid=0

That's the backend evaluation spreadsheet. I'm not arguing against BlazeGraph as a backend at all. I'm questioning the outcome of the public query language evaluation as shown in this sheet:

https://docs.google.com/a/wikimedia.de/spreadsheets/d/16bbifhuoAiO7bRQ2-0mYU5FJ9ILczC-u9oCJsPdn9IU/edit#gid=0

Have a look at the weights, and at the comments, especially Gabriel's.

-- daniel

--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
Re: [Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint
On 11.03.2015 at 10:08, Markus Krötzsch wrote:
> What I don't see is how the use of a WDQ API on top of SPARQL would make the
> overall setup any less vulnerable; it mainly introduces an additional component
> on top of SPARQL, and we can have a simpler SPARQL-based filter component there
> if we want, which is likely to be more effective in controlling usage.

I disagree on both points: I believe it would be neither simpler, nor more effective. That's pretty much the core of it. However, I admit that this is currently a gut feeling, a concern I want to share and discuss. It should be investigated before making a decision.

> There is a huge cost to designing a query API from scratch, and I would really
> like to avoid this.

Which is why I want to use one that already exists (WDQ), and back it by something that already exists (SPARQL).

> Supporting WDQ on top of SPARQL would retain WDQ in its current form and still
> support standards --

That's exactly what I propose.

> if we want to develop an official custom API, we will give up on both of these
> benefits, and at the same time push the ETA for Wikidata queries far into the
> future.

I disagree. If, as I believe, sandboxing WDQ is simpler than sandboxing SPARQL, using WDQ would allow us to have a public query API sooner. But whether my belief is correct needs to be investigated, of course.

> All of this has been discussed and considered in the past. I don't see why one
> would be kicking off discussions now that question everything decided in
> meetings and telcos over the past weeks. There is absolutely no new information
> compared to what has led to the consensus that we all (including Daniel) had
> reached.

The consensus as I remember it was "we should be able to expose SPARQL safely, if we invest enough time to sandbox it". The issue of lock-in was mentioned but not really assessed. The relative cost for sandboxing WDQ vs SPARQL, and the impact on the ETA, was not discussed much.
The ad-hoc evaluation spreadsheet shows WDQ in second place behind SPARQL (ahead of MQL and ASK), mainly because SPARQL is more powerful. The downside of that power doesn't factor into the evaluation, nor does the factor of lock-in. Shifting the relative weight in the spreadsheet from power to sustainability makes WDQ come out at the top.

After the initial enthusiasm, this has made me increasingly uneasy over the last weeks. Hence my mail to this list.

--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
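For readers trying to picture "WDQ backed by SPARQL", here is a minimal sketch of such a wrapper in Python. It handles only conjunctions of WDQ's `CLAIM[property:item]` form and rejects everything else. The function name, the default limit and the choice of the `wdt:`/`wd:` prefixes for the generated SPARQL are assumptions for illustration, not something the thread specifies.

```python
import re

# Accept only the CLAIM[<property>:<item>] form, with purely numeric ids.
CLAIM = re.compile(r"CLAIM\[(\d+):(\d+)\]")


def wdq_to_sparql(wdq, limit=100):
    """Translate 'CLAIM[p:q] AND CLAIM[p:q] ...' into a bounded SPARQL query."""
    patterns = []
    for part in wdq.split(" AND "):
        match = CLAIM.fullmatch(part.strip())
        if match is None:
            # Anything outside the whitelisted grammar is refused outright.
            raise ValueError("unsupported WDQ fragment: %r" % part)
        prop, item = match.groups()
        patterns.append("?item wdt:P%s wd:Q%s ." % (prop, item))
    return "SELECT ?item WHERE { %s } LIMIT %d" % (" ".join(patterns), int(limit))


print(wdq_to_sparql("CLAIM[31:5] AND CLAIM[27:183]"))
```

Because the wrapper only ever emits triple patterns it built itself from numeric ids, the set of SPARQL features that can reach the backend is fixed by the translator. Whether that makes sandboxing cheaper than filtering SPARQL directly is exactly the open question in this thread.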
Re: [Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint
On 11.03.2015 05:59, Tom Morris wrote:
> On Tue, Mar 10, 2015 at 6:17 PM, Markus Krötzsch <mar...@semantic-mediawiki.org> wrote:
>> TL;DR: No concrete issues with SPARQL were mentioned so far; OTOH many
>> *simple* SPARQL queries are not possible in WDQ; there is still time to
>> restrict ourselves -- let's give SPARQL a chance before going back.
>
> TLDR, so SPARQL is the one true way.

That's the danger of giving a TL;DR: people can misunderstand them and then use them as strawmen in arguments. My bad. I suggest you read the rest of the email and comment on this. The discussion is too complex and too important to be reduced to three lines.

>> Nik and Stas have made a careful analysis of the options, ...
>
> citation please

I was referring to the investigations that have led to this spreadsheet:

https://docs.google.com/a/wikimedia.org/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit#gid=0

The choice for SPARQL was not made by me or by anyone who has a special interest in pushing this particular formalism (in fact Nik and Stas can confirm that I have been quite sceptical about the feasibility of using BlazeGraph at first). It was the result of an open-minded discussion among people with very different backgrounds, in search of the most promising technology for our problem.

I agree that one could continue this discussion and analysis, but we need to have a balance between theoretical discussions and practical work. It might well happen that we will give up on BlazeGraph and/or SPARQL as the result of practical experiences, but it would be foolish to give up now without even trying.

Markus
Re: [Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint
On 11.03.2015 00:44, Magnus Manske wrote:
> To be fair, the discussion is not "what will we do till the end of time", rather
> "what do we start with". Knowing neither SPARQL nor the data storage engine
> terribly well, it would not be helpful if the service can be DOSed by
> innocent-looking queries, intentional or not.
>
> Exposing only a subset of SPARQL (in this case, via WDQ wrapper) initially would
> be a way to test the waters. A proper SPARQL API can be exposed at any time
> later, once we're confident it will hold up.
>
> This seems more like a technical decision in terms of "operational security",
> rather than a philosophical one about the merits of query languages (where
> SPARQL is undoubtedly more powerful than WDQ).

Sure, but my point is that there is zero evidence right now that such a WDQ wrapper would be more robust against intentional DOS. As I explained in my email, such a wrapper would still use a significant amount of SPARQL features in the back. I am sure there will be cases when the new service will go down (we have seen it happening to WDQ and, more generally, to Wikipedia, in the past). What I don't see is how the use of a WDQ API on top of SPARQL would make the overall setup any less vulnerable; it mainly introduces an additional component on top of SPARQL, and we can have a simpler SPARQL-based filter component there if we want, which is likely to be more effective in controlling usage. The only thing that could really lead to a more robust setup would be the use of a more robust backend engine, and I don't see what this should be.

The discussion here is not about which query language we should use. What Daniel proposes is to give up on supporting a standard query language and to restrict ourselves to a special-purpose API. This is a big deal. If we really want a special-purpose query language for ourselves, we would need to have a discussion about it. WDQ is a useful baseline, but it is the result of an evolution of ideas and features over time.
One would probably come up with a few different decisions when seeing the whole picture from the start. There is a huge cost to designing a query API from scratch, and I would really like to avoid this. Supporting WDQ on top of SPARQL would retain WDQ in its current form and still support standards -- if we want to develop an official custom API, we will give up on both of these benefits, and at the same time push the ETA for Wikidata queries far into the future.

All of this has been discussed and considered in the past. I don't see why one would be kicking off discussions now that question everything decided in meetings and telcos over the past weeks. There is absolutely no new information compared to what has led to the consensus that we all (including Daniel) had reached.

Regards,

Markus
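The "SPARQL-based filter component" mentioned in the message above is not specified in the thread. The following is only a rough sketch of what a first line of defence could look like: a pre-check that refuses update and federation keywords and insists on a bounded LIMIT. The keyword list, the cap of 1000 and the function name are assumptions. A real filter would work on a parsed query rather than on regular expressions, since this naive version also flags keywords inside string literals and does not accept a trailing OFFSET.

```python
import re

# Assumed policy: no federation, no update operations, bounded result size.
BLOCKED = ("SERVICE", "INSERT", "DELETE", "LOAD", "DROP", "CLEAR", "CREATE")
MAX_LIMIT = 1000
LIMIT_RE = re.compile(r"\bLIMIT\s+(\d+)\s*$", re.IGNORECASE)


def check_query(query):
    """Return a list of policy violations; an empty list means 'accepted'."""
    problems = []
    upper = query.upper()
    for keyword in BLOCKED:
        if re.search(r"\b%s\b" % keyword, upper):
            problems.append("keyword not allowed: %s" % keyword)
    match = LIMIT_RE.search(query.strip())
    if match is None:
        problems.append("query must end with LIMIT n")
    elif int(match.group(1)) > MAX_LIMIT:
        problems.append("LIMIT exceeds %d" % MAX_LIMIT)
    return problems


print(check_query("SELECT ?s WHERE { ?s ?p ?o } LIMIT 10"))
print(check_query("SELECT ?s WHERE { SERVICE <http://example.org/sparql> { ?s ?p ?o } }"))
```

A syntactic check like this cannot bound the cost of evaluating a query, so it would complement server-side timeouts rather than replace them.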
Re: [Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint
On Wed, Mar 11, 2015 at 4:52 AM Tom Morris wrote:
> How long has WDQ been in service?

Before September 2013. So, 1.5-2 years.