Re: [Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint

2015-03-11 Thread Gabriel Wicke
My basic worries with exposing powerful query languages like SPARQL
publicly are that a) there is a large attack surface in the query processing
backend, and b) a client can request very expensive operations on the
server without performing much work itself. Timeouts can limit the damage,
but if they are set reasonably low (<1 min) they will also eliminate some
of the supposed power of SPARQL, especially if the data set grows at the
rate we all hope for. When reaching the timeout, the client needs to switch
to iterative processing and paging. How well does BlazeGraph support paging
of complex SPARQL queries without re-calculating the entire result set?
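
As a rough illustration of the paging concern, here is a minimal Python sketch of what a client would do after hitting a timeout: split one large request into pages with ORDER BY plus LIMIT/OFFSET. The helper name `paged_queries` and the `ex:` vocabulary are hypothetical. Whether the backend recomputes the full ordered result for every page is exactly the open question above; this sketch only shows the client side.

```python
from typing import Iterator


def paged_queries(select_body: str, order_var: str,
                  page_size: int, pages: int) -> Iterator[str]:
    """Yield one SPARQL query string per page.

    A stable ORDER BY is needed so that consecutive pages do not
    overlap or skip rows; LIMIT/OFFSET then selects the window.
    """
    if page_size <= 0 or pages < 0:
        raise ValueError("page_size must be positive and pages non-negative")
    for page in range(pages):
        offset = page * page_size
        yield (f"{select_body} ORDER BY ?{order_var} "
               f"LIMIT {page_size} OFFSET {offset}")


# Hypothetical query body; the ex: terms are placeholders, not Wikidata's mapping.
body = ("PREFIX ex: <http://example.org/> "
        "SELECT ?item WHERE { ?item ex:instanceOf ex:Human }")

for query in paged_queries(body, "item", page_size=100, pages=3):
    print(query)
```

Each page is a complete, independent query, so a naive backend may sort the whole result set again for every request.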

One of the things I like about the MQL design is that they are careful
about identifying a couple of main hierarchies (typeOf, geographical
containment, taxonomies, ...) that they can efficiently flatten into
denormalized plain index lookups. These are very fast and easy to page.
From what I have seen so far, they also seem to directly cover most use
cases that people have come up with so far. While perhaps too limiting in
the longer term, I think such a limited 80/20 design would be a better
starting point for a high-volume public API with strong availability and
response time guarantees. The efficient subset of the API could then be
enriched with more expensive end points over time, but those would
explicitly not have the same performance guarantees as the core API. Those
expensive queries could be executed on a separate cluster / set of machines
to avoid interference with the core API.
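
The flattening idea can be sketched in a few lines of Python. This is not how MQL is implemented; it is a minimal illustration under assumed data, with hypothetical names (`parent_of`, `type_of`, `build_flat_index`). The hierarchy is expanded once at index time, so a query for "everything under X" becomes a plain lookup that pages by slicing.

```python
from collections import defaultdict
from typing import Dict, List, Set


def ancestors(parent_of: Dict[str, List[str]], node: str) -> Set[str]:
    """Return the node itself plus all transitive ancestors (cycle-safe)."""
    seen: Set[str] = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parent_of.get(current, ()))
    return seen


def build_flat_index(parent_of: Dict[str, List[str]],
                     type_of: Dict[str, str]) -> Dict[str, List[str]]:
    """Denormalize: map every class to a sorted list of all items under it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for item, direct_type in type_of.items():
        for cls in ancestors(parent_of, direct_type):
            index[cls].append(item)
    return {cls: sorted(items) for cls, items in index.items()}


def lookup(index: Dict[str, List[str]], cls: str,
           offset: int = 0, limit: int = 2) -> List[str]:
    """Paging is a slice of a precomputed list; no traversal at query time."""
    return index.get(cls, [])[offset:offset + limit]


# Made-up example data.
parent_of = {"city": ["settlement"], "village": ["settlement"],
             "settlement": ["place"]}
type_of = {"Berlin": "city", "Dresden": "city", "Wacken": "village"}

index = build_flat_index(parent_of, type_of)
print(lookup(index, "place", offset=0, limit=2))  # ['Berlin', 'Dresden']
print(lookup(index, "place", offset=2, limit=2))  # ['Wacken']
```

The cost moves from query time to index maintenance, which is the trade-off behind the performance guarantees discussed above.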

Another aspect that I think warrants serious attention for an API is the
complexity and reliability of constructing queries programmatically. As
witnessed by the many issues around seemingly simple languages like SQL,
building up query strings from user-supplied values is easy to get wrong.
It is always possible to build friendly query languages on top of a JSON
API, but it would IMHO be a waste of developer time to repeatedly have to
deal with encoding issues and bugs in each client. This doesn't rule out
SPARQL (it has a JSON encoding), but I think it's a significant
disadvantage of using a custom string syntax like WDQ in the API.
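
To make the encoding problem concrete, here is a small Python sketch contrasting naive string concatenation with explicit literal escaping, plus a structured alternative. The function names and the JSON shape are hypothetical and do not correspond to any existing API; the escape table covers only the common cases for double-quoted SPARQL string literals.

```python
import json

PREFIX = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "

# Characters that must be escaped inside a double-quoted SPARQL literal.
_ESCAPES = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def naive_query(label: str) -> str:
    """Error-prone: the user-supplied value is pasted in unescaped."""
    return PREFIX + 'SELECT ?item WHERE { ?item rdfs:label "' + label + '" }'


def sparql_string_literal(value: str) -> str:
    """Quote a value as a SPARQL string literal, escaping special characters."""
    return '"' + "".join(_ESCAPES.get(ch, ch) for ch in value) + '"'


def escaped_query(label: str) -> str:
    return (PREFIX + "SELECT ?item WHERE { ?item rdfs:label "
            + sparql_string_literal(label) + " }")


def structured_query(label: str) -> str:
    """Hypothetical JSON query: the encoder handles quoting for every client."""
    return json.dumps({"select": "item", "where": {"label": label}})


hostile = 'x" } UNION { ?item ?p ?o } #'
print(naive_query(hostile))       # the quote ends the literal early
print(escaped_query(hostile))     # the quote stays inside the literal
print(structured_query(hostile))  # no hand-written escaping at all
```

With a string syntax, every client has to get `sparql_string_literal` (or its equivalent) right; with a structured encoding, that work is done once in a standard library.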

Gabriel
___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint

2015-03-11 Thread Markus Krötzsch

On 11.03.2015 11:26, Daniel Kinzler wrote:

> On 11.03.2015 at 10:43, Markus Krötzsch wrote:
>
>> I was referring to the investigations that have led to this spreadsheet:
>>
>> https://docs.google.com/a/wikimedia.org/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit#gid=0
>
> That's the backend evaluation spreadsheet. I'm not arguing against BlazeGraph as
> a backend at all.
>
> I'm questioning the outcome of the public query language evaluation as shown in
> this sheet:
>
> https://docs.google.com/a/wikimedia.de/spreadsheets/d/16bbifhuoAiO7bRQ2-0mYU5FJ9ILczC-u9oCJsPdn9IU/edit#gid=0
>
> Have a look at the weights, and at the comments, especially Gabriel's.


Right, but the overall conclusion still was to use SPARQL there, and 
this made further discussion of particular scores irrelevant. As it is, 
the sheet wildly mis-estimates the relative prominence of SPARQL and WDQ 
(e.g., "documentation" and "support from people"). Search for "SPARQL" 
on Amazon to get a rough idea. There are a number of free and commercial 
products implementing it. I have been teaching SPARQL to computer science 
students for at least 5 years, and I know many other people who do. 
The DBpedia community is using it on Wikipedia-based data. If you have a 
SPARQL-related question, ask at public-sparql-...@w3.org; there is 
usually good support there.


This is really comparing apples and oranges, and it would not do justice 
to Magnus's work to put him up against an established technology 
standard. WDQ is great for what it does, but if we go "official" we 
should move towards what people outside of the Wikidata cosmos are 
using. After all, this is the main target group for a public query endpoint.


Markus




Re: [Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint

2015-03-11 Thread Daniel Kinzler
On 11.03.2015 at 10:43, Markus Krötzsch wrote:
> I was referring to the investigations that have led to this spreadsheet:
> 
> https://docs.google.com/a/wikimedia.org/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit#gid=0

That's the backend evaluation spreadsheet. I'm not arguing against BlazeGraph as
a backend at all.

I'm questioning the outcome of the public query language evaluation as shown in
this sheet:

https://docs.google.com/a/wikimedia.de/spreadsheets/d/16bbifhuoAiO7bRQ2-0mYU5FJ9ILczC-u9oCJsPdn9IU/edit#gid=0

Have a look at the weights, and at the comments, especially Gabriel's.

-- daniel


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.



Re: [Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint

2015-03-11 Thread Daniel Kinzler
On 11.03.2015 at 10:08, Markus Krötzsch wrote:
> What I don't see is how the use of a WDQ API on top of SPARQL would make the
> overall setup any less vulnerable; it mainly introduces an additional component
> on top of SPARQL, and we can have a simpler SPARQL-based filter component there
> if we want, which is likely to be more effective in controlling usage.

I disagree on both points: I believe it would be neither simpler, nor more
effective. That's pretty much the core of it.

However, I admit that this is currently a gut feeling, a concern I want to share
and discuss. It should be investigated before making a decision.

> There is a huge cost to
> designing a query API from scratch, and I would really like to avoid this.

Which is why I want to use one that already exists (WDQ), and back it by
something that already exists (SPARQL).

> Supporting WDQ on top of SPARQL would retain WDQ in its current form and still
> support standards -- 

That's exactly what I propose.

> if we want to develop an official custom API, we will give
> up on both of these benefits, and at the same time push the ETA for Wikidata
> queries far into the future.

I disagree. If, as I believe, sandboxing WDQ is simpler than sandboxing SPARQL,
using WDQ would allow us to have a public query API sooner. But whether my
belief is correct needs to be investigated, of course.

> All of this has been discussed and considered in the past. I don't see why one
> would be kicking off discussions now that question everything decided in
> meetings and telcos over the past weeks. There is absolutely no new information
> compared to what has led to the consensus that we all (including Daniel) had
> reached.

The consensus as I remember it was "we should be able to expose SPARQL safely,
if we invest enough time to sandbox it". The issue of lock-in was mentioned but
not really assessed. The relative cost for sandboxing WDQ vs SPARQL, and the
impact on the ETA, was not discussed much. The ad-hoc evaluation spreadsheet
shows WDQ as second to SPARQL (ahead of MQL and ASK), mainly because SPARQL is
more powerful.

The downside of that power doesn't factor into the evaluation, nor does the
factor of lock-in. Shifting the relative weight in the spreadsheet from power to
sustainability makes WDQ come out at the top.

After the initial enthusiasm, this has made me increasingly uneasy over the last
weeks. Hence my mail to this list.


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.



Re: [Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint

2015-03-11 Thread Markus Krötzsch

On 11.03.2015 05:59, Tom Morris wrote:

> On Tue, Mar 10, 2015 at 6:17 PM, Markus Krötzsch
> <mar...@semantic-mediawiki.org> wrote:
>
>> TL;DR: No concrete issues with SPARQL were mentioned so far; OTOH
>> many *simple* SPARQL queries are not possible in WDQ; there is still
>> time to restrict ourselves -- let's give SPARQL a chance before
>> going back.
>
> TLDR, so SPARQL is the one true way.


That's the danger of giving a TL;DR: people can misunderstand them and 
then use them as strawmen in arguments. My bad. I suggest you read the 
rest of the email and comment on this. The discussion is too complex and 
too important to be reduced to three lines.




>> Nik and Stas have made a careful analysis of the options, ...
>
> citation please


I was referring to the investigations that have led to this spreadsheet:

https://docs.google.com/a/wikimedia.org/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit#gid=0

The choice for SPARQL was not made by me or by anyone who has a special 
interest in pushing this particular formalism (in fact Nik and Stas can 
confirm that I have been quite sceptical about the feasibility of using 
BlazeGraph at first). It was the result of an open-minded discussion 
among people with very different backgrounds, in search of the most 
promising technology for our problem. I agree that one could continue 
this discussion and analysis, but we need to have a balance between 
theoretical discussions and practical work. It might well happen that we 
will give up on BlazeGraph and/or SPARQL as the result of practical 
experiences, but it would be foolish to give up now without even trying.


Markus




Re: [Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint

2015-03-11 Thread Markus Krötzsch

On 11.03.2015 00:44, Magnus Manske wrote:

> To be fair, the discussion is not "what will we do till the end of
> time", rather "what do we start with".
>
> Knowing neither SPARQL nor the data storage engine terribly well, it
> would not be helpful if the service can be DOSed by innocent-looking
> queries, intentional or not. Exposing only a subset of SPARQL (in this
> case, via WDQ wrapper) initially would be a way to test the waters. A
> proper SPARQL API can be exposed at any time later, once we're confident
> it will hold up.
>
> This seems more like a technical decision in terms of "operational
> security", rather than a philosophical one about the merits of query
> languages (where SPARQL is undoubtedly more powerful than WDQ).



Sure, but my point is that there is zero evidence right now that such a 
WDQ wrapper would be more robust against intentional DOS. As I explained 
in my email, such a wrapper would still use a significant number of 
SPARQL features in the backend. I am sure there will be cases when the new 
service will go down (we have seen it happening to WDQ and, more 
generally, to Wikipedia, in the past). What I don't see is how the use 
of a WDQ API on top of SPARQL would make the overall setup any less 
vulnerable; it mainly introduces an additional component on top of 
SPARQL, and we can have a simpler SPARQL-based filter component there if 
we want, which is likely to be more effective in controlling usage. The 
only thing that could really lead to a more robust setup would be the 
use of a more robust backend engine, and I don't see what this should be.
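
For illustration only, a SPARQL-based filter component of the kind mentioned above could start out as a pre-flight check like the following Python sketch. The limits, the keyword list, and the name `check_query` are assumptions made up for this example. A real filter would inspect the parsed query rather than use regular expressions, since keyword matching on the raw string also triggers on string literals.

```python
import re
from typing import List

# Hypothetical policy values, chosen only for the example.
MAX_QUERY_CHARS = 2000
MAX_LIMIT = 1000
# Federation and SPARQL Update keywords a read-only public endpoint might refuse.
FORBIDDEN_KEYWORDS = ("SERVICE", "INSERT", "DELETE", "LOAD", "DROP", "CLEAR")


def check_query(query: str) -> List[str]:
    """Return a list of policy violations; an empty list means 'accept'."""
    problems: List[str] = []
    if len(query) > MAX_QUERY_CHARS:
        problems.append("query too long")
    for keyword in FORBIDDEN_KEYWORDS:
        if re.search(rf"\b{keyword}\b", query, re.IGNORECASE):
            problems.append(f"keyword not allowed: {keyword}")
    limit = re.search(r"\bLIMIT\s+(\d+)", query, re.IGNORECASE)
    if limit is None:
        problems.append("missing LIMIT")
    elif int(limit.group(1)) > MAX_LIMIT:
        problems.append("LIMIT too large")
    return problems


print(check_query("SELECT ?s WHERE { ?s ?p ?o } LIMIT 10"))
print(check_query("SELECT ?s WHERE { SERVICE <http://example.org/sparql> { ?s ?p ?o } }"))
```

Such a check bounds the result size and the feature set, but it cannot by itself predict how expensive an accepted query is to evaluate, so server-side timeouts would still be needed.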


The discussion here is not about which query language we should use. 
What Daniel proposes is to give up on supporting a standard query 
language and to restrict ourselves to a special-purpose API. This is a big deal. 
If we really want a special-purpose query language for ourselves, we 
would need to have a discussion about it. WDQ is a useful baseline, but 
it is the result of an evolution of ideas and features over time. One 
would probably come up with a few different decisions when seeing the 
whole picture from the start. There is a huge cost to designing a query 
API from scratch, and I would really like to avoid this. Supporting WDQ 
on top of SPARQL would retain WDQ in its current form and still support 
standards -- if we want to develop an official custom API, we will give 
up on both of these benefits, and at the same time push the ETA for 
Wikidata queries far into the future.


All of this has been discussed and considered in the past. I don't see 
why one would be kicking off discussions now that question everything 
decided in meetings and telcos over the past weeks. There is absolutely 
no new information compared to what has led to the consensus that we all 
(including Daniel) had reached.


Regards,

Markus




Re: [Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint

2015-03-11 Thread Magnus Manske
On Wed, Mar 11, 2015 at 4:52 AM Tom Morris wrote:

> How long has WDQ been in service?
>
>
Before September 2013. So, 1.5-2 years.