Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-12 Thread Stas Malyshev
Hi!

> The Linked Data Fragments approach Osma mentioned is very interesting
> (particularly the bit about setting it up on top of a regularly
> updated existing endpoint), and could provide another alternative,
> but I have not yet experimented with it.

There is apparently this: https://github.com/CristianCantoro/wikidataldf
though I'm not sure what its status is - I just found it.

In general, yes, I think checking out LDF may be a good idea. I'll put
it on my todo list.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-12 Thread Neubert, Joachim
It's great how this discussion evolves - thanks to everybody!

Technically, I completely agree that in practice it may prove impossible to 
predict the load a query will produce. Relational databases have invested years 
and years in query optimization (e.g., Oracle's cost-based optimizer, which 
relies on extended statistics gathered at runtime), and I can't see that 
similar investments are possible for triple stores.

What I could imagine for public endpoints is the SPARQL engine monitoring and 
prioritizing queries: the longer a query has already been running, or the more 
resources it has already consumed, the lower the priority it gets when it is 
re-scheduled (up to some final limit). But this is just a theoretical 
consideration; I'm not aware of any system that implements anything like this - 
and it could only be implemented in the engine itself.
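
Purely as an illustration of that idea - and explicitly not how any existing 
engine works - such aging-based re-scheduling might look roughly like the 
following Python sketch, where a query's priority worsens with the runtime it 
has already consumed:

    import heapq
    import time

    class Query:
        """Toy stand-in for an engine-level query that runs in small slices."""
        def __init__(self, name, work_units):
            self.name = name
            self.remaining = work_units

        def run_slice(self):
            self.remaining -= 1           # pretend to execute one slice of work
            return self.remaining <= 0    # True once the query is finished

    def schedule(queries, max_elapsed_seconds=10.0):
        # min-heap keyed by elapsed runtime: the more time a query has already
        # consumed, the further back in the queue it lands when re-scheduled
        heap = [(0.0, i, time.time(), q) for i, q in enumerate(queries)]
        heapq.heapify(heap)
        while heap:
            _, i, started, query = heapq.heappop(heap)
            if query.run_slice():
                print(query.name, "finished")
                continue
            elapsed = time.time() - started
            if elapsed > max_elapsed_seconds:
                print(query.name, "aborted at the final limit")
            else:
                heapq.heappush(heap, (elapsed, i, started, query))

    schedule([Query("cheap lookup", 3), Query("expensive analytics", 50)])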

For ZBW's SPARQL endpoints, I've implemented a much simpler three-level 
strategy, which does not involve the engine at all:

1. Endpoints which drive production-level services (e.g. autosuggest or 
retrieval enhancement functions). These endpoints run on separate machines and 
offer completely encapsulated services via a public API 
(http://zbw.eu/beta/econ-ws), without any direct SPARQL access.

2. Public "beta" endpoints (http://zbw.eu/beta/sparql). These offer 
unrestricted SPARQL access, but without any garanties about performance or 
availability - though of course I do my best to keep these up and running. They 
run on an own virtual machine, and should not hurt any other parts of the 
infrastructure when getting overloaded or out of control.

3. Public "experimental" endpoints. These include in particular an endpoint for 
the GND dataset with 130m triples. It was mainly created for internal use 
because (to my best knowledge) no other public GND endpoint exists. The 
endpoint is not linked from the GND pages of DNB, and I've advertised it very 
low-key on a few mailing lists. For these experimental endpoints, we reserve 
the right to shut them down for the public if they get flooded with more 
requests than they can handle.

It may be of interest that, up to now, none of these public endpoints has run 
into issues with attacks or malicious queries (which were a matter of concern 
when I started this in 2009), nor with longer-lasting massive access. Of 
course, that is different for Wikidata, where the data is of interest to 
_many_ more people. But if at all affordable, I'd like to encourage offering 
some kind of experimental access with really wide limits in an "unstable" 
setting, in addition to the reliable services. For most people who just want to 
check something out, it's not an option to download the whole dataset and set 
up an infrastructure for it. For us, this was an issue even with the much 
smaller GND set.

The Linked Data Fragments approach Osma mentioned is very interesting 
(particularly the bit about setting it up on top of a regularly updated 
existing endpoint) and could provide another alternative, but I have not yet 
experimented with it.

Have a fine weekend - Joachim

-Original Message-
From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On Behalf Of 
Markus Krötzsch
Sent: Friday, 12 February 2016 09:44
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] SPARQL CONSTRUCT results truncated

On 12.02.2016 00:04, Stas Malyshev wrote:
> Hi!
>
>> We basically have two choices: either we offer a limited interface 
>> that only allows a narrow range of queries to be run, or we offer a 
>> very general interface that can run arbitrary queries, but we impose 
>> limits on time and memory consumption. I would actually prefer the 
>> first option, because it's more predictable and doesn't get people's 
>> hopes up too far. What do you think?
>
> That would require implementing a pretty smart SPARQL parser... I don't 
> think it's worth the investment of time. I'd rather put caps on runtime 
> and maybe also on parallel queries per IP, to ensure fair access. We 
> may also have a way to run longer queries - in fact, we'll need it 
> anyway if we want to automate lists - but that is longer term; we'll 
> need to figure out infrastructure for that and how we allocate access.

+1

Restricting queries syntactically to be "simpler" is what we did in Semantic 
MediaWiki (because MySQL did not support time/memory limits per query). It is a 
workaround, but it will not prevent long-running queries unless you make the 
syntactic restrictions really severe (and thereby forbid many simple queries, 
too). I would not do it if there is support for time/memory limits instead.

In the end, even the SPARQL engines are not able to predict reliably how 
complicated a query is going to be -- it's an important part of their work (for 
optimising query execution), but it is also very difficult.

Markus

>


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata

Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-12 Thread Markus Krötzsch

On 12.02.2016 10:01, Osma Suominen wrote:

12.02.2016, 10:43, Markus Krötzsch wrote:


Restricting queries syntactically to be "simpler" is what we did in
Semantic MediaWiki (because MySQL did not support time/memory limits per
query). It is a workaround, but it will not prevent long-running queries
unless you make the syntactic restrictions really severe (and thereby
forbid many simple queries, too). I would not do it if there is support
for time/memory limits instead.


Would providing a Linked Data Fragments server [1] help here? It seems
to be designed exactly for situations like this, where you want to
provide a SPARQL query service over a large amount of linked data, but
are worried about server performance, particularly for complex,
long-running queries. Linked Data Fragments pushes some of the heavy
processing to the client side, which parses and executes the SPARQL
queries.

Dynamically updating the data might be an issue here, but some of the
server implementations support running on top of a SPARQL endpoint [2].
I think that from the perspective of the server this means that a heavy,
long-running SPARQL query is already broken up on the client side into
several small, simple SPARQL queries that are relatively easy to serve.


There already is such a service for Wikidata (Cristian Consonni set it 
up a while ago). You could try whether the query works there. I think 
that such queries would be rather challenging for a server of this type, 
since they require you to iterate over almost all of the data on the 
client side. Note that both "instance of human" and "has a GND 
identifier" are not very selective properties. In this sense, the 
queries may not be "relatively easy to serve" in this particular case.


Markus



-Osma

[1] http://linkeddatafragments.org/

[2]
https://github.com/LinkedDataFragments/Server.js#configure-the-data-sources





___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-12 Thread Osma Suominen

12.02.2016, 10:43, Markus Krötzsch wrote:


Restricting queries syntactically to be "simpler" is what we did in
Semantic MediaWiki (because MySQL did not support time/memory limits per
query). It is a workaround, but it will not prevent long-running queries
unless you make the syntactic restrictions really severe (and thereby
forbid many simple queries, too). I would not do it if there is support
for time/memory limits instead.


Would providing a Linked Data Fragments server [1] help here? It seems 
to be designed exactly for situations like this, where you want to 
provide a SPARQL query service over a large amount of linked data, but 
are worried about server performance, particularly for complex, 
long-running queries. Linked Data Fragments pushes some of the heavy 
processing to the client side, which parses and executes the SPARQL 
queries.


Dynamically updating the data might be an issue here, but some of the 
server implementations support running on top of a SPARQL endpoint [2]. 
I think that from the perspective of the server this means that a heavy, 
long-running SPARQL query is already broken up on the client side into 
several small, simple SPARQL queries that are relatively easy to serve.
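
Roughly, a Triple Pattern Fragments client turns a query into many small 
HTTP requests like the one sketched below. The fragments URL is 
hypothetical, and the subject/predicate/object parameter names simply 
follow the usual TPF interface:

    import requests

    # Hypothetical fragments server URL; parameter names follow the usual
    # Triple Pattern Fragments interface.
    FRAGMENTS_URL = "https://example.org/wikidata/fragments"

    def fetch_fragment(subject=None, predicate=None, obj=None, page=1):
        """Fetch one small, paged fragment matching a single triple pattern."""
        params = {"page": page}
        if subject:
            params["subject"] = subject
        if predicate:
            params["predicate"] = predicate
        if obj:
            params["object"] = obj
        # each response contains only one page of matching triples plus paging
        # metadata, so every individual request stays cheap for the server
        response = requests.get(FRAGMENTS_URL, params=params,
                                headers={"Accept": "text/turtle"})
        response.raise_for_status()
        return response.text

    # e.g. one cheap request per triple pattern; the client joins the pages
    # locally instead of asking the server to evaluate the whole query
    print(fetch_fragment(predicate="http://www.wikidata.org/prop/direct/P227"))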


-Osma

[1] http://linkeddatafragments.org/

[2] 
https://github.com/LinkedDataFragments/Server.js#configure-the-data-sources



--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suomi...@helsinki.fi
http://www.nationallibrary.fi

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-12 Thread Markus Krötzsch

On 12.02.2016 00:04, Stas Malyshev wrote:

Hi!


We basically have two choices: either we offer a limited interface that only
allows a narrow range of queries to be run, or we offer a very general
interface that can run arbitrary queries, but we impose limits on time and
memory consumption. I would actually prefer the first option, because it's more
predictable and doesn't get people's hopes up too far. What do you think?


That would require implementing a pretty smart SPARQL parser... I don't
think it's worth the investment of time. I'd rather put caps on runtime
and maybe also on parallel queries per IP, to ensure fair access. We may
also have a way to run longer queries - in fact, we'll need it anyway if
we want to automate lists - but that is longer term; we'll need to
figure out infrastructure for that and how we allocate access.


+1

Restricting queries syntactically to be "simpler" is what we did in 
Semantic MediaWiki (because MySQL did not support time/memory limits per 
query). It is a workaround, but it will not prevent long-running queries 
unless you make the syntactic restrictions really severe (and thereby 
forbid many simple queries, too). I would not do it if there is support 
for time/memory limits instead.


In the end, even the SPARQL engines are not able to predict reliably how 
complicated a query is going to be -- it's an important part of their 
work (for optimising query execution), but it is also very difficult.
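
Which is why the practical safeguard is what Stas describes above: a hard 
cap on runtime plus a limit on parallel queries per IP, enforced in front 
of the engine. A minimal, hypothetical sketch of such a front-end check 
(not an existing WDQS component) could look like this:

    import threading
    from collections import defaultdict

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"
    MAX_PARALLEL_PER_IP = 2      # cap on concurrent queries per client IP
    MAX_QUERY_SECONDS = 30       # hard runtime cap per query

    _slots = defaultdict(lambda: threading.BoundedSemaphore(MAX_PARALLEL_PER_IP))

    def run_query(client_ip, sparql):
        slot = _slots[client_ip]
        if not slot.acquire(blocking=False):
            raise RuntimeError("too many parallel queries from " + client_ip)
        try:
            # the timeout caps how long we wait for the engine; a real proxy
            # would also cancel the query on the backend when it fires
            response = requests.get(ENDPOINT,
                                    params={"query": sparql, "format": "json"},
                                    timeout=MAX_QUERY_SECONDS)
            response.raise_for_status()
            return response.json()
        finally:
            slot.release()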


Markus






___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata