Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-16 Thread Kingsley Idehen
On 2/13/16 6:29 PM, Markus Kroetzsch wrote:
> On 13.02.2016 23:56, Kingsley Idehen wrote:
>> On 2/13/16 4:56 PM, Markus Kroetzsch wrote:
> ...
>>
>> For a page size of 10 (covered by LIMIT) you can move through offsets of
>> 10 via:
>
> To clarify: I just added the LIMIT to prevent unwary readers from
> killing their browser on a 100MB HTML result page. The server does not
> need it at all and can give you all results at once. Online
> applications may still want to scroll results, I agree, but for the OP
> it would be more useful to just download one file here.
>
> Markus

Scrolling or paging through query solutions is a technique that benefits
both clients and servers. Understanding the concept has to be part of the
narrative for working with SPARQL query solutions.

This is about flexibility through use of the full functionality of SPARQL,
since most developers and users simply execute queries without factoring in
these techniques or the impact of their queries on other users of the system.

-- 
Regards,

Kingsley Idehen   
Founder & CEO 
OpenLink Software 
Company Web: http://www.openlinksw.com
Personal Weblog 1: http://kidehen.blogspot.com
Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
Twitter Profile: https://twitter.com/kidehen
Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen
Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-16 Thread Kingsley Idehen
On 2/13/16 6:26 PM, Markus Kroetzsch wrote:
> On 13.02.2016 23:50, Kingsley Idehen wrote:
> ...
>> Markus and others interested in this matter,
>>
>> What about using OFFSET and LIMIT to address this problem? That's what
>> we advise users of the DBpedia endpoint (and others we publish) to do.
>>
>> We have to educate people about query implications and options. Even
>> after that, you have the issue of timeouts (which aren't part of the
>> SPARQL spec) that can be used to produce partial results (notified via
>> HTTP headers), but that's something that comes after the basic scrolling
>> functionality of OFFSET and LIMIT is understood.
>
> I think this does not help here. If I only ask for part of the data
> (see my previous email), I can get all 300K results in 9.3sec. The
> size of the result does not seem to be the issue. If I add further
> joins to the query, the time needed seems to go above 10sec (timeout)
> even with a LIMIT. Note that you need to order results for using LIMIT
> in a reliable way, since the data changes by the minute and the
> "natural" order of results would change as well. I guess with a
> blocking operator like ORDER BY in the equation, the use of LIMIT does
> not really save much time (other than for final result serialisation
> and transfer, which seems pretty quick).
>
> Markus

Markus,

LIMIT isn't the key element in my example, since all it does is set the
cursor (page) size. It's the use of OFFSET to move the cursor through
positions in the solution that's key here.

Fundamentally, this is about using HTTP GET requests to page through the
data when a single query solution is either too large or its preparation
exceeds the underlying DBMS timeout settings.
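For illustration, here is a minimal client-side sketch of that paging pattern
(an assumption-laden example: it presumes Python 3 with the "requests" package
installed, and the page size is arbitrary):

import requests

ENDPOINT = "https://query.wikidata.org/sparql"
PAGE_SIZE = 10000  # illustrative page size

QUERY_TEMPLATE = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?gndId WHERE {
  ?item wdt:P227 ?gndId ;  # get GND ID
        wdt:P31  wd:Q5  .  # instance of human
} ORDER BY ASC(?gndId) OFFSET %d LIMIT %d
"""

offset = 0
while True:
    # Each page is an ordinary HTTP GET against the endpoint;
    # only the OFFSET value changes between requests.
    response = requests.get(
        ENDPOINT,
        params={"query": QUERY_TEMPLATE % (offset, PAGE_SIZE), "format": "json"},
    )
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    if not bindings:
        break  # past the last page
    for row in bindings:
        print(row["item"]["value"], row["gndId"]["value"])
    offset += PAGE_SIZE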

Ultimately, developers have to understand these time-tested techniques
for working with data.

Kingsley
>
>>
>> [1]
>> http://stackoverflow.com/questions/20937556/how-to-get-all-companies-from-dbpedia
>>
>> [2] https://sourceforge.net/p/dbpedia/mailman/message/29172307/
>>
>>
>>
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>
>


-- 
Regards,

Kingsley Idehen   
Founder & CEO 
OpenLink Software 
Company Web: http://www.openlinksw.com
Personal Weblog 1: http://kidehen.blogspot.com
Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
Twitter Profile: https://twitter.com/kidehen
Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen
Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-16 Thread Neubert, Joachim
Thanks Markus, I've created https://phabricator.wikimedia.org/T127070 with the 
details.

-----Original Message-----
From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On Behalf Of
Markus Krötzsch
Sent: Tuesday, 16 February 2016 14:57
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] SPARQL CONSTRUCT results truncated

Hi Joachim,

I think SERVICE queries should be working, but maybe Stas knows more about 
this. Even if they are disabled, this should result in some error message 
rather than a NullPointerException. Looks like a bug.

Markus


On 16.02.2016 13:56, Neubert, Joachim wrote:
> Hi Markus,
>
> Great that you checked that out. I can confirm that the simplified query 
> worked for me, too. It took 15.6s and revealed roughly the same number of 
> results (323789).
>
> When I loaded the results into http://zbw.eu/beta/sparql/econ_pers/query, an 
> endpoint for "economics-related" persons, it matched with 36050 persons 
> (supposedly the "most important" 8 percent of our set).
>
> What I would normally do to get the corresponding Wikipedia site URLs is a query 
> against the Wikidata endpoint, which references the relevant Wikidata URIs 
> via a "service" clause:
>
> PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
> PREFIX schema: <http://schema.org/>
> #
> construct {
>?gnd schema:about ?sitelink .
> }
> where {
>service <http://zbw.eu/beta/sparql/econ_pers/query> {
>  ?gnd skos:prefLabel [] ;
>   skos:exactMatch ?wd .
>  filter(contains(str(?wd), 'wikidata'))
>}
>?sitelink schema:about ?wd ;
>  schema:inLanguage ?language .
>filter (contains(str(?sitelink), 'wikipedia'))
>filter (lang(?wdLabel) = ?language && ?language in ('en', 'de')) }
>
> This, however, results in a Java error.
>
> If "service" clauses are supposed to work in the Wikidata endpoint, I'd 
> happily provide additional details in Phabricator.
>
> For now, I'll get the data via your Java example code :)
>
> Cheers, Joachim
>
> -----Original Message-----
> From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On Behalf Of
> Markus Kroetzsch
> Sent: Saturday, 13 February 2016 22:56
> To: Discussion list for the Wikidata project.
> Subject: Re: [Wikidata] SPARQL CONSTRUCT results truncated
>
> And here is another comment on this interesting topic :-)
>
> I just realised how close the service is to answering the query. It turns out 
> that you can in fact get the whole set of (currently >324000 result items) 
> together with their GND identifiers as a download *within the timeout* (I 
> tried several times without any errors). This is a 63M json result file with 
> >640K individual values, and it downloads in no time on my home network. The 
> query I use is simply this:
>
> PREFIX wd: <http://www.wikidata.org/entity/>
> PREFIX wdt: <http://www.wikidata.org/prop/direct/>
>
> select ?item ?gndId
> where {
>   ?item wdt:P227 ?gndId ; # get gnd ID
>         wdt:P31  wd:Q5  . # instance of human
> } ORDER BY ASC(?gndId) LIMIT 10
>
> (don't run this in vain: even with the limit, the ORDER clause 
> requires the service to compute all results every time someone runs 
> this. Also be careful when removing the limit; your browser may hang 
> on an HTML page that large; better use the SPARQL endpoint directly to 
> download the complete result file.)
>
> It seems that the timeout is only hit when adding more information (labels 
> and wiki URLs) to the result.
>
> So it seems that we are not actually very far away from being able to answer 
> the original query even within the timeout. Certainly not as far away as I 
> first thought. It might not be necessary at all to switch to a different 
> approach (though it would be interesting to know how long LDF takes to answer 
> the above -- our current service takes less than 10sec).
>
> Cheers,
>
> Markus
>
>
> On 13.02.2016 11:40, Peter Haase wrote:
>> Hi,
>>
>> you may want to check out the Linked Data Fragment server in Blazegraph:
>> https://github.com/blazegraph/BlazegraphBasedTPFServer
>>
>> Cheers,
>> Peter
>>> On 13.02.2016, at 01:33, Stas Malyshev  wrote:
>>>
>>> Hi!
>>>
>>>> The Linked data fragments approach Osma mentioned is very 
>>>> interesting (particularly the bit about setting it up on top of a 
>>>> regularly updated existing endpoint), and could provide another 
>>>> alternative, but I have not yet experimented with it.
>>>
>

Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-16 Thread Markus Krötzsch

Hi Joachim,

I think SERVICE queries should be working, but maybe Stas knows more 
about this. Even if they are disabled, this should result in some error 
message rather than a NullPointerException. Looks like a bug.


Markus


On 16.02.2016 13:56, Neubert, Joachim wrote:

Hi Markus,

Great that you checked that out. I can confirm that the simplified query worked 
for me, too. It took 15.6s and revealed roughly the same number of results 
(323789).

When I loaded the results into http://zbw.eu/beta/sparql/econ_pers/query, an endpoint for 
"economics-related" persons, it matched with 36050 persons (supposedly the "most 
important" 8 percent of our set).

What I would normally do to get the corresponding Wikipedia site URLs is a query against the 
Wikidata endpoint, which references the relevant Wikidata URIs via a "service" 
clause:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX schema: <http://schema.org/>
#
construct {
   ?gnd schema:about ?sitelink .
}
where {
   service <http://zbw.eu/beta/sparql/econ_pers/query> {
 ?gnd skos:prefLabel [] ;
  skos:exactMatch ?wd .
 filter(contains(str(?wd), 'wikidata'))
   }
   ?sitelink schema:about ?wd ;
 schema:inLanguage ?language .
   filter (contains(str(?sitelink), 'wikipedia'))
   filter (lang(?wdLabel) = ?language && ?language in ('en', 'de'))
}

This, however, results in a Java error.

If "service" clauses are supposed to work in the Wikidata endpoint, I'd happily 
provide additional details in Phabricator.

For now, I'll get the data via your Java example code :)

Cheers, Joachim

-----Original Message-----
From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On Behalf Of
Markus Kroetzsch
Sent: Saturday, 13 February 2016 22:56
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] SPARQL CONSTRUCT results truncated

And here is another comment on this interesting topic :-)

I just realised how close the service is to answering the query. It turns out that 
you can in fact get the whole set of (currently >324000 result items) together 
with their GND identifiers as a download *within the timeout* (I tried several times 
without any errors). This is a 63M json result file with >640K individual values, 
and it downloads in no time on my home network. The query I use is simply this:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select ?item ?gndId
where {
  ?item wdt:P227 ?gndId ; # get gnd ID
        wdt:P31  wd:Q5  . # instance of human
} ORDER BY ASC(?gndId) LIMIT 10

(don't run this in vain: even with the limit, the ORDER clause requires the 
service to compute all results every time someone runs this. Also be careful 
when removing the limit; your browser may hang on an HTML page that large; 
better use the SPARQL endpoint directly to download the complete result file.)

It seems that the timeout is only hit when adding more information (labels and 
wiki URLs) to the result.

So it seems that we are not actually very far away from being able to answer 
the original query even within the timeout. Certainly not as far away as I 
first thought. It might not be necessary at all to switch to a different 
approach (though it would be interesting to know how long LDF takes to answer 
the above -- our current service takes less than 10sec).

Cheers,

Markus


On 13.02.2016 11:40, Peter Haase wrote:

Hi,

you may want to check out the Linked Data Fragment server in Blazegraph:
https://github.com/blazegraph/BlazegraphBasedTPFServer

Cheers,
Peter

On 13.02.2016, at 01:33, Stas Malyshev  wrote:

Hi!


The Linked data fragments approach Osma mentioned is very
interesting (particularly the bit about setting it up on top of a
regularly updated existing endpoint), and could provide another
alternative, but I have not yet experimented with it.


There is apparently this:
https://github.com/CristianCantoro/wikidataldf
though I am not sure what its status is - I just found it.

In general, yes, I think checking out LDF may be a good idea. I'll
put it on my todo list.

--
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata

Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-16 Thread Neubert, Joachim
Hi Markus,

Great that you checked that out. I can confirm that the simplified query worked 
for me, too. It took 15.6s and revealed roughly the same number of results 
(323789).

When I loaded the results into http://zbw.eu/beta/sparql/econ_pers/query, an 
endpoint for "economics-related" persons, it matched with 36050 persons 
(supposedly the "most important" 8 percent of our set).

What I would normally do to get the corresponding Wikipedia site URLs is a query 
against the Wikidata endpoint, which references the relevant Wikidata URIs via 
a "service" clause:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX schema: <http://schema.org/>
#
construct {
  ?gnd schema:about ?sitelink .
}
where {
  service <http://zbw.eu/beta/sparql/econ_pers/query> {
?gnd skos:prefLabel [] ;
 skos:exactMatch ?wd .
filter(contains(str(?wd), 'wikidata'))
  }
  ?sitelink schema:about ?wd ;
schema:inLanguage ?language .
  filter (contains(str(?sitelink), 'wikipedia'))
  filter (lang(?wdLabel) = ?language && ?language in ('en', 'de'))
}

This, however, results in a Java error.

If "service" clauses are supposed to work in the Wikidata endpoint, I'd happily 
provide additional details in Phabricator.

For now, I'll get the data via your Java example code :)

Cheers, Joachim

-----Original Message-----
From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On Behalf Of
Markus Kroetzsch
Sent: Saturday, 13 February 2016 22:56
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] SPARQL CONSTRUCT results truncated

And here is another comment on this interesting topic :-)

I just realised how close the service is to answering the query. It turns out 
that you can in fact get the whole set of (currently >324000 result items) 
together with their GND identifiers as a download *within the timeout* (I tried 
several times without any errors). This is a 63M json result file with >640K 
individual values, and it downloads in no time on my home network. The query I 
use is simply this:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select ?item ?gndId
where {
  ?item wdt:P227 ?gndId ; # get gnd ID
        wdt:P31  wd:Q5  . # instance of human
} ORDER BY ASC(?gndId) LIMIT 10

(don't run this in vain: even with the limit, the ORDER clause requires the 
service to compute all results every time someone runs this. Also be careful 
when removing the limit; your browser may hang on an HTML page that large; 
better use the SPARQL endpoint directly to download the complete result file.)

It seems that the timeout is only hit when adding more information (labels and 
wiki URLs) to the result.

So it seems that we are not actually very far away from being able to answer 
the original query even within the timeout. Certainly not as far away as I 
first thought. It might not be necessary at all to switch to a different 
approach (though it would be interesting to know how long LDF takes to answer 
the above -- our current service takes less than 10sec).

Cheers,

Markus


On 13.02.2016 11:40, Peter Haase wrote:
> Hi,
>
> you may want to check out the Linked Data Fragment server in Blazegraph:
> https://github.com/blazegraph/BlazegraphBasedTPFServer
>
> Cheers,
> Peter
>> On 13.02.2016, at 01:33, Stas Malyshev  wrote:
>>
>> Hi!
>>
>>> The Linked data fragments approach Osma mentioned is very 
>>> interesting (particularly the bit about setting it up on top of a 
>>> regularly updated existing endpoint), and could provide another 
>>> alternative, but I have not yet experimented with it.
>>
>> There is apparently this: 
>> https://github.com/CristianCantoro/wikidataldf
>> though I am not sure what its status is - I just found it.
>>
>> In general, yes, I think checking out LDF may be a good idea. I'll 
>> put it on my todo list.
>>
>> --
>> Stas Malyshev
>> smalys...@wikimedia.org
>>
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>


--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-13 Thread Stas Malyshev
Hi!

> you may want to check out the Linked Data Fragment server in Blazegraph:
> https://github.com/blazegraph/BlazegraphBasedTPFServer

Thanks, I will check it out!

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-13 Thread Stas Malyshev
Hi!

> [1] Is the service protected against internet crawlers that find such
> links in the online logs of this email list? It would be a pity if we
> would have to answer this query tens of thousands of times for many
> years to come just to please some spiders who have no use for the result.

That's a very good point. We currently do not have a robots.txt file on
the service. We should have one. I'll fix it ASAP.

GUI links do not run the query until clicked, so they are safe from bots
anyway. But direct links to the SPARQL endpoint do run the query (it's the
API, after all :) So a robots.txt is needed there.
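For concreteness, a sketch of the kind of rule meant here (the exact path is an
assumption for illustration; the real file may well look different):

# illustrative only: write a robots.txt that keeps crawlers off the endpoint
ROBOTS_TXT = """User-agent: *
Disallow: /sparql
"""

with open("robots.txt", "w") as f:
    f.write(ROBOTS_TXT)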

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-13 Thread Markus Kroetzsch

On 13.02.2016 23:56, Kingsley Idehen wrote:

On 2/13/16 4:56 PM, Markus Kroetzsch wrote:

...


For a page size of 10 (covered by LIMIT) you can move through offsets of
10 via:


To clarify: I just added the LIMIT to prevent unwary readers from 
killing their browser on a 100MB HTML result page. The server does not 
need it at all and can give you all results at once. Online applications 
may still want to scroll results, I agree, but for the OP it would be 
more useful to just download one file here.


Markus




First call:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select ?item ?gndId
where {
   ?item wdt:P227 ?gndId ; # get gnd ID
 wdt:P31  wd:Q5  . # instance of human
} ORDER BY ASC(?gndId) OFFSET 10 LIMIT 10


Next call:


PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select ?item ?gndId
where {
   ?item wdt:P227 ?gndId ; # get gnd ID
 wdt:P31  wd:Q5  . # instance of human
} ORDER BY ASC(?gndId) OFFSET 20 LIMIT 10

Subsequent Calls:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select ?item ?gndId
where {
   ?item wdt:P227 ?gndId ; # get gnd ID
         wdt:P31  wd:Q5  . # instance of human
} ORDER BY ASC(?gndId) OFFSET {last-offset-plus-10} LIMIT 10


Remember, you simply change the OFFSET value in the SPARQL HTTP URL.



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-13 Thread Markus Kroetzsch

On 13.02.2016 23:50, Kingsley Idehen wrote:
...

Markus and others interested in this matter,

What about using OFFSET and LIMIT to address this problem? That's what
we advise users of the DBpedia endpoint (and others we publish) to do.

We have to educate people about query implications and options. Even
after that, you have the issue of timeouts (which aren't part of the
SPARQL spec) that can be used to produce partial results (notified via
HTTP headers), but that's something that comes after the basic scrolling
functionality of OFFSET and LIMIT is understood.


I think this does not help here. If I only ask for part of the data (see 
my previous email), I can get all 300K results in 9.3sec. The size of 
the result does not seem to be the issue. If I add further joins to the 
query, the time needed seems to go above 10sec (timeout) even with a 
LIMIT. Note that you need to order results for using LIMIT in a reliable 
way, since the data changes by the minute and the "natural" order of 
results would change as well. I guess with a blocking operator like 
ORDER BY in the equation, the use of LIMIT does not really save much 
time (other than for final result serialisation and transfer, which 
seems pretty quick).


Markus



[1]
http://stackoverflow.com/questions/20937556/how-to-get-all-companies-from-dbpedia
[2] https://sourceforge.net/p/dbpedia/mailman/message/29172307/



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-13 Thread Kingsley Idehen
On 2/13/16 4:56 PM, Markus Kroetzsch wrote:
> And here is another comment on this interesting topic :-)
>
> I just realised how close the service is to answering the query. It
> turns out that you can in fact get the whole set of (currently >324000
> result items) together with their GND identifiers as a download
> *within the timeout* (I tried several times without any errors). This
> is a 63M json result file with >640K individual values, and it
> downloads in no time on my home network. The query I use is simply this:
>
> PREFIX wd: <http://www.wikidata.org/entity/>
> PREFIX wdt: <http://www.wikidata.org/prop/direct/>
>
> select ?item ?gndId
> where {
>   ?item wdt:P227 ?gndId ; # get gnd ID
> wdt:P31  wd:Q5  . # instance of human
> } ORDER BY ASC(?gndId) LIMIT 10
>
> (don't run this in vain: even with the limit, the ORDER clause
> requires the service to compute all results every time someone runs
> this. Also be careful when removing the limit; your browser may hang
> on an HTML page that large; better use the SPARQL endpoint directly to
> download the complete result file.)
>
> It seems that the timeout is only hit when adding more information
> (labels and wiki URLs) to the result.
>
> So it seems that we are not actually very far away from being able to
> answer the original query even within the timeout. Certainly not as
> far away as I first thought. It might not be necessary at all to
> switch to a different approach (though it would be interesting to know
> how long LDF takes to answer the above -- our current service takes
> less than 10sec).
>
> Cheers,
>
> Markus 

For a page size of 10 (covered by LIMIT) you can move through offsets of
10 via:


First call:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select ?item ?gndId
where {
  ?item wdt:P227 ?gndId ; # get gnd ID
wdt:P31  wd:Q5  . # instance of human
} ORDER BY ASC(?gndId) OFFSET 10 LIMIT 10


Next call:


PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select ?item ?gndId
where {
  ?item wdt:P227 ?gndId ; # get gnd ID
wdt:P31  wd:Q5  . # instance of human
} ORDER BY ASC(?gndId) OFFSET 20 LIMIT 10

Subsequent Calls:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select ?item ?gndId
where {
  ?item wdt:P227 ?gndId ; # get gnd ID
        wdt:P31  wd:Q5  . # instance of human
} ORDER BY ASC(?gndId) OFFSET {last-offset-plus-10} LIMIT 10


Remember, you simply change the OFFSET value in the SPARQL HTTP URL.
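As an illustration of that last point (a sketch only, using the Python standard
library; the OFFSET value shown is just one page position):

from urllib.parse import urlencode

query = """PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
select ?item ?gndId
where {
  ?item wdt:P227 ?gndId ; # get gnd ID
        wdt:P31  wd:Q5  . # instance of human
} ORDER BY ASC(?gndId) OFFSET 20 LIMIT 10"""

# Each subsequent page is the same endpoint URL with only the OFFSET changed.
url = "https://query.wikidata.org/sparql?" + urlencode({"query": query, "format": "json"})
print(url)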

-- 
Regards,

Kingsley Idehen   
Founder & CEO 
OpenLink Software 
Company Web: http://www.openlinksw.com
Personal Weblog 1: http://kidehen.blogspot.com
Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
Twitter Profile: https://twitter.com/kidehen
Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen
Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-13 Thread Kingsley Idehen
On 2/11/16 9:25 AM, Markus Krötzsch wrote:
> On 11.02.2016 15:01, Gerard Meijssen wrote:
>> Hoi,
>> What I hear is that the intentions were wrong in that you did not
>> anticipate people to get actual meaningful requests out of it.
>>
>> When you state "we have two choices", you imply that it is my choice as
>> well. It is not. The answer that I am looking for is yes, it does not
>> function as we would like, we are working on it and in the mean time we
>> will ensure that toolkit is available on Labs for the more complex
>> queries.
>>
>> Wikidata is a service and the service is in need of being better.
>
> Gerard, do you realise how far away from technical reality your wishes
> are? We are far ahead of the state of the art in what we already have
> for Wikidata: two powerful live query services + a free toolkit for
> batch analyses + several Web APIs for live lookups. I know of no site
> of this scale that is anywhere near this in terms of functionality.
> You can always ask for more, but you should be a bit reasonable too,
> or people will just ignore you.
>
> Markus 

Markus and others interested in this matter,

What about using OFFSET and LIMIT to address this problem? That's what
we advise users of the DBpedia endpoint (and others we publish) to do.

We have to educate people about query implications and options. Even
after that, you have the issue of timeouts (which aren't part of the
SPARQL spec) that can be used to produce partial results (notified via
HTTP headers), but that's something that comes after the basic scrolling
functionality of OFFSET and LIMIT is understood.

[1]
http://stackoverflow.com/questions/20937556/how-to-get-all-companies-from-dbpedia
[2] https://sourceforge.net/p/dbpedia/mailman/message/29172307/

-- 
Regards,

Kingsley Idehen   
Founder & CEO 
OpenLink Software 
Company Web: http://www.openlinksw.com
Personal Weblog 1: http://kidehen.blogspot.com
Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
Twitter Profile: https://twitter.com/kidehen
Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen
Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-13 Thread Markus Kroetzsch

On 13.02.2016 22:56, Markus Kroetzsch wrote:

And here is another comment on this interesting topic :-)

I just realised how close the service is to answering the query. It
turns out that you can in fact get the whole set of (currently >324000
result items) together with their GND identifiers as a download *within
the timeout* (I tried several times without any errors). This is a 63M
json result file with >640K individual values, and it downloads in no
time on my home network. The query I use is simply this:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select ?item ?gndId
where {
   ?item wdt:P227 ?gndId ; # get gnd ID
 wdt:P31  wd:Q5  . # instance of human
} ORDER BY ASC(?gndId) LIMIT 10

(don't run this in vain: even with the limit, the ORDER clause requires
the service to compute all results every time someone runs this. Also be
careful when removing the limit; your browser may hang on an HTML page
that large; better use the SPARQL endpoint directly to download the
complete result file.)


P.S. For those who are interested, here is the direct link to the 
complete result [1]:

https://query.wikidata.org/sparql?query=PREFIX+wd%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0D%0APREFIX+wdt%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0D%0Aselect+%3Fitem+%3FgndId+where+{+%3Fitem+wdt%3AP227+%3FgndId+%3B+wdt%3AP31++wd%3AQ5+.+}+ORDER+BY+ASC%28%3FgndId%29&format=json
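For anyone scripting that download instead of using a browser, a hedged sketch
of the same request (assuming Python 3 with the "requests" package; it fetches
the full JSON result in a single call rather than paging):

import requests

query = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
select ?item ?gndId where {
  ?item wdt:P227 ?gndId ;
        wdt:P31  wd:Q5 .
} ORDER BY ASC(?gndId)
"""

r = requests.get("https://query.wikidata.org/sparql",
                 params={"query": query, "format": "json"})
r.raise_for_status()
rows = r.json()["results"]["bindings"]
print(len(rows), "item/GND pairs")  # >640K values, roughly 63MB of JSON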

Markus

[1] Is the service protected against internet crawlers that find such 
links in the online logs of this email list? It would be a pity if we 
had to answer this query tens of thousands of times for many years to 
come just to please some spiders who have no use for the result.




It seems that the timeout is only hit when adding more information
(labels and wiki URLs) to the result.

So it seems that we are not actually very far away from being able to
answer the original query even within the timeout. Certainly not as far
away as I first thought. It might not be necessary at all to switch to a
different approach (though it would be interesting to know how long LDF
takes to answer the above -- our current service takes less than 10sec).

Cheers,

Markus


On 13.02.2016 11:40, Peter Haase wrote:

Hi,

you may want to check out the Linked Data Fragment server in Blazegraph:
https://github.com/blazegraph/BlazegraphBasedTPFServer

Cheers,
Peter

On 13.02.2016, at 01:33, Stas Malyshev  wrote:

Hi!


The Linked data fragments approach Osma mentioned is very interesting
(particularly the bit about setting it up on top of a regularly
updated existing endpoint), and could provide another alternative,
but I have not yet experimented with it.


There is apparently this: https://github.com/CristianCantoro/wikidataldf
though I am not sure what its status is - I just found it.

In general, yes, I think checking out LDF may be a good idea. I'll put
it on my todo list.

--
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata







--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-13 Thread Markus Kroetzsch

And here is another comment on this interesting topic :-)

I just realised how close the service is to answering the query. It 
turns out that you can in fact get the whole set of (currently >324000 
result items) together with their GND identifiers as a download *within 
the timeout* (I tried several times without any errors). This is a 63M 
json result file with >640K individual values, and it downloads in no 
time on my home network. The query I use is simply this:


PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select ?item ?gndId
where {
  ?item wdt:P227 ?gndId ; # get gnd ID
wdt:P31  wd:Q5  . # instance of human
} ORDER BY ASC(?gndId) LIMIT 10

(don't run this in vain: even with the limit, the ORDER clause requires 
the service to compute all results every time someone runs this. Also be 
careful when removing the limit; your browser may hang on an HTML page 
that large; better use the SPARQL endpoint directly to download the 
complete result file.)


It seems that the timeout is only hit when adding more information 
(labels and wiki URLs) to the result.


So it seems that we are not actually very far away from being able to 
answer the original query even within the timeout. Certainly not as far 
away as I first thought. It might not be necessary at all to switch to a 
different approach (though it would be interesting to know how long LDF 
takes to answer the above -- our current service takes less than 10sec).


Cheers,

Markus


On 13.02.2016 11:40, Peter Haase wrote:

Hi,

you may want to check out the Linked Data Fragment server in Blazegraph:
https://github.com/blazegraph/BlazegraphBasedTPFServer

Cheers,
Peter

On 13.02.2016, at 01:33, Stas Malyshev  wrote:

Hi!


The Linked data fragments approach Osma mentioned is very interesting
(particularly the bit about setting it up on top of a regularly
updated existing endpoint), and could provide another alternative,
but I have not yet experimented with it.


There is apparently this: https://github.com/CristianCantoro/wikidataldf
though I am not sure what its status is - I just found it.

In general, yes, I think checking out LDF may be a good idea. I'll put
it on my todo list.

--
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-13 Thread Peter Haase
Hi,

you may want to check out the Linked Data Fragment server in Blazegraph:
https://github.com/blazegraph/BlazegraphBasedTPFServer

Cheers,
Peter
> On 13.02.2016, at 01:33, Stas Malyshev  wrote:
> 
> Hi!
> 
>> The Linked data fragments approach Osma mentioned is very interesting
>> (particularly the bit about setting it up on top of a regularly
>> updated existing endpoint), and could provide another alternative,
>> but I have not yet experimented with it.
> 
> There is apparently this: https://github.com/CristianCantoro/wikidataldf
> though I am not sure what its status is - I just found it.
> 
> In general, yes, I think checking out LDF may be a good idea. I'll put
> it on my todo list.
> 
> -- 
> Stas Malyshev
> smalys...@wikimedia.org
> 
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-12 Thread Stas Malyshev
Hi!

> The Linked data fragments approach Osma mentioned is very interesting
> (particularly the bit about setting it up on top of a regularly
> updated existing endpoint), and could provide another alternative,
> but I have not yet experimented with it.

There is apparently this: https://github.com/CristianCantoro/wikidataldf
though I am not sure what its status is - I just found it.

In general, yes, I think checking out LDF may be a good idea. I'll put
it on my todo list.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-12 Thread Neubert, Joachim
It's great how this discussion evolves - thanks to everybody!

Technically, I completely agree that in practice it may prove impossible to 
predict the load a query will produce. Relational databases have invested years 
and years in query optimization (e.g., Oracle's cost-based optimizer, which 
relies on extended statistics gathered at runtime), and I can't see that 
similar investments are possible for triple stores.

What I could imagine for public endpoints is the SPARQL engine monitoring and 
prioritizing queries: the longer a query has already run, or the more resources 
it has already used, the lower the priority it is re-scheduled with (up to some 
final limit). But this is just a theoretical consideration; I'm not aware of any 
system that implements anything like this - and it could only be implemented in 
the engine itself.
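As a purely illustrative toy model of that idea (nothing here corresponds to an
existing SPARQL engine feature; all names are invented for the sketch):

import time

class AgingScheduler:
    # Toy model: the longer a query has already run, the lower its priority.
    def __init__(self):
        self._running = {}  # query_id -> start timestamp

    def start(self, query_id):
        self._running[query_id] = time.time()

    def finish(self, query_id):
        self._running.pop(query_id, None)

    def next_to_serve(self):
        # Serve the query that has consumed the least time so far; long-running
        # queries sink in priority (up to some final limit, where they would be
        # cancelled outright).
        if not self._running:
            return None
        return min(self._running, key=lambda q: time.time() - self._running[q])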

For ZBW's SPARQL endpoints, I've implemented a much simpler three-level 
strategy, which does not involve the engine at all:

1. Endpoints which drive production-level services (e.g. autosuggest or 
retrieval enhancement functions). These endpoints run on separate machines and 
offer completely encapsulated services via a public API 
(http://zbw.eu/beta/econ-ws), without any direct SPARQL access.

2. Public "beta" endpoints (http://zbw.eu/beta/sparql). These offer 
unrestricted SPARQL access, but without any garanties about performance or 
availability - though of course I do my best to keep these up and running. They 
run on an own virtual machine, and should not hurt any other parts of the 
infrastructure when getting overloaded or out of control.

3. Public "experimental" endpoints. These include in particular an endpoint for 
the GND dataset with 130m triples. It was mainly created for internal use 
because (to the best of my knowledge) no other public GND endpoint exists. The 
endpoint is not linked from the GND pages of DNB, and I've advertised it very 
low-key on a few mailing lists. For these experimental endpoints, we reserve 
the right to shut them down for the public if they get flooded with more 
requests than they can handle.

It may be of interest that, up to now, on none of these public endpoints have we 
come across issues with attacks or malicious queries (which were a matter of 
concern when I started this in 2009), nor with longer-lasting massive access. 
Of course, that is different for Wikidata, where the data is of interest to 
_many_ more people. But if at all affordable, I'd like to encourage offering 
some kind of experimental access with really wide limits in an "unstable" 
setting, in addition to the reliable services. For most people who just want to 
check something out, it's not an option to download the whole dataset and set 
up an infrastructure for it. For us, this was an issue even with the much 
smaller GND set.

The Linked data fragments approach Osma mentioned is very interesting 
(particularly the bit about setting it up on top of a regularly updated 
existing endpoint), and could provide another alternative, but I have not yet 
experimented with it.

Have a fine weekend - Joachim

-----Original Message-----
From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On Behalf Of
Markus Krötzsch
Sent: Friday, 12 February 2016 09:44
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] SPARQL CONSTRUCT results truncated

On 12.02.2016 00:04, Stas Malyshev wrote:
> Hi!
>
>> We basically have two choices: either we offer a limited interface 
>> that only allows for a narrow range of queries to be run at all. Or 
>> we offer a very general interface that can run arbitrary queries, but 
>> we impose limits on time and memory consumption. I would actually 
>> prefer the first option, because it's more predictable, and doesn't get 
>> people's hopes up too far. What do you think?
>
> That would require implementing a pretty smart SPARQL parser... I don't 
> think it's worth the investment of time. I'd rather put caps on runtime 
> and maybe also on parallel queries per IP, to ensure fair access. We 
> may also have a way to run longer queries - in fact, we'll need it 
> anyway if we want to automate lists - but that is longer term, we'll 
> need to figure out infrastructure for that and how we allocate access.

+1

Restricting queries syntactically to be "simpler" is what we did in Semantic 
MediaWiki (because MySQL did not support time/memory limits per query). It is a 
workaround, but it will not prevent long-running queries unless you make the 
syntactic restrictions really severe (and thereby forbid many simple queries, 
too). I would not do it if there is support for time/memory limits instead.

In the end, even the SPARQL engines are not able to predict reliably how 
complicated a query is going to be -- it's an important part of their work (for 

Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-12 Thread Markus Krötzsch

On 12.02.2016 10:01, Osma Suominen wrote:

12.02.2016, 10:43, Markus Krötzsch wrote:


Restricting queries syntactically to be "simpler" is what we did in
Semantic MediaWiki (because MySQL did not support time/memory limits per
query). It is a workaround, but it will not prevent long-running queries
unless you make the syntactic restrictions really severe (and thereby
forbid many simple queries, too). I would not do it if there is support
for time/memory limits instead.


Would providing a Linked Data Fragments server [1] help here? It seems
to be designed exactly for situations like this, where you want to
provide a SPARQL query service over a large amount of linked data, but are
worried about server performance, particularly for complex, long-running
queries. Linked Data Fragments pushes some of the heavy processing to
the client side, which parses and executes the SPARQL queries.

Dynamically updating the data might be an issue here, but some of the
server implementations support running on top of a SPARQL endpoint [2].
I think that from the perspective of the server this means that a heavy,
long-running SPARQL query is broken up already on the client side into
several small, simple SPARQL queries that are relatively easy to serve.


There already is such a service for Wikidata (Cristian Consonni has set 
it up a while ago). You could try if the query would work there. I think 
that such queries would be rather challenging for a server of this type, 
since they require you to iterate almost all of the data client-side. 
Note that both "instance of human" and "has a GND identifier" are not 
very selective properties. In this sense, the queries may not be 
"relatively easy to serve" in this particular case.


Markus



-Osma

[1] http://linkeddatafragments.org/

[2]
https://github.com/LinkedDataFragments/Server.js#configure-the-data-sources





___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-12 Thread Osma Suominen

12.02.2016, 10:43, Markus Krötzsch wrote:


Restricting queries syntactically to be "simpler" is what we did in
Semantic MediaWiki (because MySQL did not support time/memory limits per
query). It is a workaround, but it will not prevent long-running queries
unless you make the syntactic restrictions really severe (and thereby
forbid many simple queries, too). I would not do it if there is support
for time/memory limits instead.


Would providing a Linked Data Fragments server [1] help here? It seems 
to be designed exactly for situations like this, where you want to 
provide a SPARQL query service over a large amount of linked data, but are 
worried about server performance, particularly for complex, long-running 
queries. Linked Data Fragments pushes some of the heavy processing to 
the client side, which parses and executes the SPARQL queries.


Dynamically updating the data might be an issue here, but some of the 
server implementations support running on top of a SPARQL endpoint [2]. 
I think that from the perspective of the server this means that a heavy, 
long-running SPARQL query is broken up already on the client side into 
several small, simple SPARQL queries that are relatively easy to serve.


-Osma

[1] http://linkeddatafragments.org/

[2] 
https://github.com/LinkedDataFragments/Server.js#configure-the-data-sources



--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suomi...@helsinki.fi
http://www.nationallibrary.fi

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-12 Thread Markus Krötzsch

On 12.02.2016 00:04, Stas Malyshev wrote:

Hi!


We basically have two choices: either we offer a limited interface that only
allows for a narrow range of queries to be run at all. Or we offer a very
general interface that can run arbitrary queries, but we impose limits on time
and memory consumption. I would actually prefer the first option, because it's
more predictable, and doesn't get people's hopes up too far. What do you think?


That would require implementing a pretty smart SPARQL parser... I don't
think it's worth the investment of time. I'd rather put caps on runtime
and maybe also on parallel queries per IP, to ensure fair access. We may
also have a way to run longer queries - in fact, we'll need it anyway if
we want to automate lists - but that is longer term, we'll need to
figure out infrastructure for that and how we allocate access.


+1

Restricting queries syntactically to be "simpler" is what we did in 
Semantic MediaWiki (because MySQL did not support time/memory limits per 
query). It is a workaround, but it will not prevent long-running queries 
unless you make the syntactic restrictions really severe (and thereby 
forbid many simple queries, too). I would not do it if there is support 
for time/memory limits instead.


In the end, even the SPARQL engines are not able to predict reliably how 
complicated a query is going to be -- it's an important part of their 
work (for optimising query execution), but it is also very difficult.


Markus






___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Stas Malyshev
Hi!

> For me, it’s perfectly ok when a query runs for 20 minutes, when it
> spares me some hours of setting up a specific environment for one
> specific dataset (and doing it again when I need current data two months
> later). And it would be no issue if the query runs much longer, in
> situations where it competes with several others. But of course, that’s
> not what I want to experience when I use a wikidata service to drive,
> e.g., an autosuggest function for selecting entities.

I understand that, but this is a shared server which is supposed to
serve many users, and if we allowed 20-minute queries to run on this
service, soon enough it would become unusable. This is why we have a
30-second limit on the server.

Now, we have considered having an option for the server or a setup that
allows running longer queries, but currently we don't have one. It would
require some budget allocation and work to make it, so it's not
something we can have right now. There are use cases for very long
queries and very large results; the current public service endpoint is
just not good at serving them, because it's not what it was meant for.

> And do you think the policies and limitations of different access
> strategies could be documented? These could include a high-reliability

I agree that limitations are better documented; the problem is we
don't know everything we may need to document, such as "what are queries
that may be bad". When I see something like "I want to download a
million-row dataset" I know it's probably a bit too much. But I can't
have a hard rule that says 1M-1 is ok, but 1M is too much.

> preferred option). And on the other end of the spectrum something that
> allows people to experiment freely. Finally, the latter kind of

I'm not sure how I could maintain an endpoint that would allow people to
do anything they want and still provide an adequate experience for
everybody. Maybe if we had infinite hardware resources... but we do not.

Otherwise, it is possible - and should not be extremely hard - to set up
one's own instance of the Query Service and use it for experimenting
with heavy lifting. Of course, that would require resources - but
there's no magic here, it'd require resources from us too, both in terms
of hardware and people that would maintain it. So some things we can do
now, some things we would be able to do later, and some things we
probably would not be able to offer with any adequate quality.
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Gerard Meijssen
Hoi,
This is the kind of (technical) feedback that makes sense as it is centred
on need. It acknowledges that more needs to be done as we are not ready for
what we expect of ourselves in the first place.

In this day and age of big data, we are a very public place where a lot of
initiatives gravitate to. If the WMF wants to retain its relevance, it has
to face its challenges. Maybe the WDQS can steal a page out of the
architecture of what Magnus built. It is very much replicable and multiple
instances have been running. This is not to say that it becomes more and
more relevant to have the Wikidata toolkit available from Labs with as many
instances as needed.
Thanks,
 GerardM

On 12 February 2016 at 00:04, Stas Malyshev  wrote:

> Hi!
>
> > We basically have two choices: either we offer a limited interface that
> only
> > allows for a narrow range of queries to be run at all. Or we offer a very
> > general interface that can run arbitrary queries, but we impose limits
> on time
> > and memory consumption. I would actually prefer the first option,
> because it's
> > more predictable, and doesn't get people's hopes up too far. What do you
> think?
>
> That would require implementing a pretty smart SPARQL parser... I don't
> think it's worth the investment of time. I'd rather put caps on runtime
> and maybe also on parallel queries per IP, to ensure fair access. We may
> also have a way to run longer queries - in fact, we'll need it anyway if
> we want to automate lists - but that is longer term, we'll need to
> figure out infrastructure for that and how we allocate access.
>
> --
> Stas Malyshev
> smalys...@wikimedia.org
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Stas Malyshev
Hi!

> We basically have two choices: either we offer a limited interface that only
> allows for a narrow range of queries to be run at all. Or we offer a very
> general interface that can run arbitrary queries, but we impose limits on time
> and memory consumption. I would actually prefer the first option, because it's
> more predictable, and doesn't get people's hopes up too far. What do you 
> think?

That would require implementing a pretty smart SPARQL parser... I don't
think it's worth the investment of time. I'd rather put caps on runtime
and maybe also on parallel queries per IP, to ensure fair access. We may
also have a way to run longer queries - in fact, we'll need it anyway if
we want to automate lists - but that is longer term, we'll need to
figure out infrastructure for that and how we allocate access.
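A toy sketch of such a per-IP cap on parallel queries (illustrative only; it
says nothing about how the actual service enforces its limits):

from collections import defaultdict
from threading import Lock

MAX_PARALLEL_PER_IP = 5  # illustrative cap
_active = defaultdict(int)
_lock = Lock()

def try_acquire(ip):
    # Called when a query arrives; reject it if this client already has too many running.
    with _lock:
        if _active[ip] >= MAX_PARALLEL_PER_IP:
            return False
        _active[ip] += 1
        return True

def release(ip):
    # Called when a query finishes or times out.
    with _lock:
        _active[ip] = max(0, _active[ip] - 1)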

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Stas Malyshev
Hi!

> 5.44s empty result
> 8.60s 2090 triples
> 5.44s empty result
> 22.70s 27352 triples

That looks weirdly random. I'll check out what is going on there.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Markus Krötzsch

Hi Joachim,

Stas would be the right person to discuss service parameters and the 
possible setup of more servers with other parameters. He is part of the 
team at WMF who is in charge of the SPARQL ops.


You note that "it isn’t always obvious what is right and what the 
limitations of a tool are". I think this is the key point here. There is 
not enough experience with the SPARQL service yet to define very clear 
guidelines on what works and what doesn't. On this mailing list, we have 
frequently been reminded to use LIMIT in queries to make sure they 
terminate and don't overstress the server, but I guess this is not part 
of the official documentation you refer to. There was no decision 
against supporting bigger queries either -- it just did not come up as a 
major demand yet, since typical applications that use SPARQL so far 
require 10s to 1000s of results but not 100,000s to millions. To be 
honest, I would not have expected this to work so well in practice that 
it could be considered here. It is interesting to learn that you are 
already using SPARQL for generating custom data exports. It's probably 
not the most typical use of a query service, but at least the query 
language could support this usage in principle.


Cheers,

Markus



On 11.02.2016 19:32, Neubert, Joachim wrote:

Hi Lydia,

I agree on using the right tool for the job. Yet, it isn’t always
obvious what is right and what the limitations of a tool are.

For me, it’s perfectly ok when a query runs for 20 minutes, when it
spares me some hours of setting up a specific environment for one
specific dataset (and doing it again when I need current data two months
later). And it would be no issue if the query runs much longer, in
situations where it competes with several others. But of course, that’s
not what I want to experience when I use a wikidata service to drive,
e.g., an autosuggest function for selecting entities.

So, can you agree to Markus's suggestion that an experimental “unstable”
endpoint could address different use cases and expectations?

And do you think the policies and limitations of different access
strategies could be documented? These could include a high-reliability
interface for a narrow range of queries (as Daniel suggests as his
preferred option). And on the other end of the spectrum something that
allows people to experiment freely. Finally, the latter kind of
interface could allow new patterns of usage to evolve, with perhaps a
few of them worthwhile to become part of an optimized, highly reliable
query set.

I could imagine that such a documentation of (and perhaps discussion on)
different options and access strategies, limitations and tradeoffs could
address Gerard's claim to give people what they need, or at least let them
make informed choices when restrictions are unavoidable.

Cheers, Joachim

From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On Behalf Of
Lydia Pintscher
Sent: Thursday, 11 February 2016 17:55
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] SPARQL CONSTRUCT results truncated

On Thu, Feb 11, 2016 at 5:53 PM Gerard Meijssen
mailto:gerard.meijs...@gmail.com>> wrote:

Hoi,

Markus when you read my reply on the original question you will see
that my approach is different. The first thing that I pointed out
was that a technical assumption has little to do with what people
need. I indicated that when this is the approach, the answer is fix
it. The notion that a large number of returns is outrageous is not
of this time.

My approach was one where I even offered a possible solution, a crutch.

The approach Daniel took was to make me look ridiculous. His choice,
not mine. I stayed polite and told him that his answers are not my
answers and why. The point that I make is that Wikidata is a
service. It will increasingly be used for the most outrageous
queries and people will expect it to work because why else do we put
all this data in there. Why else is this the data hub for Wikipedia.
Why else

Do appreciate that the aim of the WMF is to share in the sum of all
available knowledge. When the current technology is what we have to
make do with, fine for now. Say so, but do not ridicule me for
saying that it is not good enough, it is not now and it will
certainly not be in the future...

Thanks,

GerardM

Gerard, it all boils down to using the right tool for the job. Nothing
more - nothing less. Let's get back to making Wikidata rock.

Cheers
Lydia

--

Lydia Pintscher - http://about.me/lydia.pintscher

Product Manager for Wikidata

Wikimedia Deutschland e.V.

Tempelhofer Ufer 23-24

10963 Berlin

www.wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg
under number 23855 Nz. Recognized as charitable by the Finanzamt für
Körperschaften I Berlin, tax number 27/029/42207.

Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Neubert, Joachim
Hi Lydia,

I agree on using the right tool for the job. Yet, it isn’t always obvious what 
is right and what the limitations of a tool are.

For me, it’s perfectly ok when a query runs for 20 minutes, when it spares me 
some hours of setting up a specific environment for one specific dataset (and 
doing it again when I need current data two months later). And it would be no
issue if the query runs much longer, in situations where it competes with 
several others. But of course, that’s not what I want to experience when I use 
a wikidata service to drive, e.g., an autosuggest function for selecting 
entities.

So, can you agree to Markus's suggestion that an experimental “unstable” endpoint
could solve different use cases and expectations?

And do you think the policies and limitations of different access strategies 
could be documented? These could include a high-reliability interface for a 
narrow range of queries (as Daniel suggests as his preferred option). And on 
the other end of the spectrum, something that allows people to experiment
freely. Finally, the latter kind of interface could allow new patterns of usage
to evolve, with perhaps a few of them worthwhile to become part of an
optimized, highly reliable query set.

I could imagine that such a documentation of (and perhaps discussion on) 
different options and access strategies, limitations and tradeoffs could address
Gerard's claim to give people what they need, or at least let them make informed
choices when restrictions are unavoidable.

Cheers, Joachim

From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On behalf of
Lydia Pintscher
Sent: Thursday, 11 February 2016 17:55
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] SPARQL CONSTRUCT results truncated

On Thu, Feb 11, 2016 at 5:53 PM Gerard Meijssen <gerard.meijs...@gmail.com> wrote:
Hoi,
Markus when you read my reply on the original question you will see that my 
approach is different. The first thing that I pointed out was that a technical 
assumption has little to do with what people need. I indicated that when this 
is the approach, the answer is fix it. The notion that a large number of 
returns is outrageous is not of this time.

My approach was one where I even offered a possible solution, a crutch.

The approach Daniel took was to make me look ridiculous. His choice, not mine. 
I stayed polite and told him that his answers are not my answers and why. The 
point that I make is that Wikidata is a service. It will increasingly be used 
for the most outrageous queries and people will expect it to work because why 
else do we put all this data in there. Why else is this the data hub for 
Wikipedia. Why else

Do appreciate that the aim of the WMF is to share in the sum of all available 
knowledge. When the current technology is what we have to make do with, fine 
for now. Say so, but do not ridicule me for saying that it is not good enough, 
it is not now and it will certainly not be in the future...
Thanks,
   GerardM

Gerard, it all boils down to using the right tool for the job. Nothing more - 
nothing less. Let's get back to making Wikidata rock.


Cheers
Lydia
--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under
number 23855 Nz. Recognized as charitable by the Finanzamt für
Körperschaften I Berlin, tax number 27/029/42207.
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Gerard Meijssen
Hoi,
Markus when you read my reply on the original question you will see that my
approach is different. The first thing that I pointed out was that a
technical assumption has little to do with what people need. I indicated
that when this is the approach, the answer is fix it. The notion that a
large number of returns is outrageous is not of this time.

My approach was one where I even offered a possible solution, a crutch.

The approach Daniel took was to make me look ridiculous. His choice, not
mine. I stayed polite and told him that his answers are not my answers and
why. The point that I make is that Wikidata is a service. It will
increasingly be used for the most outrageous queries and people will expect
it to work because why else do we put all this data in there. Why else is
this the data hub for Wikipedia. Why else

Do appreciate that the aim of the WMF is to share in the sum of all
available knowledge. When the current technology is what we have to make do
with, fine for now. Say so, but do not ridicule me for saying that it is
not good enough, it is not now and it will certainly not be in the future...
Thanks,
   GerardM

On 11 February 2016 at 15:25, Markus Krötzsch  wrote:

> On 11.02.2016 15:01, Gerard Meijssen wrote:
>
>> Hoi,
>> What I hear is that the intentions were wrong in that you did not
>> anticipate people to get actual meaningful requests out of it.
>>
>> When you state "we have two choices", you imply that it is my choice as
>> well. It is not. The answer that I am looking for is yes, it does not
>> function as we would like, we are working on it and in the mean time we
>> will ensure that toolkit is available on Labs for the more complex
>> queries.
>>
>> Wikidata is a service and the service is in need of being better.
>>
>
> Gerard, do you realise how far away from technical reality your wishes
> are? We are far ahead of the state of the art in what we already have for
> Wikidata: two powerful live query services + a free toolkit for batch
> analyses + several Web APIs for live lookups. I know of no site of this
> scale that is anywhere near this in terms of functionality. You can always
> ask for more, but you should be a bit reasonable too, or people will just
> ignore you.
>
> Markus
>
>
> On 11 February 2016 at 12:32, Daniel Kinzler
> <daniel.kinz...@wikimedia.de> wrote:
>>
>> Am 11.02.2016 um 10:17 schrieb Gerard Meijssen:
>> > Your response is technical and seriously, query is a tool and it
>> should function
>> > for people. When the tool is not good enough fix it.
>>
>> What I hear: "A hammer is a tool, it should work for people. Tearing
>> down a
>> building with it takes forever, so fix the hammer!"
>>
>> The query service was never intended to run arbitrarily large or
>> complex
>> queries. Sure, would be nice, but that also means committing an
>> arbitrary amount
>> of resources to a single request. We don't have arbitrary amounts of
>> resources.
>>
>> We basically have two choices: either we offer a limited interface
>> that only
>> allows for a narrow range of queries to be run at all. Or we offer a
>> very
>> general interface that can run arbitrary queries, but we impose
>> limits on time
>> and memory consumption. I would actually prefer the first option,
>> because it's
>> more predictable, and doesn't get people's hopes up too far. What do
>> you think?
>>
>> Oh, and +1 for making it easy to use WDT on labs.
>>
>> --
>> Daniel Kinzler
>> Senior Software Developer
>>
>> Wikimedia Deutschland
>> Gesellschaft zur Förderung Freien Wissens e.V.
>>
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org 
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>>
>>
>>
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Lydia Pintscher
On Thu, Feb 11, 2016 at 5:53 PM Gerard Meijssen 
wrote:

> Hoi,
> Markus when you read my reply on the original question you will see that
> my approach is different. The first thing that I pointed out was that a
> technical assumption has little to do with what people need. I indicated
> that when this is the approach, the answer is fix it. The notion that a
> large number of returns is outrageous is not of this time.
>
> My approach was one where I even offered a possible solution, a crutch.
>
> The approach Daniel took was to make me look ridiculous. His choice, not
> mine. I stayed polite and told him that his answers are not my answers and
> why. The point that I make is that Wikidata is a service. It will
> increasingly be used for the most outrageous queries and people will expect
> it to work because why else do we put all this data in there. Why else is
> this the data hub for Wikipedia. Why else
>
> Do appreciate that the aim of the WMF is to share in the sum of all
> available knowledge. When the current technology is what we have to make do
> with, fine for now. Say so, but do not ridicule me for saying that it is
> not good enough, it is not now and it will certainly not be in the future...
> Thanks,
>GerardM
>

Gerard, it all boils down to using the right tool for the job. Nothing more
- nothing less. Let's get back to making Wikidata rock.


Cheers
Lydia
-- 
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under
number 23855 Nz. Recognized as charitable by the Finanzamt für
Körperschaften I Berlin, tax number 27/029/42207.
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Neubert, Joachim
Hi Markus,

thank you very much, your code will be extremely helpful for solving my current 
need. And though not a Java programmer, I may even be able to adjust it to
similar queries.

On the other hand, it's some steps away from the promises of Linked Data and
SPARQL endpoints. I greatly value the Wikidata endpoint for having the
current data, so if I add some piece of data in the user interface, I can query for it
immediately afterwards, and I can do this in a uniform way via standard SPARQL 
queries. I can imagine how hard that was to achieve.

And I completely agree that it's impossible to build a SPARQL endpoint which 
reliably serves arbitrarily complex queries for multiple users in finite time.
(This is the reason why all our public endpoints at http://zbw.eu/beta/sparql/
are labeled beta.) And you can easily get to a point where some ill-behaved
query is run over and over again by some stupid program, and you have to be 
quite restrictive to keep your service up.

So an "unstable" endpoint with wider limits, as you suggested in your later 
mail, could be a great solution for this. In both instances, it would be nice 
if the policy and the actual limits could be documented, so users would know 
what to expect (and how to act appropriately as good citizens).

Thanks again for the code, and for taking up the discussion.

Cheers, Joachim

-Original Message-
From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On behalf of
Markus Krötzsch
Sent: Thursday, 11 February 2016 15:05
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] SPARQL CONSTRUCT results truncated

Hi Joachim,

Here is a short program that solves your problem:

https://github.com/Wikidata/Wikidata-Toolkit-Examples/blob/master/src/examples/DataExtractionProcessor.java

It is in Java, so, you need that (and Maven) to run it, but that's the only 
technical challenge ;-). You can run the program in various ways as described 
in the README:

https://github.com/Wikidata/Wikidata-Toolkit-Examples

The program I wrote puts everything into a CSV file, but you can of course also 
write RDF triples if you prefer this, or any other format you wish. The code 
should be easy to modify.

On a first run, the tool will download the current Wikidata dump, which takes a 
while (it's about 6G), but after this you can find and serialise all results in 
less than half an hour (for a processing rate of around 10K items/second). A 
regular laptop is enough to run it.

Cheers,

Markus


On 11.02.2016 01:34, Stas Malyshev wrote:
> Hi!
>
>> I try to extract all mappings from wikidata to the GND authority 
>> file, along with the according wikipedia pages, expecting roughly 
>> 500,000 to 1m triples as result.
>
> As a starting note, I don't think extracting 1M triples may be the 
> best way to use query service. If you need to do processing that 
> returns such big result sets - in millions - maybe processing the dump 
> - e.g. with wikidata toolkit at 
> https://github.com/Wikidata/Wikidata-Toolkit - would be better idea?
>
>> However, with various calls, I get much less triples (about 2,000 to 
>> 10,000). The output seems to be truncated in the middle of a statement, e.g.
>
> It may be some kind of timeout because of the quantity of the data 
> being sent. How long does such request take?
>


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Markus Krötzsch

Hi Joachim,

I think the problem is not to answer your query in 5min or so (Wikidata 
Toolkit on my laptop takes 27min without a database, by simply parsing 
the whole data file, so any database that already has the data should be 
much faster). The bigger issue is that you would have to configure the 
site to run for 5min before timeout. This would mean that other queries 
that never terminate (because they are really hard) also can run for at 
least this time. It seems that this could easily cause the service to 
break down.


Maybe one could have an "unstable" service on a separate machine that 
does the same as WDQS but with a much more liberal timeout and less 
availability (if it's overloaded a lot, it will just be down more often, 
but you would know when you use it that this is the deal).


Cheers,

Markus


On 11.02.2016 15:54, Neubert, Joachim wrote:

Hi Stas,

Thanks for your answer. You asked how long the query runs: 8.21 sec (having 
processed 6443 triples), in an example invocation. If roughly linear, that 
could mean 800-1500 sec for the whole set. However, I would expect a clearly 
shorter runtime: I routinely use queries of similar complexity and result sizes 
on ZBW's public endpoints. One arbitrarily selected query which extracts data
from GND runs for less than two minutes to produce 1.2m triples.

Given the size of Wikidata, I wouldn't consider such a use abusive. Of course,
if you have lots of competing queries and resources are limited, it is 
completely legitimate to implement some policy which formulates limits and 
enforces them technically (throttle down long-running queries, or limit the
number of produced triples, or the execution time, or whatever seems reasonable 
and can be implemented).

Anyway, in this case (truncation in the middle of a statement), it looks much 
more like some technical bug (or an obscure timeout somewhere down the way). 
The execution time and the result size varies widely:

5.44s empty result
8.60s 2090 triples
5.44s empty result
22.70s 27352 triples

Can you reproduce this kind of results with the given query, or with other 
supposedly longer-running queries?

Thanks again for looking into this.

Cheers, Joachim

PS. I plan to set up my own Wikidata SPARQL endpoint to do more complex things, but that
depends on a new machine which will be available in some months. For now, I'd just like to
know which of "our" persons (economists and the like) have wikipedia pages.

PPS. From my side, I would have much preferred to build a query which asks for
exactly the GND IDs I'm interested in (about 430.000 out of millions of GNDs). 
This would have led to a much smaller result - but I cannot squeeze that query 
into a GET request ...


-Original Message-
From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On behalf of Stas
Malyshev
Sent: Thursday, 11 February 2016 01:35
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] SPARQL CONSTRUCT results truncated

Hi!


I try to extract all mappings from wikidata to the GND authority file,
along with the according wikipedia pages, expecting roughly 500,000 to
1m triples as result.


As a starting note, I don't think extracting 1M triples may be the best way to 
use query service. If you need to do processing that returns such big result 
sets - in millions - maybe processing the dump - e.g. with wikidata toolkit at 
https://github.com/Wikidata/Wikidata-Toolkit - would be better idea?


However, with various calls, I get much less triples (about 2,000 to
10,000). The output seems to be truncated in the middle of a statement, e.g.


It may be some kind of timeout because of the quantity of the data being sent. 
How long does such request take?

--
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Neubert, Joachim
Hi Stas,

Thanks for your answer. You asked how long the query runs: 8.21 sec (having 
processed 6443 triples), in an example invocation. If roughly linear, that 
could mean 800-1500 sec for the whole set. However, I would expect a clearly 
shorter runtime: I routinely use queries of similar complexity and result sizes 
on ZBW's public endpoints. One arbitrarily selected query which extracts data
from GND runs for less than two minutes to produce 1.2m triples.

Given the size of Wikidata, I wouldn't consider such a use abusive. Of course,
if you have lots of competing queries and resources are limited, it is 
completely legitimate to implement some policy which formulates limits and 
enforces them technically (throttle down long-running queries, or limit the
number of produced triples, or the execution time, or whatever seems reasonable 
and can be implemented). 

Anyway, in this case (truncation in the middle of a statement), it looks much 
more like some technical bug (or an obscure timeout somewhere down the way). 
The execution time and the result size varies widely:

5.44s empty result
8.60s 2090 triples
5.44s empty result
22.70s 27352 triples

Can you reproduce this kind of results with the given query, or with other 
supposedly longer-running queries?

Thanks again for looking into this.

Cheers, Joachim

PS. I plan to set up my own Wikidata SPARQL endpoint to do more complex things,
but that depends on a new machine which will be available in some months. For
now, I'd just like to know which of "our" persons (economists and the like)
have wikipedia pages.

PPS. From my side, I would have much preferred to build a query which asks for
exactly the GND IDs I'm interested in (about 430.000 out of millions of GNDs). 
This would have led to a much smaller result - but I cannot squeeze that query 
into a GET request ...
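For illustration, a sketch of the per-ID variant I have in mind, with placeholder GND IDs standing in for the real list (such a query would presumably have to be sent via POST, or split into batches):

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>

SELECT ?gndId ?wd ?sitelink
WHERE {
  # placeholder IDs for illustration; the real list would contain ~430.000 entries
  VALUES ?gndId { "123456789" "987654321" }
  ?wd wdt:P227 ?gndId .
  ?sitelink schema:about ?wd .
  filter (contains(str(?sitelink), 'wikipedia'))
}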


-Original Message-
From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On behalf of Stas
Malyshev
Sent: Thursday, 11 February 2016 01:35
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] SPARQL CONSTRUCT results truncated

Hi!

> I try to extract all mappings from wikidata to the GND authority file, 
> along with the according wikipedia pages, expecting roughly 500,000 to 
> 1m triples as result.

As a starting note, I don't think extracting 1M triples may be the best way to 
use query service. If you need to do processing that returns such big result 
sets - in millions - maybe processing the dump - e.g. with wikidata toolkit at 
https://github.com/Wikidata/Wikidata-Toolkit - would be better idea?

> However, with various calls, I get much less triples (about 2,000 to 
> 10,000). The output seems to be truncated in the middle of a statement, e.g.

It may be some kind of timeout because of the quantity of the data being sent. 
How long does such request take?

--
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Markus Krötzsch

On 11.02.2016 15:01, Gerard Meijssen wrote:

Hoi,
What I hear is that the intentions were wrong in that you did not
anticipate people to get actual meaningful requests out of it.

When you state "we have two choices", you imply that it is my choice as
well. It is not. The answer that I am looking for is yes, it does not
function as we would like, we are working on it and in the mean time we
will ensure that toolkit is available on Labs for the more complex queries.

Wikidata is a service and the service is in need of being better.


Gerard, do you realise how far away from technical reality your wishes 
are? We are far ahead of the state of the art in what we already have 
for Wikidata: two powerful live query services + a free toolkit for 
batch analyses + several Web APIs for live lookups. I know of no site of 
this scale that is anywhere near this in terms of functionality. You can 
always ask for more, but you should be a bit reasonable too, or people 
will just ignore you.


Markus



On 11 February 2016 at 12:32, Daniel Kinzler
<daniel.kinz...@wikimedia.de> wrote:

Am 11.02.2016 um 10:17 schrieb Gerard Meijssen:
> Your response is technical and seriously, query is a tool and it should 
function
> for people. When the tool is not good enough fix it.

What I hear: "A hammer is a tool, it should work for people. Tearing
down a
building with it takes forever, so fix the hammer!"

The query service was never intended to run arbitrarily large or complex
queries. Sure, would be nice, but that also means committing an
arbitrary amount
of resources to a single request. We don't have arbitrary amounts of
resources.

We basically have two choices: either we offer a limited interface
that only
allows for a narrow range of queries to be run at all. Or we offer a
very
general interface that can run arbitrary queries, but we impose
limits on time
and memory consumption. I would actually prefer the first option,
because it's
more predictable, and doesn't get people's hopes up too far. What do
you think?

Oh, and +1 for making it easy to use WDT on labs.

--
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata mailing list
Wikidata@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Markus Krötzsch

Hi Joachim,

Here is a short program that solves your problem:

https://github.com/Wikidata/Wikidata-Toolkit-Examples/blob/master/src/examples/DataExtractionProcessor.java

It is in Java, so, you need that (and Maven) to run it, but that's the 
only technical challenge ;-). You can run the program in various ways as 
described in the README:


https://github.com/Wikidata/Wikidata-Toolkit-Examples

The program I wrote puts everything into a CSV file, but you can of 
course also write RDF triples if you prefer this, or any other format 
you wish. The code should be easy to modify.


On a first run, the tool will download the current Wikidata dump, which 
takes a while (it's about 6G), but after this you can find and serialise 
all results in less than half an hour (for a processing rate of around 
10K items/second). A regular laptop is enough to run it.


Cheers,

Markus


On 11.02.2016 01:34, Stas Malyshev wrote:

Hi!


I try to extract all mappings from wikidata to the GND authority file,
along with the according wikipedia pages, expecting roughly 500,000 to
1m triples as result.


As a starting note, I don't think extracting 1M triples may be the best
way to use query service. If you need to do processing that returns such
big result sets - in millions - maybe processing the dump - e.g. with
wikidata toolkit at https://github.com/Wikidata/Wikidata-Toolkit - would
be better idea?


However, with various calls, I get much less triples (about 2,000 to
10,000). The output seems to be truncated in the middle of a statement, e.g.


It may be some kind of timeout because of the quantity of the data being
sent. How long does such request take?




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Gerard Meijssen
Hoi,
What I hear is that the intentions were wrong in that you did not
anticipate people to get actual meaningful requests out of it.

When you state "we have two choices", you imply that it is my choice as
well. It is not. The answer that I am looking for is yes, it does not
function as we would like, we are working on it and in the mean time we
will ensure that toolkit is available on Labs for the more complex queries.

Wikidata is a service and the service is in need of being better.
Thanks,
  GerardM

On 11 February 2016 at 12:32, Daniel Kinzler 
wrote:

> Am 11.02.2016 um 10:17 schrieb Gerard Meijssen:
> > Your response is technical and seriously, query is a tool and it should
> function
> > for people. When the tool is not good enough fix it.
>
> What I hear: "A hammer is a tool, it should work for people. Tearing down a
> building with it takes forever, so fix the hammer!"
>
> The query service was never intended to run arbitrarily large or complex
> queries. Sure, would be nice, but that also means committing an arbitrary
> amount
> of resources to a single request. We don't have arbitrary amounts of
> resources.
>
> We basically have two choices: either we offer a limited interface that
> only
> allows for a narrow range of queries to be run at all. Or we offer a very
> general interface that can run arbitrary queries, but we impose limits on
> time
> and memory consumption. I would actually prefer the first option, because
> it's
> more predictable, and doesn't get people's hopes up too far. What do you
> think?
>
> Oh, and +1 for making it easy to use WDT on labs.
>
> --
> Daniel Kinzler
> Senior Software Developer
>
> Wikimedia Deutschland
> Gesellschaft zur Förderung Freien Wissens e.V.
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Daniel Kinzler
Am 11.02.2016 um 10:17 schrieb Gerard Meijssen:
> Your response is technical and seriously, query is a tool and it should 
> function
> for people. When the tool is not good enough fix it.

What I hear: "A hammer is a tool, it should work for people. Tearing down a
building with it takes forever, so fix the hammer!"

The query service was never intended to run arbitrarily large or complex
queries. Sure, would be nice, but that also means committing an arbitrary amount
of resources to a single request. We don't have arbitrary amounts of resources.

We basically have two choices: either we offer a limited interface that only
allows for a narrow range of queries to be run at all. Or we offer a very
general interface that can run arbitrary queries, but we impose limits on time
and memory consumption. I would actually prefer the first option, because it's
more predictable, and doesn't get people's hopes up too far. What do you think?

Oh, and +1 for making it easy to use WDT on labs.

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Gerard Meijssen
Hoi,
Two things: not everybody has the capacity to run an instance of the
toolkit. And when there are reasons for needing the toolkit other than
"query does not cope", it makes sense to have instances of the toolkit on Labs
where queries like this can be run.

Your response is technical and seriously, query is a tool and it should
function for people. When the tool is not good enough fix it. You cannot
expect people to engage in the toolkit because most queries are community
incidentals and not part of a scientific endeavour.
Thanks,
  GerardM

On 11 February 2016 at 01:34, Stas Malyshev  wrote:

> Hi!
>
> > I try to extract all mappings from wikidata to the GND authority file,
> > along with the according wikipedia pages, expecting roughly 500,000 to
> > 1m triples as result.
>
> As a starting note, I don't think extracting 1M triples may be the best
> way to use query service. If you need to do processing that returns such
> big result sets - in millions - maybe processing the dump - e.g. with
> wikidata toolkit at https://github.com/Wikidata/Wikidata-Toolkit - would
> be better idea?
>
> > However, with various calls, I get much less triples (about 2,000 to
> > 10,000). The output seems to be truncated in the middle of a statement,
> e.g.
>
> It may be some kind of timeout because of the quantity of the data being
> sent. How long does such request take?
>
> --
> Stas Malyshev
> smalys...@wikimedia.org
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-10 Thread Stas Malyshev
Hi!

> I try to extract all mappings from wikidata to the GND authority file,
> along with the according wikipedia pages, expecting roughly 500,000 to
> 1m triples as result.

As a starting note, I don't think extracting 1M triples may be the best
way to use query service. If you need to do processing that returns such
big result sets - in millions - maybe processing the dump - e.g. with
wikidata toolkit at https://github.com/Wikidata/Wikidata-Toolkit - would
be better idea?

> However, with various calls, I get much less triples (about 2,000 to
> 10,000). The output seems to be truncated in the middle of a statement, e.g.

It may be some kind of timeout because of the quantity of the data being
sent. How long does such request take?

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] SPARQL CONSTRUCT results truncated

2016-02-08 Thread Neubert, Joachim
I try to extract all mappings from wikidata to the GND authority file, along 
with the according wikipedia pages, expecting roughly 500,000 to 1m triples as 
result.

However, with various calls, I get much less triples (about 2,000 to 10,000). 
The output seems to be truncated in the middle of a statement, e.g.

...
[truncated output sample; the URIs of the example triples were stripped in the archived copy]

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX schema: <http://schema.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
#
construct {
  ?gnd skos:exactMatch ?wd ;
schema:about ?sitelink .
}
#select ?gndId ?wd ?wdLabel ?sitelink ?gnd
where {
  # get all wikidata items and labels linked to GND
  ?wd wdt:P227 ?gndId ;
  rdfs:label ?wdLabel ;
  # restrict to
  wdt:P31 wd:Q5  . # instance of human
  # get site links (only from de/en wikipedia sites)
  ?sitelink schema:about ?wd ;
schema:inLanguage ?language .
  filter (contains(str(?sitelink), 'wikipedia'))
  filter (lang(?wdLabel) = ?language && ?language in ('en', 'de'))
  bind(uri(concat('http://d-nb.info/gnd/', ?gndId)) as ?gnd)
}
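To get a rough idea of the expected result size before running the CONSTRUCT, a count along these lines could be run first (a sketch reusing the patterns above; wd:/wdt: are assumed to be the usual Wikidata namespaces):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT (count(*) as ?humanGndMappings)
WHERE {
  ?wd wdt:P227 ?gndId ;   # GND identifier
      wdt:P31 wd:Q5 .     # instance of: human
}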
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata