On 5/30/13 10:42 AM, Andrea Splendiani wrote:
Hi,

good.

A different http header would be good.
The problem is that, in a typical application (or at least in some of them), you 
don't really know which server sits behind the endpoint, so server-specific 
(non-standard HTTP) headers may not be known.
More generally, I think the wider public won't be so precise: either you get 
results or you don't. Or perhaps there is some very visible signal that the results 
are partial (to the point that you need to change something in your code). Otherwise 
these things easily go undetected.
Also, many times things are wrapped by libraries, so you just get the data out 
of a query without knowing the details of what happened.

I'll try the LOD cloud cache, thanks for the link.

Overall, I think that for many users it would be much safer to get an X0X 
"query exceeding resources" error than some partial results. Would it be possible 
to configure the preferred behavior at query time?
Maybe.

At this point, we are working on making more use of standard HTTP headers regarding partial results. After that, we might consider a SPARQL pragma for controlling results behavior.

Kingsley

best,
Andrea

On 30 May 2013, at 15:33, Kingsley Idehen <kide...@openlinksw.com> 
wrote:

On 5/30/13 9:13 AM, Andrea Splendiani wrote:
Hi,

let me get back to this thread for two reasons.
1) I was wondering whether the report on DBpedia queries cited below has 
already been published.
2) I have recently tried to use DBpedia for some simple computation and I have 
a problem. Basically, a query for all cities whose population is larger than 
that of the country they are in returns a random number of results. I suspect 
this is due to hitting some internal computation load limits, and there is not 
much I can do with LIMIT clauses, I think, as the results are no more than 20 or so.

Now, I discovered this by chance. If this is due to some limit, I would much 
prefer an error message (query too expensive) over partial results.
Is there a way to detect that these results are partial?
Of course, via the response headers of the SPARQL query:

1. X-SQL-State:
2. X-SQL-Message:

We are also looking at using HTTP a bit more here, i.e., not returning 200 OK if 
the result set is partial.
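For example, a client could check those headers along the following lines (a minimal sketch in Python, assuming the `requests` library; the DBpedia class and property names in the query are illustrative, while X-SQL-State and X-SQL-Message are the Virtuoso headers mentioned above):

import requests

# Andrea's example: cities whose recorded population exceeds that of
# their country. Property names are assumptions about the DBpedia ontology.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?city ?cityPop ?country ?countryPop WHERE {
  ?city a dbo:City ;
        dbo:populationTotal ?cityPop ;
        dbo:country ?country .
  ?country dbo:populationTotal ?countryPop .
  FILTER (?cityPop > ?countryPop)
}
"""

r = requests.get(
    "http://dbpedia.org/sparql",
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)

# Inspect the Virtuoso headers before trusting the result set.
state = r.headers.get("X-SQL-State")
message = r.headers.get("X-SQL-Message")
if state:
    print("partial or failed result:", state, "--", message)
else:
    print(len(r.json()["results"]["bindings"]), "bindings, no error reported")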

Otherwise, there is a full range of use cases that becomes problematic. I know 
DBpedia is a best-effort free resource, so I understand the need for limits, 
and unpredictable results are good enough for many demos. But being unable to 
tell whether a result is complete or not is a big constraint in many applications.
Also remember, as I've indicated repeatedly, you can get the same data from the 
LOD cloud cache instance which is supported by more powerful computing 
resources:

1. http://lod.openlinksw.com/sparql -- the fastest instance since it's a V7 
cluster (even though it hosts 50 billion+ triples)
2. http://dbpedia-live.openlinksw.com -- still faster than dbpedia.org since 
it's using V7.


Kingsley

best,
Andrea


On 19 April 2013, at 02:54, Kingsley Idehen <kide...@openlinksw.com> 
wrote:

On 4/18/13 7:06 PM, Andrea Splendiani wrote:
On 18 April 2013, at 16:04, Kingsley Idehen <kide...@openlinksw.com> 
wrote:

On 4/18/13 9:23 AM, Andrea Splendiani wrote:
Hi,

I think that some caching with a minimum of query rewriting would get rid of 
90% of the SELECT ?s ?p ?o WHERE { ?s ?p ?o } queries.
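A rough sketch of that normalise-and-cache idea (Python; the helper names and TTL value are illustrative, not any particular endpoint's implementation):

import hashlib
import re
import time

_cache = {}       # normalised-query hash -> (expiry timestamp, result)
CACHE_TTL = 300   # seconds; arbitrary

def cache_key(query):
    # Collapse whitespace and case so trivially different "dump" queries
    # (e.g. SELECT ?s ?p ?o WHERE { ?s ?p ?o }) share one key.
    normalised = re.sub(r"\s+", " ", query).strip().lower()
    return hashlib.sha256(normalised.encode()).hexdigest()

def cached_execute(query, execute):
    key = cache_key(query)
    hit = _cache.get(key)
    if hit and hit[0] > time.time():
        return hit[1]                       # serve from cache
    result = execute(query)                 # delegate to the real endpoint
    _cache[key] = (time.time() + CACHE_TTL, result)
    return result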
Sorta.
Client queries are inherently unpredictable. That's always been the case, and 
that predates SPARQL. These issues also exist in the SQL RDBMS realm, which is 
why you don't have SQL endpoints delivering what SPARQL endpoints provide.
I know, but I suspect that these days a lot of these "intensive" queries are 
exploratory, just to check what is in the dataset, and they may end up being very 
similar in structure.
Note, we have logs and recordings of queries that hit many of our public 
endpoints. For instance, we are preparing a report on DBpedia that will 
shed light on the types and complexity of queries that hit the DBpedia 
endpoint.

Jerven: can you report on your experience with this? How many of the problematic 
queries are not really targeted, but more generic?

 From a user perspective, I would rather have a clear result code upfront 
telling me "your query is too heavy", "not enough resources", and so on, than 
partial results plus extra codes.
Yes, and you get that in some solutions, e.g., what we provide. Basically, our 
server (subject to capacity) will tell you immediately that your query exceeds 
the query cost limits (this is different from timeout limits). The 
aforementioned feature was critical to getting the DBpedia SPARQL endpoint 
going, years ago.
Can you make a precise estimate of the query cost, or do you rely on some 
heuristics?
We have a query cost optimizer. It handles native and distributed queries. Of 
course, query optimization is a universe unto itself, but over the years we've 
continually applied what we've learned about queries to its continued 
evolution.

I won't make much use of partial results anyway... so it's time wasted on both sides.
Not in a world where you have a public endpoint and zero control over the 
queries issued by clients.
Not in a world where you have to provide faceted navigation over entity relations as part of a 
"precision find" style service atop RDF-based Linked Data, etc.
I mean, partial results are OK if I have control over which part I get... a 
system-dependent random subset of the results is not very useful (not even for 
statistics!)
You have control with our system because you are basically given the ability to 
retry using a heuristic that increases the total processing time per retry. At 
the same time, while you are making up your mind about whether to retry or not, 
there are background activities running in relation to your last query. Remember, 
query processing comprises many parts:

1. parsing
2. costing
3. solution
4. actual retrieval.

Many see 1-4 as a monolith. Not so when dealing with DBMS processing. Again, 
while this may seem novel, it's quite old in the RDBMS realm.
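From the client side, that retry pattern might look roughly like this (Python sketch; the `timeout` request parameter, in milliseconds, is the execution-budget knob exposed on the Virtuoso /sparql form, and the budget values here are arbitrary):

import requests

def query_with_retries(endpoint, query, budgets_ms=(5000, 15000, 45000)):
    # Re-issue the same query with a growing execution budget until the
    # partial-result header clears; earlier attempts warm the server side.
    for budget in budgets_ms:
        r = requests.get(
            endpoint,
            params={"query": query, "timeout": budget},
            headers={"Accept": "application/sparql-results+json"},
        )
        if r.ok and not r.headers.get("X-SQL-State"):
            return r.json()      # complete result set
    raise RuntimeError("query still incomplete after all retries")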

One empirical solution could be to assign a quota per requesting IP (or some other 
form of identification).
That's but one coarse-grained factor. You need to be able to associate a user 
agent (human or machine) profile with whatever quality of service you seek to 
scope to said profile. Again, this is the kind of thing we offer by leveraging 
WebID, inference, and RDF right inside the core DBMS engine.
I agree. The finer the better. The IP-based approach is perhaps relatively easy 
to implement if not much is provided by the system.
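A coarse sketch of such a per-IP quota (Python; the window length, quota size, and notion of "work units" are illustrative placeholders):

import time
from collections import defaultdict

WINDOW_SECONDS = 3600        # one-hour accounting window
QUOTA_PER_WINDOW = 1000      # "work units" allowed per IP per window

_usage = defaultdict(lambda: [0.0, 0])   # ip -> [window start, units used]

def charge(ip, units):
    """Return True if the request may proceed, False if the quota is spent."""
    window_start, used = _usage[ip]
    now = time.time()
    if now - window_start > WINDOW_SECONDS:
        window_start, used = now, 0      # start a fresh window
    allowed = used + units <= QUOTA_PER_WINDOW
    if allowed:
        used += units
    _usage[ip] = [window_start, used]
    return allowed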

  Then one could restrict the total amount of resources per time frame, possibly 
with smart policies.
"Smart Policies" are the kind of thing you produce by exploiting the kind or 
entity relationship semantics baked into RDF based Linked Data. Basically, OWL (which is 
all about describing entity types and relation types semantics) serves this purpose very 
well. We certainly put it to use in our data access policy system which enables us to 
offer different capabilities and resource consumption to different human- or 
machine-agent profiles.
How do you use OWL for this?
We just have normal RDF graphs that describe data access policies. All you need 
is a Linked Data URI that denotes an agent (human or machine), an agent-profile 
oriented ontology (e.g., FOAF and the like) that defines entity types and 
relation types, a Web access control ontology, actual entity relations based on 
the aforementioned ontologies, and reasoning capability. Basically, you use RDF 
to do what it's actually designed to do.
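A minimal sketch of that policy-as-RDF idea (Python with rdflib; a Web Access Control style graph granting one WebID read access to one graph — the URIs are illustrative, not OpenLink's actual vocabulary or setup):

from rdflib import Graph

# Illustrative policy: Alice's WebID is granted Read access to one graph.
POLICY = """
@prefix acl:  <http://www.w3.org/ns/auth/acl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<https://example.org/people/alice#me> a foaf:Agent .

<https://example.org/policies#rule1> a acl:Authorization ;
    acl:agent    <https://example.org/people/alice#me> ;
    acl:mode     acl:Read ;
    acl:accessTo <https://example.org/graphs/dbpedia> .
"""

g = Graph()
g.parse(data=POLICY, format="turtle")

ASK = """
PREFIX acl: <http://www.w3.org/ns/auth/acl#>
ASK {
  ?rule acl:agent    <%s> ;
        acl:mode     acl:Read ;
        acl:accessTo <%s> .
}
"""

webid = "https://example.org/people/alice#me"
resource = "https://example.org/graphs/dbpedia"
allowed = bool(g.query(ASK % (webid, resource)))
print("read access granted" if allowed else "read access denied")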

It would also keep people from breaking big queries into many small ones...
You can't avoid bad or challenging queries. What you can do is look to 
fine-grained data access policies (semantically enhanced ACLs) to address this 
problem. This has always been the challenge, even before the emergence of the 
whole Semantic Web, RDF, etc. The same challenges also dogged the RDBMS realm. 
There is no dancing around this matter when dealing with traditional RDBMS or 
Web-oriented data access.
But I was wondering: why is resource consumption a problem for SPARQL endpoint providers, 
and not for other "providers" on the Web (say, YouTube, Google, ...)?
Is it the unpredictability of the resources needed?
Good question!

They hide the problem behind airport-sized data centers, and then they get you 
to foot the bill via your profile data, which ultimately compromises your 
privacy.
Isn't the same possible with SPARQL, in principle?
Sorta.

The ultimate question, in our opinion, is this: which setup provides the most 
cost-effective solution for Linked Data exploitation at Web scale? Basically, 
how many machines do you need to provide acceptable performance to a variety of 
user and agent profiles? We don't believe you need an airport-sized data center 
to pull that off. The LOD cloud cache is just a 12-node Virtuoso cluster split 
across 4 machines.

"OpenLink Virtuoso version 07.00.3202, on Linux (x86_64-unknown-linux-gnu), Cluster 
Edition(12 server processes, 756 GB total memory)"

That's at the footer of the system home page: http://lod.openlinksw.com. 
Likewise, we expose timing and resource utilization data per query processed 
via that interface.

Although, I guess if a company knew that you spy on their queries... 
there would be some issue (unlike for users and Facebook, for some reason).
It works like this:

1. We put out a public endpoint.
2. We allow the public certain kinds of access, e.g., per the DBpedia fair use 
policy we have in place.
3. We can provide special access to specific agents based on data access policy 
graphs scoped to their WebIDs or other types of identifiers.

Kingsley


best,
Andrea

This is a problem, and it's ultimately the basis for showcasing what RDF 
(an entity relationship based data model endowed with *explicit* rather than 
*implicit* human- and machine-readable entity relationship semantics) is 
actually all about.


Kingsley
best,
Andrea

On 18 April 2013, at 12:53, Jerven Bolleman 
<jerven.bolle...@isb-sib.ch> wrote:

Hi All,

Managing a public SPARQL endpoint has some difficulties in comparison to 
managing a simpler REST API.
Instead of counting API calls or external bandwidth use, we need to look at 
internal IO and CPU usage as well.

Many of the current public SPARQL endpoints limit all their users to queries of 
limited CPU time.
But this is not enough to really manage (mis)use of an endpoint. Also, the 
SPARQL API, being HTTP based,
suffers from the problem that we first send the status code and may only find 
out later that we can't
answer the query after all, leading to a "200 not OK" problem :(

What approaches can we come up with as a community to embed "resource limit 
exceeded" exceptions in the
SPARQL protocols? E.g., we could add an exception element to the SPARQL XML 
result format. [1]
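To make the idea concrete, such an extension might look like the sketch below, with a client tolerantly checking for it (Python; both the exception element and its namespace are hypothetical, not part of any current SPARQL specification):

import xml.etree.ElementTree as ET

# A SPARQL XML results document extended with a hypothetical
# <lim:exception> element carrying a resource-limit code.
DOCUMENT = """<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#"
        xmlns:lim="http://example.org/sparql-limits#">
  <head><variable name="s"/></head>
  <results>
    <result><binding name="s"><uri>http://example.org/a</uri></binding></result>
  </results>
  <lim:exception code="IO_QUOTA_EXCEEDED">
    Result set truncated: per-user IO quota exhausted.
  </lim:exception>
</sparql>
"""

LIM = "{http://example.org/sparql-limits#}"

root = ET.fromstring(DOCUMENT)
exception = root.find(LIM + "exception")
if exception is not None:
    print("partial results:", exception.get("code"), exception.text.strip())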

The current limits on CPU use are not enough to really avoid misuse, which is 
why I submitted a patch to
Sesame that allows limits on memory use as well. Limits on disk seeks 
or other IO counts may be needed by some as well.

But these are currently hard limits; what I really want is
"playground limits", i.e., you can use the swing as much as you want if you are 
the only child in the park.
Once there are more children, you have to share.

And how do we communicate this to our users? I.e., "this result set is incomplete 
because you exceeded your IO
quota; please break up your queries into smaller blocks."

For my day job, where I manage a 7.4 billion triple store with public access, 
some extra tools for managing users would be
great.

Last but not least, how can we avoid users needing to run SELECT 
(COUNT(DISTINCT ?s) AS ?sc) WHERE {?s ?p ?o} and friends?
For beta.sparql.uniprot.org I have been moving much of this information into 
the SPARQL endpoint's service description, but it's not a place
where people look for this information.
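A sketch of that alternative from the client side (Python; which VoID properties an endpoint actually publishes, and the exact endpoint URL and path, vary by deployment, so the ones below are assumptions):

import requests

# Ask the endpoint for dataset-level statistics published as VoID
# instead of running COUNT(DISTINCT ...) over the whole store.
QUERY = """
PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?dataset ?triples ?subjects WHERE {
  ?dataset void:triples ?triples .
  OPTIONAL { ?dataset void:distinctSubjects ?subjects }
}
"""

r = requests.get(
    "http://beta.sparql.uniprot.org/sparql",   # endpoint path assumed
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
for row in r.json()["results"]["bindings"]:
    print(row["dataset"]["value"],
          row["triples"]["value"],
          row.get("subjects", {}).get("value", "n/a"))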

Regards,
Jerven

[1] Yeah, the timing for these ideas is not great just after 1.1, but we can always 
start SPARQL 1.2 ;)



-------------------------------------------------------------------
Jerven Bolleman                        jerven.bolle...@isb-sib.ch
SIB Swiss Institute of Bioinformatics      Tel: +41 (0)22 379 58 85
CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
1211 Geneve 4,
Switzerland     www.isb-sib.ch - www.uniprot.org
Follow us at https://twitter.com/#!/uniprot
-------------------------------------------------------------------



--

Regards,

Kingsley Idehen 
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca handle: @kidehen
Google+ Profile: https://plus.google.com/112399767740508618350/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen




