There's a practical tradeoff between streaming and whole-result processing in the server.

Streaming can give the client lower latency to the first result, which can be a better user experience. HTTP status codes go in the header, so to return exactly the right answer the server needs to see the end of the query execution before the first bytes get sent. This is orthogonal to Mark Baker's point about returning a URL.

What would be a useful improvement is a formally defined marker in the results saying "that's all folks - time's up - incomplete results" (i.e. in the standard results forms). The client can then decide which mode gives the better user experience - to see all results before doing any app work, or to work incrementally.
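
For illustration, a minimal sketch of what a client could do with such a marker (purely hypothetical - the "incomplete" key below does not exist in today's standard results forms, and the endpoint URL is just a placeholder):

import json
import urllib.parse
import urllib.request

# Hypothetical sketch: suppose the standard SPARQL JSON results form carried a
# formally defined marker such as  "incomplete": true  next to "head" and "results".
# The marker name is invented here; no such field exists in the current standards.

def run_select(endpoint, query):
    params = urllib.parse.urlencode({"query": query})
    req = urllib.request.Request(endpoint + "?" + params,
                                 headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

doc = run_select("http://example.org/sparql",                 # placeholder endpoint
                 "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10")

if doc.get("incomplete"):   # hypothetical marker
    print("time's up - results incomplete; retry, warn the user, or work with what we have")
else:
    for row in doc["results"]["bindings"]:
        print(row["s"]["value"])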

        Andy

On 30/05/13 15:52, Kingsley Idehen wrote:
On 5/30/13 10:42 AM, Andrea Splendiani wrote:
Hi,

good.

A different http header would be good.
The problem is that, in a typical application (or at least some of
them), you don't really know which server is there, so server-specific
(non-standard HTTP) headers may not be known.
More generally, I think the wider public won't be so precise: either
you get results or not. Or perhaps you get some very visible indication
that results are partial (to the point that you need to change something in
your code). Otherwise things easily go undetected.
Also, many times things are wrapped by libraries, so you just get the
data out of a query without knowing the details of what happened.

I'll try the lod cloud cache, thanks for the link.

Overall, I think that for many users it would be much safer to get a X0X
"query exceeds resources" error than some partial results. Would it be possible
to configure the preferred behavior at query time?
Maybe.

At this point, we are working on making more use of standard HTTP headers
re. partial results. After that, we might consider a SPARQL pragma for
results behavior.

Kingsley

best,
Andrea

On 30 May 2013, at 15:33, Kingsley Idehen
<kide...@openlinksw.com> wrote:

On 5/30/13 9:13 AM, Andrea Splendiani wrote:
Hi,

let me get back to this thread for two reasons.
1) I was wondering whether the report on DBpedia queries cited below
has already been published.
2) I have recently tried to use DBpedia for some simple computation
and I have a problem. Basically, a query for all cities whose
population is larger than that of the countries they are in returns
a random number of results. I suspect this is due to hitting some
internal computation load limits, and there is not much I can do
with LIMITs, I think, as the results are no more than 20 or so.

Now, I discovered this by chance. If this is due to some limits, I
would much prefer an error message (query too expensive) to
partial results.
Is there a way to detect that these results are partial?
Of course, via the response headers of the SPARQL query:

1. X-SQL-State:
2. X-SQL-Message:

We are also looking at using HTTP a bit more here, i.e., not returning
200 OK if the result set is partial.
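
For illustration, a minimal sketch of checking those headers from a client, using the kind of cities-vs-countries query Andrea describes (the header names are the ones listed above; the values they carry are implementation-specific, and the DBpedia property names dbo:City, dbo:country and dbo:populationTotal are a best guess at the intended query):

import urllib.parse
import urllib.request

# Reconstruction of the query described earlier in the thread, plus a check of the
# X-SQL-State / X-SQL-Message response headers mentioned above.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?city ?cityPop ?countryPop WHERE {
  ?city a dbo:City ;
        dbo:country ?country ;
        dbo:populationTotal ?cityPop .
  ?country dbo:populationTotal ?countryPop .
  FILTER (?cityPop > ?countryPop)
}
"""

params = urllib.parse.urlencode({"query": QUERY})
req = urllib.request.Request("http://dbpedia.org/sparql?" + params,
                             headers={"Accept": "application/sparql-results+json"})
with urllib.request.urlopen(req) as resp:
    state = resp.headers.get("X-SQL-State")      # header names from this thread
    message = resp.headers.get("X-SQL-Message")
    body = resp.read()

if state or message:
    print("Possibly partial results:", state, message)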

Otherwise, there is a full range of use cases that become problematic.
I know DBpedia is a best-effort free resource, so I understand the
need for limits, and unpredictable results are good enough for many
demos. But being unable to tell whether a result is complete or not is a
big constraint in many applications.
Also remember, as I've indicated repeatedly, you can get the same
data from the LOD cloud cache instance which is supported by more
powerful computing resources:

1. http://lod.openlinksw.com/sparql -- the fastest instance, since it's
a V7 cluster (even though it hosts 50 Billion+ triples)
2. http://dbpedia-live.openlinksw.com -- still faster than
dbpedia.org, since it's using V7.


Kingsley

best,
Andrea


On 19 April 2013, at 02:54, Kingsley Idehen
<kide...@openlinksw.com> wrote:

On 4/18/13 7:06 PM, Andrea Splendiani wrote:
On 18 April 2013, at 16:04, Kingsley Idehen
<kide...@openlinksw.com> wrote:

On 4/18/13 9:23 AM, Andrea Splendiani wrote:
Hi,

I think that some caching with a minimum of query rewriting
would get rid of 90% of the SELECT ?s ?p ?o WHERE { ?s ?p ?o }
queries.
Sorta.
Client queries are inherently unpredictable. That's always been
the case, and that predates SPARQL. These issues also exist in
the SQL RDBMS realm, which is why you don't have SQL endpoints
delivering what SPARQL endpoints provide.
I know, but I suspect that these days a lot of these "intensive"
queries are exploratory, just to check what is in the dataset, and
may end up being very similar in structure.
Note, we have logs and recordings of queries that hit many of our
public endpoints. For instance, we are preparing a report on
DBpedia that will actually shed light on types and complexity of
queries that hit the DBpedia endpoint.

Jerven: can you report on your experience with this? How many of the
problematic queries are not really targeted, but more generic?

 From a user perspective, I would rather have a clear result
code upfront telling me "your query is too heavy, not enough
resources" and so on, than partial results plus extra codes.
Yes, and you get that in some solutions e.g., what we provide.
Basically, our server (subject to capacity) will tell you
immediately that your query exceeds the query cost limits (this
is different from timeout limits). The aforementioned feature was
critical to getting the DBpedia SPARQL endpoint going, years ago.
Can you make a precise estimation of the query cost, or do you
rely on some heuristics ?
We have a query cost optimizer. It handles native and distributed
queries. Of course, query optimization is a universe unto itself,
but over the years we've continually applied what we've learned
about queries to its continued evolution.

I won't do much with partial results anyway... so it's time wasted
on both sides.
Not in a world where you have a public endpoint and zero control
over the queries issued by clients.
Not in a world where you have to provide faceted navigation over
entity relations as part of a "precision find" style service atop
RDF-based Linked Data, etc.
I mean, partial results are OK if I have control over which part it
is... a system-dependent random subset of results is not very
useful (not even for statistics!)
You have control with our system because you are basically given
the ability to retry using a heuristic that increases the total
processing time per retry; at the same time, while you are making
up your mind about whether to retry or not, there are background
activities running in relation to your last query. Remember, query
processing comprises many parts:

1. parsing
2. costing
3. solution
4. actual retrieval.

Many see 1-4 as a monolith. Not so when dealing with DBMS
processing. Again, this isn't novel; it's quite old in the RDBMS realm.
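
As a rough sketch of that retry heuristic from the client side (assuming the endpoint honours a "timeout" request parameter in milliseconds, as Virtuoso's anytime-query feature does; other stores will differ, and the partial-result header names are the ones mentioned earlier in the thread):

import time
import urllib.parse
import urllib.request

def query_with_growing_budget(endpoint, query, budgets_ms=(5000, 15000, 45000)):
    """Re-issue the same query with a progressively larger execution budget
    until the partial-result headers disappear."""
    body = b""
    for budget in budgets_ms:
        params = urllib.parse.urlencode({"query": query, "timeout": budget})
        req = urllib.request.Request(endpoint + "?" + params,
                                     headers={"Accept": "application/sparql-results+json"})
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
            partial = resp.headers.get("X-SQL-State")   # header name from this thread
        if not partial:
            return body, True          # complete result within this budget
        time.sleep(1)                   # give background work a moment before retrying
    return body, False                  # still partial after the largest budget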

One empirical solution could be to assign a quota per requesting
IP (or other form of identification).
That's but one coarse-grained factor. You need to be able to
associate a user agent (human or machine) profile with whatever
quality of service you seek to scope to said profile. Again, this
is the kind of thing we offer by leveraging WebID, inference, and
RDF right inside the core DBMS engine.
I agree. The finer the better. The IP-based approach is perhaps
relatively easy to implement if not much is provided by the system.

  Then one could restrict the total amount of resources per
time-frame, possibly with smart policies.
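
A minimal sketch of that per-IP, per-time-frame idea on the server side (the window size and "resource units" are made up for illustration; a real system would meter CPU/IO rather than arbitrary units):

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 600          # time-frame over which usage is counted
BUDGET_UNITS = 1000           # arbitrary "resource units" an IP may spend per window

_usage = defaultdict(deque)   # ip -> deque of (timestamp, units_spent)

def charge(ip: str, units: int) -> bool:
    """Record `units` of work for `ip`; return False if the quota is exhausted."""
    now = time.monotonic()
    window = _usage[ip]
    while window and now - window[0][0] > WINDOW_SECONDS:
        window.popleft()                 # drop usage outside the time-frame
    spent = sum(u for _, u in window)
    if spent + units > BUDGET_UNITS:
        return False                     # caller should reject or queue the query
    window.append((now, units))
    return True
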
"Smart Policies" are the kind of thing you produce by exploiting
the kind or entity relationship semantics baked into RDF based
Linked Data. Basically, OWL (which is all about describing entity
types and relation types semantics) serves this purpose very
well. We certainly put it to use in our data access policy system
which enables us to offer different capabilities and resource
consumption to different human- or machine-agent profiles.
How do you use OWL for this ?
We just have normal RDF graphs that describe data access policies.
All you need is a Linked Data URI that denotes an agent (human or
machine), an agent-profile-oriented ontology (e.g., FOAF and the
like) that defines entity types and relation types, a Web access
control ontology, actual entity relations based on the
aforementioned ontologies, and reasoning capability. Basically,
using RDF to do what it's actually designed to do.
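
For illustration, a minimal sketch of such a policy graph using the W3C Web Access Control (acl:) and FOAF vocabularies (the WebID and endpoint URIs are placeholders; how a given engine enforces the policy is implementation-specific):

from rdflib import Graph

# Placeholder policy graph: <#alice> is a stand-in WebID and the endpoint URI is
# an example; a reasoner/ACL engine would evaluate these triples at query time.
policy = """
@prefix acl:  <http://www.w3.org/ns/auth/acl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<#alice> a foaf:Agent .

<#policy1> a acl:Authorization ;
    acl:agent    <#alice> ;
    acl:accessTo <http://example.org/sparql> ;
    acl:mode     acl:Read .
"""

g = Graph()
g.parse(data=policy, format="turtle", publicID="http://example.org/policy")
print(len(g), "policy triples loaded")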

It would also avoid people breaking big queries into many small
ones...
You can't avoid bad or challenging queries. What you can do is
look to fine-grained data access policies (semantically enhanced
ACLs) to address this problem. This has always been the
challenge, even before the emergence of the whole Semantic Web,
RDF, etc. The same challenges also dogged the RDBMS realm. There
is no dancing around this matter when dealing with traditional
RDBMS or Web-oriented data access.
But I was wondering: why is resource consumption a problem for
SPARQL endpoint providers, and not for other "providers" on the
web? (say, YouTube, Google, ...).
Is it the unpredictability of the resources needed?
Good question!

They hide the problem behind airport-sized data centers, and then
they get you to foot the bill via your profile data, which
ultimately compromises your privacy.
Isn't the same possible with sparql, in principle ?
Sorta.

The ultimate question, in our opinion, is this: which setup
provides the most cost-effective solution for Linked Data
exploitation at Web scale? Basically, how many machines do you
need to provide acceptable performance to a variety of user and
agent profiles? We don't believe you need an airport-sized data
center to pull that off. The LOD cloud cache is just a 12-node
Virtuoso cluster split across 4 machines.

"OpenLink Virtuoso version 07.00.3202, on Linux
(x86_64-unknown-linux-gnu), Cluster Edition(12 server processes,
756 GB total memory)"

That's at the footer of the system home page:
http://lod.openlinksw.com. Likewise, we expose timing and resource
utilization data per query processed via that interface.

Although, I guess, if a company knew that you spy on their
queries... there would be some issue (unlike for users and
Facebook, for some reason).
It works like this:

1. we put out a public endpoint
2. we allow the public certain kinds of access e.g., like the
DBpedia fair use policy we have in place
3. we can provide special access to specific agents based on data
access policy graphs scoped to their WebIDs or other types of
identifiers.

Kingsley


best,
Andrea

This is a problem, and it's ultimately the basis for showcasing
what RDF (an entity-relationship-based data model endowed with
*explicit* rather than *implicit* human- and machine-readable
entity relationship semantics) is actually all about.


Kingsley
best,
Andrea

On 18 April 2013, at 12:53, Jerven Bolleman
<jerven.bolle...@isb-sib.ch> wrote:

Hi All,

Managing a public SPARQL endpoint has some difficulties in
comparison to managing a simpler REST API.
Instead of counting API calls or external bandwidth use, we need
to look at internal IO and CPU usage as well.

Many of the current public SPARQL endpoints limit all their
users to queries of limited CPU time.
But this is not enough to really manage (mis)use of an
endpoint. Also, the SPARQL API, being HTTP-based,
suffers from the problem that we first send the status code and
may only find out later that we can't
answer the query after all, leading to a "200 not OK" problem :(

What approaches can we come up with as a community to embed
resource-limit-exceeded exceptions in the
SPARQL protocols? E.g. we could add an exception element to the
SPARQL XML result format. [1]
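
For illustration, a sketch of how a client might consume such an element if it existed (the element name is invented here; nothing like it is in the current SPARQL XML results format):

import xml.etree.ElementTree as ET

# Hypothetical sketch: suppose the SPARQL XML results format were extended with an
# <exception> element in the standard results namespace (the element is invented here).
SRX = "http://www.w3.org/2005/sparql-results#"

def check_for_exception(xml_bytes: bytes):
    root = ET.fromstring(xml_bytes)
    exc = root.find(f"{{{SRX}}}exception")   # would not be found in today's format
    if exc is not None:
        # e.g. "resource limit exceeded: IO quota" -- the server's explanation
        return exc.get("code"), (exc.text or "").strip()
    return None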

The current limits on CPU use are not enough to really avoid
misuse, which is why I submitted a patch to
Sesame that allows limits on memory use as well, although
limits on disk seeks or other IO counts may be needed by some
as well.

But these are currently hard limits; what I really want is
"playground limits", i.e. you can use the swing as much as you
want if you are the only child in the park.
Once there are more children, you have to share.

And how do we communicate this to our users? I.e. "this result
set is incomplete because you exceeded your IO
quota; please break up your queries into smaller blocks."

For my day job, where I manage a 7.4 billion triple store
with public access, some extra tools for managing users would be
great.

Last but not least, how can we avoid users needing to run
SELECT (COUNT(DISTINCT ?s) AS ?sc) WHERE {?s ?p ?o} and friends?
For beta.sparql.uniprot.org I have been moving much of this
information into the SPARQL endpoint description, but it's not a
place where people look for this information.
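
For illustration, a sketch of asking for precomputed VoID statistics instead of running the full COUNT (this assumes the endpoint exposes void:distinctSubjects / void:triples somewhere queryable, which varies per endpoint; the endpoint URL path below is also a guess):

import urllib.parse
import urllib.request

# Ask the endpoint's own dataset description for precomputed statistics rather than
# scanning the whole store with COUNT(DISTINCT ?s).
QUERY = """
PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?dataset ?subjects ?triples WHERE {
  ?dataset void:distinctSubjects ?subjects .
  OPTIONAL { ?dataset void:triples ?triples }
}
"""

params = urllib.parse.urlencode({"query": QUERY})
req = urllib.request.Request("http://beta.sparql.uniprot.org/sparql?" + params,  # path assumed
                             headers={"Accept": "application/sparql-results+json"})
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())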

Regards,
Jerven

[1] Yeah, the timing of these ideas is not great just after 1.1, but we
can always start SPARQL 1.2 ;)



-------------------------------------------------------------------
Jerven Bolleman                        jerven.bolle...@isb-sib.ch
SIB Swiss Institute of Bioinformatics  Tel: +41 (0)22 379 58 85
CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
1211 Geneve 4, Switzerland             www.isb-sib.ch - www.uniprot.org
Follow us at https://twitter.com/#!/uniprot
-------------------------------------------------------------------




--

Regards,

Kingsley Idehen
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca handle: @kidehen
Google+ Profile: https://plus.google.com/112399767740508618350/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen