LD-Access - A caching layer for SPARQL endpoints
------------------------------------------------

                 Key: JENA-140
                 URL: https://issues.apache.org/jira/browse/JENA-140
             Project: Jena
          Issue Type: Wish
         Environment: It will work with any SPARQL endpoint (local or remote, 
and across different implementations)
            Reporter: Paolo Castagna


(the content of this description is taken from Andy's message on jena-dev with 
minor editing: http://markmail.org/message/p5x334m7dy676oik)

The starting point is to have an intercepting SPARQL endpoint that fronts 
another endpoint.  The application uses the cache's URL for its SPARQL queries.

It's not supposed to be a big project - it might be a servlet-based SPARQL 
endpoint and/or yet another query engine implementation.

1/ Same query.

The most obvious win is repeating the same query.  It's sometimes surprising 
just how often a query is repeated, across users (e.g. the same starting point 
of an app) but also by the same user.  Having a cache close by is noticeably 
faster than going to a remote endpoint.

HTTP caching would also catch this, or it should - but apparently a GET with a 
query string isn't cached by Squid.
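
As a rough sketch of what the cache side could look like - using ARQ, with a 
made-up QueryCache class, an in-memory map, and no eviction - keying on the 
parsed and re-serialized query text means trivially different spellings of the 
same query share one entry:

    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.query.ResultSetFactory;
    import org.apache.jena.query.ResultSetRewindable;

    public class QueryCache {
        private final String endpoint;   // the remote SPARQL endpoint being fronted
        private final ConcurrentHashMap<String, ResultSetRewindable> cache =
                new ConcurrentHashMap<>();

        public QueryCache(String endpoint) { this.endpoint = endpoint; }

        public ResultSetRewindable select(String queryString) {
            // Parsing and re-serializing normalizes whitespace and comments.
            Query query = QueryFactory.create(queryString);
            String key = query.toString();
            ResultSetRewindable cached = cache.get(key);
            if (cached == null) {
                try (QueryExecution qexec =
                         QueryExecutionFactory.sparqlService(endpoint, query)) {
                    // Copy the streamed results so later hits can replay them.
                    cached = ResultSetFactory.copyResults(qexec.execSelect());
                }
                cache.put(key, cached);
            }
            // A real cache would hand each client its own copy; reset() on a
            // shared ResultSetRewindable is not safe under concurrency.
            cached.reset();
            return cached;
        }
    }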

2/ Convert formats

Given a proxy that is looking at the request, the response can be converted 
between formats - for example, from SPARQL XML results to SPARQL JSON results.
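
A minimal sketch with ARQ's result set reader/writer (class name made up):

    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.jena.query.ResultSet;
    import org.apache.jena.query.ResultSetFactory;
    import org.apache.jena.query.ResultSetFormatter;

    public class ResultConverter {
        // Read SPARQL XML results from the origin, write SPARQL JSON to the client.
        public static void xmlToJson(InputStream xmlFromOrigin, OutputStream jsonToClient) {
            ResultSet results = ResultSetFactory.fromXML(xmlFromOrigin);
            ResultSetFormatter.outputAsJSON(jsonToClient, results);
        }
    }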

3/ Paging.

The idiom of a sequence of SELECT / ORDER BY / OFFSET / LIMIT calls with 
changes in OFFSET to get different slices of a result set happens in linked 
data apps (and others).

We've been optimizing these in ARQ using "top N" queries, but LD-Access can 
offer facilities at a different granularity: catch the query, issue the full 
SELECT / ORDER BY query, and cache the results.  Then the results can be 
sliced into pages without going back to the server.

One side effect of this is paging without sorting; another is moving sorting 
away from the origin server.

Sorting is expensive, but it's needed to guarantee stability of the result set 
being sliced into pages.  So issue the query as a plain SELECT and either sort 
locally (where you get to choose the resources available) to produce the same 
sorted, pageable results, or, if the ordering is there only for stability, 
drop the ORDER BY and replace it with a promise to slice from an unchanging 
result set.
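
A sketch of the simple version - strip the slicing modifiers to form the cache 
key, fetch the whole result set once, slice locally (PagingCache and its 
in-memory map are illustrative only):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.query.QuerySolution;

    public class PagingCache {
        private final String endpoint;
        private final ConcurrentHashMap<String, List<QuerySolution>> cache =
                new ConcurrentHashMap<>();

        public PagingCache(String endpoint) { this.endpoint = endpoint; }

        public List<QuerySolution> page(String queryString) {
            Query query = QueryFactory.create(queryString);
            long offset = query.hasOffset() ? query.getOffset() : 0L;
            long limit  = query.hasLimit()  ? query.getLimit()  : Long.MAX_VALUE;

            // Key on the query *without* OFFSET/LIMIT so every page of the
            // same SELECT / ORDER BY shares one cached full result set.
            query.setOffset(Query.NOLIMIT);
            query.setLimit(Query.NOLIMIT);
            String key = query.toString();

            List<QuerySolution> full = cache.computeIfAbsent(key, k -> {
                try (QueryExecution qexec =
                         QueryExecutionFactory.sparqlService(endpoint, query)) {
                    List<QuerySolution> rows = new ArrayList<>();
                    qexec.execSelect().forEachRemaining(rows::add);
                    return rows;
                }
            });

            int from = (int) Math.min(offset, full.size());
            long end = (limit == Long.MAX_VALUE) ? full.size() : offset + limit;
            int to   = (int) Math.min(end, full.size());
            return full.subList(from, to);   // this page, no second round trip
        }
    }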

4/ Add reliability/predictability.

Defend the app from bad data - always fetch the entire result set, to check it 
will all be available, before responding to the client.

Or add query timeouts.

Or fix up the response format if it isn't quite correct.
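
For instance (a sketch; the 30-second timeout is arbitrary), ARQ can put a 
timeout on the execution, and copying the result set forces every row across 
the wire before anything is sent on to the client:

    import java.util.concurrent.TimeUnit;

    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.ResultSetFactory;
    import org.apache.jena.query.ResultSetRewindable;

    public class GuardedQuery {
        public static ResultSetRewindable execute(String endpoint, String queryString) {
            try (QueryExecution qexec =
                     QueryExecutionFactory.sparqlService(endpoint, queryString)) {
                qexec.setTimeout(30, TimeUnit.SECONDS);   // overall execution timeout
                // Materializing pulls the entire result set now; if the origin
                // fails half-way, the client gets an error, not a truncated answer.
                return ResultSetFactory.copyResults(qexec.execSelect());
            }
        }
    }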

5/ Intermittent endpoints.

It's hard to run a public endpoint on the open web.  DBpedia is not always up, 
and when it is up, it's busy with other requests.

DBpedia has a (necessary) defensive query execution timeout - it is easier to 
get a query to run late at (American) night than in the European afternoon.  
Why not issue the queries for the resources you want to track in a batch 
script and pick up the results during the day?  That doesn't work for all 
situations, but it can be useful.
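
A sketch of such a batch, assuming 03:00 UTC is off-peak and reusing the 
hypothetical QueryCache from point 1 to hold the results:

    import java.time.Duration;
    import java.time.LocalTime;
    import java.time.ZoneOffset;
    import java.time.ZonedDateTime;
    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class NightlyBatch {
        public static void schedule(QueryCache cache, List<String> trackedQueries) {
            ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();
            // Aim for 03:00 UTC (assumed quiet); adjust to the endpoint's off-peak.
            ZonedDateTime now = ZonedDateTime.now(ZoneOffset.UTC);
            ZonedDateTime next = now.with(LocalTime.of(3, 0));
            if (!next.isAfter(now)) next = next.plusDays(1);
            long initialDelay = Duration.between(now, next).toMinutes();

            scheduler.scheduleAtFixedRate(
                () -> trackedQueries.forEach(cache::select),    // warm the cache
                initialDelay, TimeUnit.DAYS.toMinutes(1), TimeUnit.MINUTES);
        }
    }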

6/ Resource caching.

1-5 are about SPARQL queries, mainly SELECT.  What about caching data about 
resources (all triples with the same subject)?

Break up a DESCRIBE query into its pattern and its resources, issue the 
pattern, see which resources it will describe, and fetch only the ones not 
already cached.  This might be a loss, as it is a double round trip.
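
A sketch of the easy half, plain DESCRIBE <uri> ... (the pattern form would 
first need the pattern executed to discover which resources it names); 
DescribeCache and its in-memory map are made up:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.jena.graph.Node;
    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    public class DescribeCache {
        private final String endpoint;
        private final Map<String, Model> cache = new ConcurrentHashMap<>();

        public DescribeCache(String endpoint) { this.endpoint = endpoint; }

        public Model describe(String queryString) {
            Query query = QueryFactory.create(queryString);
            Model answer = ModelFactory.createDefaultModel();
            for (Node uri : query.getResultURIs()) {
                Model m = cache.get(uri.getURI());
                if (m == null) {                    // cache miss: one round trip
                    try (QueryExecution qexec = QueryExecutionFactory.sparqlService(
                             endpoint, "DESCRIBE <" + uri.getURI() + ">")) {
                        m = qexec.execDescribe();
                    }
                    cache.put(uri.getURI(), m);
                }
                answer.add(m);
            }
            return answer;
        }
    }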

7/ Not SPARQL at all.

This gets into a very different kind of server.  It caches information (and 
here "cache" may be "publish") as information about things named by URI, e.g. 
all triples with the same subject.

Access is plain GET ?subject=<uri> -- it's a key-value store or document store 
rather than a SPARQL endpoint.  It will scale; it can use any one of the NoSQL 
key-value stores out there.
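
The lookup behind such a GET might be no more than this (SubjectStore is a 
made-up name, and a Jena Model stands in for whatever store actually holds 
the data):

    import java.io.OutputStream;

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.RDFNode;
    import org.apache.jena.rdf.model.Resource;

    public class SubjectStore {
        private final Model store;   // could equally be a NoSQL key-value store

        public SubjectStore(Model store) { this.store = store; }

        public void get(String subjectUri, OutputStream out) {
            // The "document" for a URI is simply all triples with that subject.
            Resource subject = store.createResource(subjectUri);
            Model doc = ModelFactory.createDefaultModel();
            doc.add(store.listStatements(subject, null, (RDFNode) null));
            doc.write(out, "TURTLE");
        }
    }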

Add secondary indexes - e.g. a Lucene index.  The index is simply a way to ask 
a question and get back a list of URIs.  The URIs are accessed and all the RDF 
is sent back to the requester.  How the index gets built is not defined.

Or a geospatial index - get information about all things in a bounding box.
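
Sketched as an interface, the index contract is tiny (names made up; 
SubjectStore is the sketch above):

    import java.io.OutputStream;
    import java.util.List;

    // A secondary index is anything that turns a question into URIs:
    // a Lucene text query, a bounding box, whatever.
    interface SecondaryIndex {
        List<String> lookup(String question);
    }

    class IndexedAccess {
        private final SecondaryIndex index;
        private final SubjectStore store;

        IndexedAccess(SecondaryIndex index, SubjectStore store) {
            this.index = index;
            this.store = store;
        }

        void query(String question, OutputStream out) {
            for (String uri : index.lookup(question))
                store.get(uri, out);   // send back all the RDF for each hit
        }
    }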

You can see this is like various NoSQL-ish things out there, and in the spirit 
of OData/GData -- this is "RData".

For Fuseki+TDB, I'd like to get to supporting conditional GETs, using 
transactions to generate a new ETag.
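
A sketch of the HTTP side, with a hypothetical onCommit hook standing in for 
the transaction machinery:

    import java.io.IOException;

    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class ConditionalGetServlet extends HttpServlet {
        private volatile String currentEtag = "\"gen-0\"";

        // Hypothetical hook: called after each write transaction commits.
        public void onCommit(long generation) {
            currentEtag = "\"gen-" + generation + "\"";
        }

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            String ifNoneMatch = req.getHeader("If-None-Match");
            if (currentEtag.equals(ifNoneMatch)) {
                resp.setStatus(HttpServletResponse.SC_NOT_MODIFIED);  // 304, no body
                return;
            }
            resp.setHeader("ETag", currentEtag);
            // ... execute the query and write the response body here ...
        }
    }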
