[
https://issues.apache.org/jira/browse/JENA-140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paolo Castagna deleted JENA-140:
--------------------------------
> LD-Access - A caching layer for SPARQL endpoints
> ------------------------------------------------
>
> Key: JENA-140
> URL: https://issues.apache.org/jira/browse/JENA-140
> Project: Jena
> Issue Type: Wish
> Environment: It will work with any SPARQL endpoint (local or remote
> and different implementations)
> Reporter: Paolo Castagna
> Labels: caching, performaces, sparql
> Original Estimate: 720h
> Remaining Estimate: 720h
>
> (the content of this description is taken from Andy's message on jena-dev
> with minor editing: http://markmail.org/message/p5x334m7dy676oik)
> The starting point is to have an intercepting SPARQL endpoint for another
> endpoint. The application uses the cache URL for SPARQL queries.
> It's not supposed to be a big project - it might be a servlet SPARQL endpoint
> and/or yet-another query engine implementation.
> 1/ Same query.
> The most obvious is repeating the same query. It's sometimes surprising just
> how often a query is repeated, across users (e.g. the same starting point of an
> app), but even by the same user. Having a close cache is noticeably faster
> than going to a remote endpoint.
> HTTP caching also catches this, or it would. A GET with query string isn't
> cached by squid apparently.
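A minimal sketch of the "same query" case in 1/, assuming an in-memory cache keyed on the exact query text. The class name QueryCache and the fetch callable are hypothetical; fetch stands in for whatever actually sends the query to the remote endpoint. A real implementation would probably canonicalise the query with a parser and expire entries.

```python
import hashlib
from typing import Callable, Dict


class QueryCache:
    """In-memory cache keyed on the query text (hypothetical helper)."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        # Assumption: fetch(query) sends the query to the origin endpoint
        # and returns the serialised result set.
        self._fetch = fetch
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(query: str) -> str:
        # Only leading/trailing whitespace is ignored; anything smarter
        # risks changing the meaning of string literals in the query.
        return hashlib.sha256(query.strip().encode("utf-8")).hexdigest()

    def execute(self, query: str) -> str:
        k = self.key(query)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        result = self._fetch(query)
        self._store[k] = result
        return result
```

The application would call execute() on the cache instead of talking to the remote endpoint directly; a repeated query then never leaves the local machine.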
> 2/ Convert formats
> Given a proxy that is looking at the request, the format of the response can
> be converted between formats, so convert from SPARQL XML to SPARQL JSON for
> example.
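As an illustration of 2/, here is a rough sketch of converting a SELECT result set from the SPARQL XML results format to the SPARQL JSON results format using only the Python standard library. The function name sparql_xml_to_json is made up for this example, and ASK results (the boolean element) are deliberately not handled.

```python
import json
import xml.etree.ElementTree as ET

SR = "{http://www.w3.org/2005/sparql-results#}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def sparql_xml_to_json(xml_text: str) -> str:
    """Convert SPARQL XML SELECT results to SPARQL JSON (sketch only)."""
    root = ET.fromstring(xml_text)
    variables = [v.get("name") for v in root.iter(SR + "variable")]
    bindings = []
    for result in root.iter(SR + "result"):
        row = {}
        for binding in result.findall(SR + "binding"):
            term = binding[0]            # the uri, literal or bnode child
            kind = term.tag[len(SR):]    # strip the namespace prefix
            value = {"type": kind, "value": term.text or ""}
            if kind == "literal":
                if term.get(XML_LANG):
                    value["xml:lang"] = term.get(XML_LANG)
                if term.get("datatype"):
                    value["datatype"] = term.get("datatype")
            row[binding.get("name")] = value
        bindings.append(row)
    return json.dumps({"head": {"vars": variables},
                       "results": {"bindings": bindings}})
```

A proxy could pick the output format from the client's Accept header and always ask the origin server for whichever format it serves most reliably.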
> 3/ Paging.
> The idiom of a sequence of SELECT / ORDER BY / OFFSET / LIMIT calls with
> changes in OFFSET to get different slices of a result set happens in linked
> data apps (and others).
> We've been optimizing these in ARQ using "top N" queries but LD-Access can
> offer facilities at a different granularity. Catch that query, issue the
> full SELECT / ORDER BY query, cache the results. Then you can slice the
> results as pages without going back to the server.
> One side effect of this is paging without sorting, another is moving sorting
> away from the origin server.
> Sorting is expensive but it's needed to guarantee stability of the result set
> being sliced into pages. So issue the query as a plain SELECT and either sort
> locally (you get to choose the resources available) to get the same sorted
> pageable results or, if ordering is only for stability, just remove the
> ORDER BY and replace it with a promise to slice from an unchanging result set.
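The paging idea in 3/ could look roughly like the sketch below: strip the trailing OFFSET / LIMIT, fetch and cache the full result set once, then serve each page as a local slice. split_paging, PagingCache and fetch_all are hypothetical names, and the regular expression is a shortcut that only recognises OFFSET / LIMIT at the very end of the query text; a real proxy would use a SPARQL parser.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

_LIMIT = re.compile(r"\bLIMIT\s+(\d+)", re.IGNORECASE)
_OFFSET = re.compile(r"\bOFFSET\s+(\d+)", re.IGNORECASE)
# Trailing run of LIMIT/OFFSET clauses, in either order.
_TAIL = re.compile(r"(\s*\b(?:LIMIT|OFFSET)\s+\d+)+\s*$", re.IGNORECASE)


def split_paging(query: str) -> Tuple[str, int, Optional[int]]:
    """Return (query without trailing paging, offset, limit or None)."""
    m = _TAIL.search(query)
    if m is None:
        return query.strip(), 0, None
    tail = m.group(0)
    limit = _LIMIT.search(tail)
    offset = _OFFSET.search(tail)
    return (query[:m.start()].strip(),
            int(offset.group(1)) if offset else 0,
            int(limit.group(1)) if limit else None)


class PagingCache:
    def __init__(self, fetch_all: Callable[[str], List[dict]]) -> None:
        # Assumption: fetch_all(query) runs the unpaged query on the
        # origin server and returns every row.
        self._fetch_all = fetch_all
        self._rows: Dict[str, List[dict]] = {}

    def execute(self, query: str) -> List[dict]:
        base, offset, limit = split_paging(query)
        if base not in self._rows:
            self._rows[base] = self._fetch_all(base)
        rows = self._rows[base]
        end = None if limit is None else offset + limit
        return rows[offset:end]
```

Because every page is sliced from the same cached list, the pages are stable with respect to each other even if the ORDER BY were dropped before sending the query on.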
> 4/ Add reliability/predictability.
> Defend the app from bad data - always get the entire results back to check
> they will all be available before responding to the client.
> Or add query timeouts.
> Or fix the format if it isn't quite correct.
> 5/ Intermittent endpoints.
> It's hard to run a public endpoint on the open web. dbpedia is not always
> up, and if it's up, then it's busy because of other requests.
> dbpedia has a (necessary) defensive query execution timeout - it is easier to
> get a query to run late at (American) night than in the European afternoon. Why not
> issue the queries for resources you want to track in a batch script and pick
> up the results during the day? Doesn't work for all situations but it can be
> useful.
> 6/ Resource caching.
> 1-5 are about SPARQL queries, mainly SELECT. What about caching data about
> resources (all triples with the same subject)?
> Break up a DESCRIBE query into pattern and resources, issue the pattern, see
> what resources it will describe and only get ones not cached. This might be
> a loss as it is a double round trip.
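A sketch of the resource-caching flow in 6/, under the assumption that the DESCRIBE has already been split into a pattern and that two callables exist: select_resources(pattern) returning the URIs the pattern matches, and describe(uris) returning the triples per URI. Both callables and the ResourceCache class are hypothetical. The two calls are the double round trip mentioned above.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]


class ResourceCache:
    def __init__(self,
                 select_resources: Callable[[str], List[str]],
                 describe: Callable[[List[str]], Dict[str, List[Triple]]]) -> None:
        self._select = select_resources
        self._describe = describe
        self.cache: Dict[str, List[Triple]] = {}   # subject URI -> triples

    def describe(self, pattern: str) -> Dict[str, List[Triple]]:
        uris = self._select(pattern)                 # round trip 1
        missing = [u for u in uris if u not in self.cache]
        if missing:
            fetched = self._describe(missing)        # round trip 2
            for u in missing:
                # Remember empty descriptions too, so they are not refetched.
                self.cache[u] = fetched.get(u, [])
        return {u: self.cache[u] for u in uris}
```

When every resource is already cached the second round trip disappears, which is the case where this approach pays off.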
> 7/ Not SPARQL at all.
> This gets into a very different kind of server. It caches information (and
> here "cache" may be "publish") as information about things named by URI, e.g.
> all triples with the same subject.
> Access is plain GET ?subject=<uri> -- it's a key-value store or document
> store providing SPARQL. It will scale; it can use any one of the NoSQL
> KeyValue stores out there.
> Add secondary indexes - e.g. a Lucene index. The index is simply a way to
> ask a question and get a list of URIs. The URIs are accessed and all the RDF
> sent back to the requester. How the index gets formed is not defined.
> Or a geospatial index - get information about all things in a bounding box.
> You can see this is like various NoSQL-ish things out there, and in the
> spirit of OData/GData -- this is "RData".
> For Fuseki+TDB, I'd like to get to support for conditional GETs using
> transactions to generate a new eTag.
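To make the conditional GET idea concrete, here is a toy sketch in which every committed write bumps a version counter and the ETag is derived from that counter. This is not the Fuseki or TDB API; VersionedStore and conditional_get are invented for illustration, and a single store-wide version is an assumption (a per-resource tag would be finer grained).

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]


class VersionedStore:
    """Toy store: each committed write produces a new version."""

    def __init__(self) -> None:
        self.version = 0
        self.data: Dict[str, List[Triple]] = {}

    def commit(self, subject: str, triples: List[Triple]) -> None:
        self.data[subject] = triples
        self.version += 1        # a new transaction means a new ETag

    def etag(self) -> str:
        return '"v%d"' % self.version


def conditional_get(store: VersionedStore, subject: str,
                    if_none_match: Optional[str] = None):
    """Return (status, headers, body); body is None for 304."""
    tag = store.etag()
    if if_none_match == tag:
        return 304, {"ETag": tag}, None
    return 200, {"ETag": tag}, store.data.get(subject, [])
```

A client that sends back the ETag it last saw in If-None-Match gets a 304 with no body until some transaction commits, after which the next GET returns fresh data and a new tag.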