[jira] [Deleted] (JENA-140) LD-Access - A caching layer for SPARQL endpoints

Paolo Castagna (Deleted) (JIRA) Fri, 14 Oct 2011 07:28:33 -0700

     [ 
https://issues.apache.org/jira/browse/JENA-140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Paolo Castagna deleted JENA-140:
--------------------------------

    
> LD-Access - A caching layer for SPARQL endpoints
> ------------------------------------------------
>
>                 Key: JENA-140
>                 URL: https://issues.apache.org/jira/browse/JENA-140
>             Project: Jena
>          Issue Type: Wish
>         Environment: It will work with any SPARQL endpoint (local or remote 
> and different implementations)
>            Reporter: Paolo Castagna
>              Labels: caching, performaces, sparql
>   Original Estimate: 720h
>  Remaining Estimate: 720h
>
> (the content of this description is taken from Andy's message on jena-dev 
> with minor editing: http://markmail.org/message/p5x334m7dy676oik)
> The starting point is to have an intercepting SPARQL endpoint for another 
> endpoint.  The application uses the cache URL for SPARQL queries.
> It's not supposed to be a big project - it might be a servlet SPARQL endpoint 
> and/or yet-another query engine implementation.
> 1/ Same query.
> The most obvious is repeating the same query.  It's sometimes surprising just 
> how often a query is repeated, across users (e.g. the same starting point of an 
> app), but even by the same user.  Having a close cache is noticeably faster 
> than going to a remote endpoint.
> HTTP caching would also catch this, except that a GET with a query string 
> apparently isn't cached by squid.
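
A minimal sketch of the same-query cache described in 1/, assuming an in-memory store and a fixed time-to-live. The names QueryCache, fetch and ttl_seconds are hypothetical and not part of Jena or ARQ; fetch stands in for whatever performs the real HTTP request.

```python
import hashlib
import time
from typing import Callable, Dict, Tuple


class QueryCache:
    """Cache raw SPARQL responses keyed by endpoint URL and query text."""

    def __init__(self, fetch: Callable[[str, str], bytes], ttl_seconds: float = 300.0):
        self._fetch = fetch          # hypothetical: performs the real remote request
        self._ttl = ttl_seconds      # hypothetical freshness window
        self._store: Dict[str, Tuple[float, bytes]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(endpoint: str, query: str) -> str:
        # Exact-text match after trimming; no attempt at query normalization.
        text = f"{endpoint}\n{query.strip()}"
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def query(self, endpoint: str, query: str) -> bytes:
        key = self._key(endpoint, query)
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]          # served from the close cache
        self.misses += 1
        body = self._fetch(endpoint, query)
        self._store[key] = (now, body)
        return body
```

Keying on the exact query text is the simplest choice; it misses queries that are equivalent but written differently.
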
> 2/ Convert formats
> Given a proxy that is looking at the request, the response can be converted 
> between formats, for example from SPARQL XML to SPARQL JSON.
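
A rough sketch of the XML-to-JSON conversion mentioned in 2/, using only the Python standard library. It assumes a SELECT result in the W3C SPARQL Query Results XML Format and does not handle ASK results; the function name sparql_xml_to_json is hypothetical.

```python
import json
import xml.etree.ElementTree as ET

SR = "{http://www.w3.org/2005/sparql-results#}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def _term(node):
    """Convert a <uri>, <literal> or <bnode> element to its JSON form."""
    kind = node.tag[len(SR):]
    term = {"type": kind, "value": node.text or ""}
    if kind == "literal":
        if node.get(XML_LANG):
            term["xml:lang"] = node.get(XML_LANG)
        if node.get("datatype"):
            term["datatype"] = node.get("datatype")
    return term


def sparql_xml_to_json(xml_text: str) -> str:
    """Convert a SELECT result set from SPARQL XML to SPARQL JSON."""
    root = ET.fromstring(xml_text)
    variables = [v.get("name") for v in root.findall(f"{SR}head/{SR}variable")]
    bindings = []
    for result in root.findall(f"{SR}results/{SR}result"):
        row = {}
        for binding in result.findall(f"{SR}binding"):
            row[binding.get("name")] = _term(binding[0])
        bindings.append(row)
    return json.dumps({"head": {"vars": variables},
                       "results": {"bindings": bindings}})
```
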
> 3/ Paging.
> The idiom of a sequence of SELECT / ORDER BY / OFFSET / LIMIT calls with 
> changes in OFFSET to get different slices of a result set happens in linked 
> data apps (and others).
> We've been optimizing these in ARQ using "top N" queries but LD-Access can 
> offer facilities at a different granularity.  Catch that query, issue the 
> full SELECT / ORDER BY query, cache the results.  Then you can slice the 
> results as pages without going back to the server.
> One side effect of this is paging without sorting, another is moving sorting 
> away from the origin server.
> Sorting is expensive but it's needed to guarantee stability of the result set 
> being sliced into pages.  So issue the query as a plain SELECT and either 
> sort locally (you get to choose the resources available) to get the same 
> sorted, pageable results, or, if ordering is only for stability, just remove 
> the ORDER BY and replace it with a promise to slice from an unchanging 
> result set.
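
The paging idea in 3/ can be sketched as follows. This is an illustration under simplifying assumptions, not how ARQ does it: the trailing LIMIT/OFFSET is stripped with a regular expression rather than a real SPARQL parser, results are held in memory, and split_slice, PagingCache and run_query are hypothetical names.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Trailing LIMIT/OFFSET clauses in either order. Only the end of the query
# text is examined, so slices inside subqueries are left alone.
_SLICE = re.compile(r"(?:\s+(?:LIMIT|OFFSET)\s+\d+){1,2}\s*$", re.IGNORECASE)


def split_slice(query: str) -> Tuple[str, int, Optional[int]]:
    """Return (query without trailing LIMIT/OFFSET, offset, limit)."""
    match = _SLICE.search(query)
    if match is None:
        return query.strip(), 0, None
    tail = match.group(0)
    limit = re.search(r"LIMIT\s+(\d+)", tail, re.IGNORECASE)
    offset = re.search(r"OFFSET\s+(\d+)", tail, re.IGNORECASE)
    return (
        query[: match.start()].strip(),
        int(offset.group(1)) if offset else 0,
        int(limit.group(1)) if limit else None,
    )


class PagingCache:
    """Fetch the full result set once, then serve pages from memory."""

    def __init__(self, run_query: Callable[[str], List[dict]]):
        self._run_query = run_query      # hypothetical: runs on the origin endpoint
        self._results: Dict[str, List[dict]] = {}

    def page(self, query: str) -> List[dict]:
        base, offset, limit = split_slice(query)
        if base not in self._results:
            self._results[base] = self._run_query(base)   # single full fetch
        rows = self._results[base]
        end = None if limit is None else offset + limit
        return rows[offset:end]
```

Because every page is cut from the same cached list, the slices are stable even if the ORDER BY is dropped before the query is sent to the origin.
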
> 4/ Add reliability/predictability.
> Defend the app from bad data - always get the entire results back to check 
> they will all be available before responding to the client.
> Or add query timeouts.
> Or fix the format if it isn't quite correct.
> 5/ Intermittent endpoints.
> It's hard to run a public endpoint on the open web.  dbpedia is not always 
> up, and if it's up, then it's busy because of other requests.
> dbpedia has a (necessary) defensive query execution timeout - it is easier to 
> get a query to run late at (American) night than in the European afternoon.  Why not 
> issue the queries for resources you want to track in a batch script and pick 
> up the results during the day?  Doesn't work for all situations but it can be 
> useful.
> 6/ Resource caching.
> 1-5 are about SPARQL queries, mainly SELECT.  What about caching data about 
> resources (all triples with the same subject)?
> Break up a DESCRIBE query into pattern and resources, issue the pattern, see 
> what resources it will describe and only get ones not cached.  This might be 
> a loss as it is a double round trip.
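
A sketch of the resource caching in 6/, showing the two round trips. It assumes two hypothetical callables supplied by the caller: find_resources runs the pattern and returns the matching resource URIs, and describe fetches the triples for a list of URIs. Neither is a Jena API.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]


class ResourceCache:
    """Cache descriptions (all triples with the same subject) per resource."""

    def __init__(self,
                 find_resources: Callable[[str], List[str]],
                 describe: Callable[[List[str]], Dict[str, List[Triple]]]):
        self._find_resources = find_resources   # hypothetical: pattern -> URIs
        self._describe = describe               # hypothetical: URIs -> triples
        self._cache: Dict[str, List[Triple]] = {}

    def describe_pattern(self, pattern: str) -> Dict[str, List[Triple]]:
        uris = self._find_resources(pattern)            # round trip 1
        missing = [u for u in uris if u not in self._cache]
        if missing:
            self._cache.update(self._describe(missing)) # round trip 2, uncached only
        return {u: self._cache.get(u, []) for u in uris}
```
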
> 7/ Not SPARQL at all.
> This gets into a very different kind of server.  It caches information (and 
> here "cache" may be "publish") as information about things named by URI, e.g. 
> all triples with the same subject.
> Access is plain GET ?subject=<uri> -- it's a key-value store or document 
> store providing SPARQL.  It will scale; it can use any one of the NoSQL 
> KeyValue stores out there.
> Add secondary indexes - e.g. a Lucene index.  The index is simply a way to 
> ask a question and get a list of URIs.  The URIs are accessed and all the RDF 
> sent back to the requester.  How the index gets formed is not defined.
> Or a geospatial index - get information about all things in a bounding box.
> You can see this is like various NoSQL-ish things out there, and in the 
> spirit of OData/GData -- this is "RData".
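
To make the "not SPARQL at all" server concrete, here is a toy in-memory sketch: a key-value lookup by subject behind GET ?subject=<uri>, plus a naive word index standing in for a secondary index such as Lucene. SubjectStore and its methods are hypothetical, and the word index is an assumption for illustration, since how the index gets formed is left undefined above.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple
from urllib.parse import parse_qs, urlsplit

Triple = Tuple[str, str, str]


class SubjectStore:
    """Key-value store: subject URI -> all triples with that subject."""

    def __init__(self) -> None:
        self._by_subject: Dict[str, List[Triple]] = defaultdict(list)
        self._by_word: Dict[str, Set[str]] = defaultdict(set)

    def add(self, triple: Triple) -> None:
        subject, _predicate, obj = triple
        self._by_subject[subject].append(triple)
        # Naive secondary index: each whitespace-separated word of the object.
        for word in obj.lower().split():
            self._by_word[word].add(subject)

    def get(self, url: str) -> List[Triple]:
        """Serve GET ?subject=<uri> as a plain key lookup."""
        params = parse_qs(urlsplit(url).query)
        subject = params.get("subject", [""])[0]
        return list(self._by_subject.get(subject, []))

    def search(self, word: str) -> List[Triple]:
        """Ask the index for URIs, then return all RDF for each of them."""
        uris = sorted(self._by_word.get(word.lower(), set()))
        return [t for uri in uris for t in self._by_subject[uri]]
```
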
> For Fuseki+TDB, I'd like to get to support for conditional GETs using 
> transactions to generate a new ETag.
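
One way to picture conditional GETs driven by transactions: every committed write bumps a version, the ETag is derived from that version, and a request whose If-None-Match matches gets a 304. This is only a sketch of the idea with a hypothetical VersionedStore; it does not use or describe any Fuseki or TDB API.

```python
from typing import Dict, Optional, Tuple


class VersionedStore:
    """Toy store whose ETag changes on every committed write."""

    def __init__(self) -> None:
        self._version = 0
        self._data: Dict[str, str] = {}

    def commit(self, updates: Dict[str, str]) -> None:
        """Apply a write 'transaction' and bump the version."""
        self._data.update(updates)
        self._version += 1

    @property
    def etag(self) -> str:
        return f'"v{self._version}"'

    def conditional_get(self, key: str, if_none_match: Optional[str] = None
                        ) -> Tuple[int, Dict[str, str], Optional[str]]:
        """Return (status, headers, body) for a GET with optional If-None-Match."""
        headers = {"ETag": self.etag}
        if key not in self._data:
            return 404, headers, None
        if if_none_match == self.etag:
            return 304, headers, None    # client copy is still current
        return 200, headers, self._data[key]
```
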

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
