[
https://issues.apache.org/jira/browse/JENA-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254157#comment-15254157
]
ASF GitHub Bot commented on JENA-626:
-------------------------------------
Github user afs commented on the pull request:
https://github.com/apache/jena/pull/95#issuecomment-213496231
**Intent and Abstraction**
Fuseki-caching isn't going to beat Varnish, so I think the better use of
Fuseki-caching is supporting cases that need flexibility, like (in the future)
paging results.
I've sketched something in a branch in a work area:
https://github.com/afs/jena/tree/fuseki-cache
This is a sketch and not for serious use - the implementation is
quick-and-easy in places, and it's only lightly tested, with one dataset only.
There is no configurability.
The classes changed are `HttpAction`, `ResultsCache`, and `SPARQL_Query`
(to use the cache). `SPARQL_Query` has operations `processViaCache`,
`prepareForCache`, `insertIntoCache`. It deals with two concurrent attempts to
set the cache by letting them both run (it's the same answer, right?!) and
both set the cache.
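As a rough illustration of the "let them both run" approach, here is a minimal sketch in plain Java. It is not the code from the branch: the class `QueryResultCacheSketch` and its method `getOrEvaluate` are hypothetical names, and the query string is assumed to be the cache key.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

/** Hypothetical stand-in for a results cache; not the ResultsCache class from the branch. */
final class QueryResultCacheSketch<R> {
    private final ConcurrentMap<String, R> entries = new ConcurrentHashMap<>();

    /** Returns the cached result for the query string, or evaluates the query and stores the result. */
    R getOrEvaluate(String queryString, Supplier<R> evaluate) {
        R cached = entries.get(queryString);
        if (cached != null) {
            return cached;
        }
        // No lock is held while evaluating, so two concurrent misses both run the query.
        R fresh = evaluate.get();
        // Both callers store their result; the later put simply overwrites the earlier one.
        entries.put(queryString, fresh);
        return fresh;
    }

    /** Number of cached entries. */
    int size() {
        return entries.size();
    }
}
```

The trade-off is duplicated work on a concurrent miss in exchange for no blocking between readers; it relies on both evaluations producing the same answer.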
The cache is invalidated when `HttpAction.beginWrite` is called, so all update
routes are caught (SPARQL Update, GSP and the Uploader). I don't like that - it
seems asymmetric that `beginWrite` is used, and it assumes MR+SW (multiple
readers and a single writer).
Cache actions are logged `** Cache`.
**Space**
If the query result (not the serialization) is stored, I would expect the
memory footprint to be smaller because of sharing nodes with the original
dataset. Any variable bound by graph pattern matching ends up holding the
node by reference. Calculated expressions are fresh nodes. Long literals are
shared.
Literals from the data are not an extra cost in memory. Let's assume that
calculated nodes are small. This is usually true - but there may be a lot of
them.
The memory cost is now approximated by the total number of cells in the
results, i.e. "number of rows * number of columns", and it can be calculated
while capturing the `ResultSet` copy. We could put limits on the size of
result sets cached and on the total number of cells.
Serialized results can easily be sized. They do not share space though.
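To make the cell-count idea concrete, here is a minimal sketch in plain Java that counts cells while copying rows and gives up once a per-result limit is exceeded. The class `CellCountingCopy`, the method `capture`, and the `maxCells` parameter are hypothetical; rows are modelled as plain lists rather than Jena's result-set types.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: approximates result size as rows * columns while copying. */
final class CellCountingCopy {
    /**
     * Copies rows while counting cells. Returns an empty Optional as soon as the
     * count would exceed maxCells, meaning "too large to cache".
     * This only decides cacheability; a real server would still have to send
     * the remaining rows to the client.
     */
    static <T> Optional<List<List<T>>> capture(int columns, Iterator<List<T>> rows, long maxCells) {
        List<List<T>> copy = new ArrayList<>();
        long cells = 0;
        while (rows.hasNext()) {
            // Every row is charged the full column count, bound or not.
            cells += columns;
            if (cells > maxCells) {
                return Optional.empty();
            }
            copy.add(rows.next());
        }
        return Optional.of(copy);
    }
}
```

A limit on the total number of cells across all cached results could be layered on top by summing the per-result counts.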
**Configuration**
We need some configuration control, both server-wide on the `fuseki:Server`
object in config.ttl and on each service. Or use "Context" - caching is
important, so my suggestion is to have properties to clearly set values.
The server-wide case is, I think, less important. I suggest putting the
configuration on service, not the dataset, so you can have two different
policies, like cached and not-cached, on the same data.
The default should be "no caching".
Having two services addresses the "cold cache/development" use case.
We should still obey `Pragma: no-cache` and `Cache-Control`, but there are
quite a lot of options and details, so it might be wise not to aim to have
everything for a first release, especially if caching is off by default.
**Other**
Related-but-different observation: supporting conditional GETs would be
very good. Just keep an epoch number/timestamp for each dataset.
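One way the per-dataset epoch number could drive conditional GETs is to expose it as an `ETag` and compare it with the request's `If-None-Match` header. The following is a minimal sketch in plain Java under that assumption; the class `DatasetEpochSketch` and its methods are hypothetical, and it ignores ETag lists, `*`, and weak validators.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-dataset epoch counter used as an HTTP validator. */
final class DatasetEpochSketch {
    private final AtomicLong epoch = new AtomicLong();

    /** Call once for each write to the dataset. */
    void onWrite() {
        epoch.incrementAndGet();
    }

    /** ETag value (quoted, as HTTP requires) derived from the current epoch. */
    String currentETag() {
        return "\"" + epoch.get() + "\"";
    }

    /** True if If-None-Match names the current epoch, so a 304 Not Modified could be sent. */
    boolean notModified(String ifNoneMatch) {
        return ifNoneMatch != null && ifNoneMatch.trim().equals(currentETag());
    }
}
```

Because the epoch only identifies the dataset state, a real implementation would probably need to fold the response format into the tag as well.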
> SPARQL Query Caching
> --------------------
>
> Key: JENA-626
> URL: https://issues.apache.org/jira/browse/JENA-626
> Project: Apache Jena
> Issue Type: Improvement
> Reporter: Andy Seaborne
> Assignee: Saikat Maitra
> Labels: java, linked_data, rdf, sparql
>
> Add a caching layer to Fuseki to cache the results of SPARQL Query requests.
> This cache should allow for in-memory and disk-based caching, configuration
> and cache management, and coordination with data modification.