[ https://issues.apache.org/jira/browse/JENA-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963141#comment-14963141 ]
ASF GitHub Bot commented on JENA-626:
-------------------------------------

Github user afs commented on the pull request:

    https://github.com/apache/jena/pull/95#issuecomment-149183330

These comments focus on the architecture of the proposal.

#### Capturing output

`ResponseResultSet`/`ResultSetFormatter`. See also the discussion point below for a different approach to caching that is not based on capturing the written output.

The current design requires changes to `ResultSetFormatter` and every output format. An alternative is to create a replicating `OutputStream` which can output to two places, one of which can be the string capture. This localises the changes to Fuseki only. A sketch (there may be better ways to achieve the same effect):

```
class OutputStream2 extends OutputStream {
    private final OutputStream out1 ;
    private final OutputStream out2 ;

    public OutputStream2(OutputStream out1, OutputStream out2) {
        this.out1 = out1 ;
        this.out2 = out2 ;
    }

    // OutputStream.write(int) is abstract and must also be overridden.
    @Override public void write(int b) throws IOException {
        if ( out1 != null ) out1.write(b) ;
        if ( out2 != null ) out2.write(b) ;
    }

    @Override public void write(byte b[]) throws IOException {
        if ( out1 != null ) out1.write(b) ;
        if ( out2 != null ) out2.write(b) ;
    }
    ...
}
```

`ResponseResultSet.OutputContent` can use `OutputStream` (it does not need `ServletOutputStream`):

```
OutputStream outServlet = action.response.getOutputStream() ;
OutputStream out ;
if ( writingToCache ) {
    ByteArrayOutputStream outCatcher = new ByteArrayOutputStream() ;
    out = new OutputStream2(outServlet, outCatcher) ;
} else {
    out = outServlet ;
}
...
proc.output(out) ;
```

This will work if a new format is added (a Thrift-based binary format, for example) without the format needing to be aware of cache entry creation. It also means the caching is not exposed in the ARQ API.

#### Caching and content negotiation

The cache key is insensitive to the "Accept" header, yet the format of the output is determined by the "Accept" header. The query string `output=` is merely a non-standard way to achieve the same thing when it is hard to set the HTTP header (as in some lightweight scripting libraries).
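One way to make the cache key sensitive to content negotiation is to include the negotiated media type in the key alongside the query string. A minimal sketch, assuming a hypothetical `CacheKey` class (not part of the proposal or the Fuseki codebase):

```java
import java.util.Objects;

// Hypothetical cache key including the negotiated media type, so the same
// query cached as JSON and as XML produces two distinct cache entries.
class CacheKey {
    private final String queryString;
    private final String mediaType;   // outcome of content negotiation

    CacheKey(String queryString, String mediaType) {
        this.queryString = queryString;
        this.mediaType = mediaType;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof CacheKey)) return false;
        CacheKey k = (CacheKey) other;
        return queryString.equals(k.queryString) && mediaType.equals(k.mediaType);
    }

    @Override
    public int hashCode() {
        return Objects.hash(queryString, mediaType);
    }
}
```

With a key like this, the same query requested with different "Accept" headers maps to different cache entries, at the cost of caching the results once per format.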
The current design writes the cached output in the same format as the request that created the cache entry, but these two requests are different:

```
GET /datasets?query=SELECT * { ?s ?p ?o }
Accept: application/sparql-results+xml
```

```
GET /datasets?query=SELECT * { ?s ?p ?o }
Accept: application/sparql-results+json
```

#### Cache `ResultSet` (Discussion point)

A possibility is that the cache holds a copy of the `ResultSet` (as Java objects).

Advantages:

* The cached item is not in a particular format; content negotiation happens per request.
* OFFSET/LIMIT can be applied to the cached results if the original query is executed without OFFSET/LIMIT (a weak version of paging). See the experimental, sketch-only and out-of-date [sparql-cache](https://svn.apache.org/repos/asf/jena/Experimental/sparql-cache/) for OFFSET/LIMIT processing.

Disadvantages:

* Does not stream.
* Cache entries are re-serialized each time they are used.

An iterator over the result set that captures the output while iterating would address the non-streaming disadvantage.

#### Cache invalidation

Update operations must invalidate the cache. A simple approach is to invalidate the whole cache, because it is very hard to determine selectively whether an update affects a given cache entry.

#### Configuration and Control

The cache is hard-wired and always on, which may not always be the right choice. There needs to be a way to control it, possibly on a per-dataset basis. Note that there is only one `SPARQL_Query` instance per Fuseki server, due to the way dispatch is dynamic.

Suggestion 1: a servlet `SPARQL_Query_Cache` catches requests and passes them on to a separate `SPARQL_Query`. This is the inheritance way to separate the cache code from the rest of the processing. It works better if OFFSET/LIMIT control is going to be added later.

Suggestion 2: it is primarily `SPARQL_Query::sendResults` being caught here, so a "result handler" set in `SPARQL_Query` would allow the cache code to be separated into its own class, with just a hook in `SPARQL_Query`.
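Suggestion 2 might look something like the following sketch. All names here (`ResultHandler`, `resultsSent`, `CachingResultHandler`) are hypothetical illustrations, not the actual Fuseki API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical "result handler" hook: SPARQL_Query would call the handler
// when results are sent, keeping all cache logic out of SPARQL_Query itself.
interface ResultHandler {
    void resultsSent(String cacheKey, byte[] serializedResults);
}

// The cache lives entirely in its own class; the query servlet only holds
// a reference to the hook interface.
class CachingResultHandler implements ResultHandler {
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

    @Override
    public void resultsSent(String cacheKey, byte[] serializedResults) {
        cache.put(cacheKey, serializedResults);
    }

    byte[] lookup(String cacheKey) {
        return cache.get(cacheKey);
    }

    // Whole-cache invalidation on update, as discussed under "Cache invalidation".
    void invalidateAll() {
        cache.clear();
    }
}
```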
This is the composition way to separate the cache code from the rest of the processing.

#### Reuse the Guava already in Jena

Extend `CacheGuava` to have a constructor that takes an `org.apache.jena.ext.com.google.common.cache.CacheBuilder`. Background: Jena includes a shaded Guava 18 in `org.apache.jena.ext.com.google.common` in the artifact `jena-shaded-guava`. (Apparently Hadoop, the main reason we shade Guava, does now in fact work with Guava versions later than the old version it depends on.)

#### Extend to CONSTRUCT and DESCRIBE (future)

The current proposal covers only SELECT and ASK queries. See `ResponseModel` and `ResponseDataset` for CONSTRUCT and DESCRIBE. The output capture point should make this possible.

#### Documentation and tests

Documentation and tests are needed.

> SPARQL Query Caching
> --------------------
>
> Key: JENA-626
> URL: https://issues.apache.org/jira/browse/JENA-626
> Project: Apache Jena
> Issue Type: Improvement
> Reporter: Andy Seaborne
> Labels: java, linked_data, rdf, sparql
>
> Add a caching layer to Fuseki to cache the results of SPARQL Query requests.
> This cache should allow for in-memory and disk-based caching, configuration
> and cache management, and coordination with data modification.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)