[ 
https://issues.apache.org/jira/browse/JENA-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963141#comment-14963141
 ] 

ASF GitHub Bot commented on JENA-626:
-------------------------------------

Github user afs commented on the pull request:

    https://github.com/apache/jena/pull/95#issuecomment-149183330
  
    These comments focus on the architecture of the proposal.
    
    #### Capturing output
    `ResponseResultSet` / `ResultSetFormatter`.
    
    See also the discussion point below for a different approach to caching 
that is not based on capturing the written output.
    
    The current design requires changes to `ResultSetFormatter` and every 
output format. 
    
    An alternative is to create a replicating `OutputStream` which can write to 
two places, one of which can be the string capture.  This localises changes to 
Fuseki only.
    
    Sketch (there may be better ways to achieve the same effect).
    ```
    class OutputStream2 extends OutputStream {
        private final OutputStream out1 ;
        private final OutputStream out2 ;

        public OutputStream2(OutputStream out1, OutputStream out2) {
            this.out1 = out1 ;
            this.out2 = out2 ;
        }

        @Override
        public void write(byte b[]) throws IOException {
            if ( out1 != null ) out1.write(b) ;
            if ( out2 != null ) out2.write(b) ;
        }
        ...
    }
    ```
    `ResponseResultSet.OutputContent` can use `OutputStream` (it does not need 
`ServletOutputStream`).
    ```
        OutputStream outServlet = action.response.getOutputStream() ;
        OutputStream out ;
        if ( writingToCache ) {
            ByteArrayOutputStream outCatcher = new ByteArrayOutputStream() ;
            out = new OutputStream2(outServlet, outCatcher) ;
        } else {
            out = outServlet ;
        }
        ...
        proc.output(out) ;
    ```
    This will work if a new format is added (a Thrift-based binary format, for 
example) without needing the format to be aware of cache entry creation.
    
    It also means the caching is not exposed in the ARQ API.
    
    #### Caching and content negotiation
    The cache key is insensitive to the "Accept" header.
    
    The format of the output is determined by the "Accept" header.  The query 
string `output=` is merely a non-standard way to achieve the same thing when it 
is hard to set the HTTP header (some lightweight scripting libraries).
    
    The current design serialises the output in the format of the first request 
and then serves it from the cache for both, but these two requests are 
different:
    ```
    GET /datasets/query=SELECT * { ?s ?p ?o}
    Accept: application/sparql-results+xml
    ```
    ```
    GET /datasets/query=SELECT * { ?s ?p ?o}
    Accept: application/sparql-results+json
    ```
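    One way to make the cache sensitive to content negotiation is to fold the 
negotiated media type into the cache key.  A minimal sketch (the `CacheKey` 
class and its fields are illustrative, not part of this PR):
    
    ```
    // Hypothetical cache key: two requests with the same query string but
    // different negotiated media types get different cache entries.
    final class CacheKey {
        private final String queryString ;
        private final String mediaType ;

        CacheKey(String queryString, String mediaType) {
            this.queryString = queryString ;
            this.mediaType = mediaType ;
        }

        @Override
        public boolean equals(Object other) {
            if ( !(other instanceof CacheKey) ) return false ;
            CacheKey k = (CacheKey)other ;
            return queryString.equals(k.queryString) && mediaType.equals(k.mediaType) ;
        }

        @Override
        public int hashCode() {
            return 31 * queryString.hashCode() + mediaType.hashCode() ;
        }
    }
    ```
    
    With such a key, the two requests above map to two distinct entries.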
    
    #### Cache `ResultSet`
    (Discussion point) A possibility is that the cache is of a copy of the 
ResultSet (as java objects).
    
    Advantages:
    * cached item is not in a particular format. Content negotiation happens 
per request.
    * OFFSET/LIMIT can be applied to the cached results if the original query 
is executed without OFFSET/LIMIT (a weak version of paging). 
    
    See the experimental, sketch-only and out of date 
[sparql-cache](https://svn.apache.org/repos/asf/jena/Experimental/sparql-cache/)
 for OFFSET/LIMIT processing.
    
    Disadvantages:
    * Does not stream
    * cache entries are re-serialized each time they are used. 
    
    An iterator over the result set that captures the output while iterating 
would address the non-streaming disadvantage.
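    Such a capturing iterator could be sketched like this (plain generic Java, 
not tied to Jena's `ResultSet` API; the names are illustrative):
    
    ```
    import java.util.ArrayList ;
    import java.util.Iterator ;
    import java.util.List ;

    // Rows are streamed to the caller and recorded at the same time,
    // so a cache entry can be built from the captured copy afterwards.
    final class CapturingIterator<T> implements Iterator<T> {
        private final Iterator<T> source ;
        private final List<T> captured = new ArrayList<>() ;

        CapturingIterator(Iterator<T> source) { this.source = source ; }

        @Override public boolean hasNext() { return source.hasNext() ; }

        @Override public T next() {
            T item = source.next() ;
            captured.add(item) ;     // record while streaming
            return item ;
        }

        // Complete once the iterator has been fully consumed.
        List<T> captured() { return captured ; }
    }
    ```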
    
    Content negotiation happens per request.
    
    #### Cache invalidation
    Update operations must invalidate the cache. A simple approach is to 
invalidate the whole cache.  It is very hard to determine selectively whether 
an update affects a particular cache entry.
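    A minimal sketch of whole-cache invalidation (the class and method names 
are hypothetical; the update path would call `invalidateAll` after a commit):
    
    ```
    import java.util.concurrent.ConcurrentHashMap ;

    // Whole-cache invalidation: on any update, drop every entry rather
    // than trying to work out which entries the update affects.
    final class QueryResultCache {
        private final ConcurrentHashMap<String, byte[]> entries = new ConcurrentHashMap<>() ;

        void put(String key, byte[] result) { entries.put(key, result) ; }
        byte[] get(String key)              { return entries.get(key) ; }

        // Called from the update servlet after a successful commit.
        void invalidateAll()                { entries.clear() ; }

        int size()                          { return entries.size() ; }
    }
    ```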
    
    #### Configuration and Control 
    The cache is hard wired and always on.  It may not always be the right 
choice.  There needs to be a way to control it, possibly on a per-dataset 
basis.  Note there is only one SPARQL_Query instance per Fuseki server due to 
the way dispatch is dynamic.
    
    Suggestion 1: A servlet `SPARQL_Query_Cache` catches requests and passes 
them to a separate `SPARQL_Query`.  This is the inheritance way to separate 
cache code from the rest of processing.  It works better if OFFSET/LIMIT 
control is going to be added later.
    
    Suggestion 2: It is primarily `SPARQL_Query::sendResults` being caught here 
so a "result handler" set in `SPARQL_Query` would allow a separation of cache 
code into its own class and just a hook in `SPARQL_Query`. This is the 
composition way to separate the cache code from the rest of processing.
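    Suggestion 2 might look roughly like this (the `ResultHandler` interface 
and both implementations are hypothetical, not existing Fuseki API):
    
    ```
    // Hook set in SPARQL_Query: result delivery goes through a handler,
    // so caching lives in its own class, not in the query servlet.
    interface ResultHandler {
        void sendResults(String cacheKey, Runnable writeOutput) ;
    }

    // Default behaviour: just write the output.
    final class DirectResultHandler implements ResultHandler {
        @Override public void sendResults(String cacheKey, Runnable writeOutput) {
            writeOutput.run() ;
        }
    }

    // Caching behaviour wraps another handler (composition, not inheritance):
    // look up cacheKey and serve from cache, or delegate and capture.
    final class CachingResultHandler implements ResultHandler {
        private final ResultHandler inner ;
        CachingResultHandler(ResultHandler inner) { this.inner = inner ; }
        @Override public void sendResults(String cacheKey, Runnable writeOutput) {
            // (cache lookup / capture elided)
            inner.sendResults(cacheKey, writeOutput) ;
        }
    }
    ```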
    
    #### Reuse the Guava already in Jena  
    Extend CacheGuava to have a constructor that takes a 
`org.apache.jena.ext.com.google.common.cache.CacheBuilder`.
    
    Background:
    Jena includes a shaded Guava 18 in `org.apache.jena.ext.com.google.common` 
in artifact `jena-shaded-guava`.  (Apparently Hadoop, the main reason we shade 
Guava, does now in fact work with Guava versions later than the old Guava 
version it depends on.)
    
    #### Extend to CONSTRUCT and DESCRIBE (future)
    The current patch covers only SELECT and ASK queries.  See `ResponseModel` 
and `ResponseDataset` for CONSTRUCT and DESCRIBE.  The output capture point 
should make this possible.
    
    #### Documentation and tests
    Documentation and tests needed.



> SPARQL Query Caching
> --------------------
>
>                 Key: JENA-626
>                 URL: https://issues.apache.org/jira/browse/JENA-626
>             Project: Apache Jena
>          Issue Type: Improvement
>            Reporter: Andy Seaborne
>              Labels: java, linked_data, rdf, sparql
>
> Add a caching layer to Fuseki to cache the results of SPARQL Query requests.  
> This cache should allow for in-memory and disk-based caching, configuration 
> and cache management, and coordination with data modification.


