[ 
https://issues.apache.org/jira/browse/JENA-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253687#comment-15253687
 ] 

ASF GitHub Bot commented on JENA-626:
-------------------------------------

Github user osma commented on the pull request:

    https://github.com/apache/jena/pull/95#issuecomment-213367009
  
    I got interested in how this would affect our Fuseki performance if it were 
merged. This became quite an experiment and I'm reporting the results here.
    
    First of all I needed a realistic dataset and sets of queries. I decided to 
use four of the SKOS datasets we serve at Finto.fi, together with SPARQL 
queries performed by the Skosmos application. I used the Skosmos [performance 
test suite](https://github.com/NatLibFi/Skosmos/tree/master/tests/performance) 
to generate a set of 36k SPARQL queries that do not involve the jena-text index 
(for simplicity), of which 22% are CONSTRUCT queries and the remainder are 
SELECTs. There are 24k unique SPARQL queries in that set and they should be a 
fairly realistic approximation of queries that we actually encounter.
    
    To give caching a chance, I considered each query twice and then took a 
random sample of 20k of the resulting 72k queries. This revised set has 14k 
unique queries, i.e. 30% of the queries are repetitions of previously seen 
queries. This is a rather difficult data set for caching, since there is so 
much variation - enough to completely fill the cache of 10000 query results. An 
ideal cache that stored everything and served requests with zero overhead would 
be able to obtain a 30% hit rate and a corresponding 30% decrease in average 
query time.
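    
    As a rough sketch of where the 30% figure comes from (this is not part of 
the test suite; the function names are hypothetical, and LRU eviction is an 
assumption, since the cache in this PR may evict differently), the hit rate of 
an unbounded cache and of a size-limited one can be estimated from a list of 
queries like this:

```python
from collections import OrderedDict


def ideal_hit_rate(queries):
    """Hit rate of an unbounded cache: every repeat of a seen query is a hit."""
    queries = list(queries)
    if not queries:
        return 0.0
    return 1.0 - len(set(queries)) / len(queries)


def lru_hit_rate(queries, capacity):
    """Hit rate of a cache holding at most `capacity` results (LRU assumed)."""
    cache = OrderedDict()
    hits = total = 0
    for query in queries:
        total += 1
        if query in cache:
            hits += 1
            cache.move_to_end(query)  # mark as most recently used
        else:
            cache[query] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / total if total else 0.0
```

    With 20000 queries of which 14000 are unique, `ideal_hit_rate` gives 
1 - 14000/20000 = 0.30, which is the upper bound mentioned above.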
    
    I loaded the four SKOS datasets into named graphs of a TDB, as we usually 
do with Skosmos. I gave Fuseki 6GB of memory (`export JVM_ARGS=-Xmx6G`). I ran 
the tests on a Core i3-2330M laptop with 8GB RAM and an SSD running Ubuntu 
14.04 64-bit and OpenJDK 8. I tested five different Fuseki2 configurations:
    
    * vanilla: current 2.4.0-SNAPSHOT
    * jena-626: current snapshot after applying this PR
    * jena-626-1k: same as above, but reducing the cache size to 1000
    * varnish: same as vanilla, but with a 256MB in-memory, gzip-compressing 
Varnish cache in front
    * jena-626-varnish: same as jena-626, but with Varnish cache as above
    
    I ran the 20000 queries using a test script that prints out status 
information after every 1000 queries. This includes the average query response 
time in the previous 1000 requests and the current Fuseki memory usage, as 
indicated by the RSS value in `top` command output.
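    
    For illustration only, a measurement loop of this kind could look roughly 
like the sketch below. It is not the actual test script: `run_batches`, 
`rss_kb` and their parameters are hypothetical names, and reading `VmRSS` from 
`/proc` (Linux only) stands in for parsing `top` output.

```python
import time


def rss_kb(pid):
    """Resident set size of a process in kB, read from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return None


def run_batches(queries, execute, read_rss, batch_size=1000,
                clock=time.perf_counter):
    """Run each query and report after every `batch_size` queries.

    Returns a list of (queries_done, avg_seconds_in_batch, rss) tuples.
    """
    reports = []
    elapsed = 0.0
    for done, query in enumerate(queries, start=1):
        start = clock()
        execute(query)  # e.g. send the query to the SPARQL endpoint
        elapsed += clock() - start
        if done % batch_size == 0:
            reports.append((done, elapsed / batch_size, read_rss()))
            elapsed = 0.0  # the average covers only the batch just finished
    return reports
```

    Here `execute` would be whatever function sends one query to Fuseki, and 
`read_rss` could be something like `lambda: rss_kb(fuseki_pid)`.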
    
    The results are in a [Google 
spreadsheet](https://docs.google.com/spreadsheets/d/1JY-vNW-iEmiGGVWl-D6It9lWHR65CkkdN2mQyWycWO8/edit?usp=sharing)
 and summarized in this diagram:
    
![chart](https://cloud.githubusercontent.com/assets/1132830/14738105/76d9ae42-0888-11e6-8920-b7c81428d053.png)
    Thick lines indicate memory usage, thin lines response time. Unfortunately 
I couldn't get the diagram to show legends for all the setups - blue is 
vanilla, yellow is jena-626, purple is jena-626-1k, green is varnish, red is 
jena-626-varnish.
    
    One immediate finding is that Fuseki memory usage fluctuates a lot, so it 
is very difficult to measure accurately. I didn't have the patience to run this 
multiple times, so the values are very rough estimates. For comparison figures, 
I used the average memory usage of the 10 final measurements (i.e. when the 
10000 item cache should be full, or nearly so). Response time measurements are 
much more consistent.
    
    Compared to vanilla, the other configurations have
    * jena-626: 22% better performance, memory usage +480MB
    * jena-626-1k: 9% better performance, memory usage +110MB
    * varnish: 30% better performance, memory usage unchanged (but Varnish used 
100MB at the end)
    * jena-626-varnish: 27% better performance, memory usage +130MB (plus 
Varnish 100MB)
    
    Observations:
    * Long-term memory consumption may increase by hundreds of megabytes with 
this patch. **I think that the current cache size of 10000 items is too high.** 
A default size of 1000 would bring almost half the performance benefits with 
much lower memory usage.
    * Varnish is very close to the ideal cache here. It beats this query cache 
hands down. Also it appears that a 256MB cache using gzip compression could fit 
at least 30000 query results.
    * For setups which already use Varnish, this query cache only increases 
memory consumption, with no improvement in performance. **It should be possible 
to turn this cache off.**
    
    Other thoughts:
    * I'm quite often optimizing SPARQL queries, and when doing that I'm 
interested in worst-case (cold cache) performance. A cache like this would make 
this very difficult, as sometimes cached responses would be served very fast. I 
think that the cache should respect the `Pragma: no-cache` HTTP header as well 
as similar HTTP/1.1 `Cache-Control` headers so that it is possible to avoid 
caching for specific queries.
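    
    To make the idea concrete, here is a minimal sketch of such a header 
check. It is an assumption about how this could be done, not code from this 
PR: `wants_uncached` is a hypothetical name, and treating `no-store` and 
`max-age=0` the same as `no-cache` is a simplification.

```python
def wants_uncached(headers):
    """Return True if the request headers ask to bypass cached responses."""
    lowered = {name.lower(): value for name, value in headers.items()}

    # HTTP/1.0 style: "Pragma: no-cache"
    pragma = [token.strip().lower()
              for token in lowered.get("pragma", "").split(",")]
    if "no-cache" in pragma:
        return True

    # HTTP/1.1 style: comma-separated Cache-Control directives
    for directive in lowered.get("cache-control", "").split(","):
        name, _, value = directive.strip().lower().partition("=")
        name = name.strip()
        if name in ("no-cache", "no-store"):
            return True
        if name == "max-age" and value.strip().strip('"') == "0":
            return True
    return False
```

    A query handler could call this before looking up the cache and go 
straight to query execution whenever it returns `True`.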
    
    I put the scripts, queries and data in a 
[tarball](http://tester-os-kktest.lib.helsinki.fi/jena-626-performance.tar.gz) 
(100MB) in case someone wants to play around with the test suite. If the 
implementation of this cache is changed to e.g. store ResultSets instead of 
serializations, then the tests can be run again.


> SPARQL Query Caching
> --------------------
>
>                 Key: JENA-626
>                 URL: https://issues.apache.org/jira/browse/JENA-626
>             Project: Apache Jena
>          Issue Type: Improvement
>            Reporter: Andy Seaborne
>            Assignee: Saikat Maitra
>              Labels: java, linked_data, rdf, sparql
>
> Add a caching layer to Fuseki to cache the results of SPARQL Query requests.  
> This cache should allow for in-memory and disk-based caching, configuration 
> and cache management, and coordination with data modification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
