[ https://issues.apache.org/jira/browse/JENA-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253687#comment-15253687 ]
ASF GitHub Bot commented on JENA-626:
-------------------------------------

Github user osma commented on the pull request:

https://github.com/apache/jena/pull/95#issuecomment-213367009

I got interested in how this would affect our Fuseki performance if it were merged. This became quite an experiment and I'm reporting the results here.

First of all I needed a realistic dataset and sets of queries. I decided to use four of the SKOS datasets we serve at Finto.fi, together with SPARQL queries performed by the Skosmos application. I used the Skosmos [performance test suite](https://github.com/NatLibFi/Skosmos/tree/master/tests/performance) to generate a set of 36k SPARQL queries that do not involve the jena-text index (for simplicity), of which 22% are CONSTRUCT queries and the remainder are SELECTs. There are 24k unique SPARQL queries in that set and they should be a fairly realistic approximation of queries that we actually encounter.

To give caching a chance, I considered each query twice and then took a random sample of 20k of the resulting 72k queries. This revised set has 14k unique queries, i.e. 30% of the queries are repetitions of previously seen queries. This is a rather difficult data set for caching, since there is so much variation - enough to completely fill the cache of 10000 query results. An ideal cache that stored everything and served requests with zero overhead would be able to obtain a 30% hit rate and a corresponding 30% decrease in average query time.

I loaded the four SKOS datasets into named graphs of a TDB, as we usually do with Skosmos. I gave Fuseki 6GB of memory (`export JVM_ARGS=-Xmx6G`). I ran the tests on a Core i3-2330M laptop with 8GB RAM and an SSD running Ubuntu 14.04 64bit and OpenJDK 8.
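For readers who want to reproduce the sampling step described above (each query counted twice, then a random sample drawn, then the share of repeated queries measured), here is a minimal sketch. It is not the script from the test suite; the function names `build_workload` and `ideal_hit_rate`, the seed, and the toy query strings are all assumptions made for illustration.

```python
import random


def build_workload(queries, sample_size, seed=0):
    """Count every query twice, then draw a random sample without replacement.

    Hypothetical helper: mirrors the "consider each query twice, then
    sample" procedure, not the actual test script.
    """
    doubled = list(queries) * 2
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rng.sample(doubled, sample_size)


def ideal_hit_rate(workload):
    """Hit rate of an unbounded cache: share of requests seen earlier."""
    if not workload:
        return 0.0
    seen = set()
    hits = 0
    for query in workload:
        if query in seen:
            hits += 1
        else:
            seen.add(query)
    return hits / len(workload)


if __name__ == "__main__":
    # Toy stand-ins for real SPARQL queries (assumption, not the real data).
    toy_queries = [f"SELECT * WHERE {{ ?s ?p ?o }} LIMIT {i}" for i in range(1, 101)]
    workload = build_workload(toy_queries, sample_size=60)
    print(f"unique queries: {len(set(workload))}")
    print(f"ideal hit rate: {ideal_hit_rate(workload):.2%}")
```

The ideal hit rate equals one minus the ratio of unique queries to total queries, which is how 14k unique queries out of 20k corresponds to the 30% figure.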
I tested five different Fuseki2 configurations:

* vanilla: current 2.4.0-SNAPSHOT
* jena-626: current snapshot after applying this PR
* jena-626-1k: same as above, but reducing the cache size to 1000
* varnish: same as vanilla, but with a 256MB in-memory, gzip-compressing Varnish cache in front
* jena-626-varnish: same as jena-626, but with Varnish cache as above

I ran the 20000 queries using a test script that prints out status information after every 1000 queries. This includes the average query response time in the previous 1000 requests and the current Fuseki memory usage, as indicated by the RSS value in `top` command output.

The results are in a [Google spreadsheet](https://docs.google.com/spreadsheets/d/1JY-vNW-iEmiGGVWl-D6It9lWHR65CkkdN2mQyWycWO8/edit?usp=sharing) and summarized in this diagram:



Thick lines indicate memory usage, thin lines response time. Unfortunately I couldn't get the diagram to show legends for all the setups - blue is vanilla, yellow is jena-626, purple is jena-626-1k, green is varnish, red is jena-626-varnish.

One immediate finding is that Fuseki memory usage fluctuates a lot, so it is very difficult to measure accurately. I didn't have the patience to run this multiple times, so the values are very rough estimates. For comparison figures, I used the average memory usage of the 10 final measurements (i.e. when the 10000 item cache should be full, or nearly so). Response time measurements are much more consistent.

Compared to vanilla, the other configurations have

* jena-626: 22% better performance, memory usage +480MB
* jena-626-1k: 9% better performance, memory usage +110MB
* varnish: 30% better performance, memory usage unchanged (but Varnish used 100MB at the end)
* jena-626-varnish: 27% better performance, memory usage +130MB (plus Varnish 100MB)

Observations:

* Long-term memory consumption may increase by hundreds of megabytes with this patch.
  **I think that the current cache size of 10000 items is too high.** A default size of 1000 would bring almost half the performance benefits with much lower memory usage.
* Varnish is very close to the ideal cache here. It beats this query cache hands down. Also it appears that a 256MB cache using gzip compression could fit at least 30000 query results.
* For setups which already use Varnish, this query cache only increases memory consumption, with no improvement in performance. **It should be possible to turn this cache off.**

Other thoughts:

* I'm quite often optimizing SPARQL queries, and when doing that I'm interested in worst-case (cold cache) performance. A cache like this would make that very difficult, as cached responses would sometimes be served very fast. I think that the cache should respect the `Pragma: no-cache` HTTP header as well as the similar HTTP 1.1 `Cache-Control` headers so that it is possible to avoid caching for specific queries.

I put the scripts, queries and data in a [tarball](http://tester-os-kktest.lib.helsinki.fi/jena-626-performance.tar.gz) (100MB) in case someone wants to play around with the test suite. If the implementation of this cache is changed to e.g. store ResultSets instead of serializations, then the tests can be run again.

> SPARQL Query Caching
> --------------------
>
>                 Key: JENA-626
>                 URL: https://issues.apache.org/jira/browse/JENA-626
>             Project: Apache Jena
>          Issue Type: Improvement
>            Reporter: Andy Seaborne
>            Assignee: Saikat Maitra
>              Labels: java, linked_data, rdf, sparql
>
> Add a caching layer to Fuseki to cache the results of SPARQL Query requests.
> This cache should allow for in-memory and disk-based caching, configuration
> and cache management, and coordination with data modification.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)