Re: Performance regressions in Jena and TDB2

Andy Seaborne Mon, 30 Nov 2020 12:50:02 -0800

Hi Osma,

This is a misuse of FROM.

Using FROM is not as simple as just selecting one graph from the dataset. It creates a completely new dataset composed of graphs picked out of the pool of graphs in the TDB database.


The app should use GRAPH if that is what it means!

FROM does more than that.
FROM can name several graphs:

SELECT *
FROM <g1>
FROM <g2>
WHERE ..

meaning that the default graph is the union view of g1 and g2. It also means that there are no named graphs (no use of FROM NAMED), so using GRAPH in the query will see nothing.
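The dataset construction described above can be sketched with a toy model. This is only an illustration of the FROM / FROM NAMED semantics, not how TDB or ARQ implement it; the function name `dataset_from`, the `pool` dictionary and the sample triples are all hypothetical.

```python
# Toy model of a SPARQL query dataset: one default graph plus named graphs.
# Triples are plain tuples; this is NOT how TDB stores or queries data.

def dataset_from(pool, from_graphs=(), from_named=()):
    """Build the query dataset described by FROM / FROM NAMED clauses.

    pool: dict mapping graph name -> set of triples (the stored graphs).
    """
    default_graph = set()
    for name in from_graphs:
        # Every FROM graph is merged into the default graph (union view).
        default_graph |= pool.get(name, set())
    # Only graphs listed with FROM NAMED are visible to GRAPH patterns.
    named = {name: pool.get(name, set()) for name in from_named}
    return {"default": default_graph, "named": named}


# Hypothetical pool of three stored graphs.
pool = {
    "g1": {("s1", "p", "o1")},
    "g2": {("s2", "p", "o2")},
    "g3": {("s3", "p", "o3")},
}

# Equivalent of: SELECT * FROM <g1> FROM <g2> WHERE ..
ds = dataset_from(pool, from_graphs=["g1", "g2"])
print(sorted(ds["default"]))  # triples of g1 and g2 merged into one graph
print(ds["named"])            # no named graphs, so GRAPH matches nothing
```

In this model g3 is hidden entirely and g1 and g2 lose their identity as separate graphs, which is the "union and hiding" effect.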

As to why there is a big difference - I don't know, but the right thing is not to use FROM unless you want the union and hiding effects. It's about features, not performance (the low-level optimizer is being partially blocked - no quads).

TDB2 is not about being faster than TDB1 - it's about better scaling of transactions, more robustness, and better handling of inline datatypes. The TDB2 bulk loader is faster ... with enough hardware, much faster (--loader=parallel) at hundreds of millions of triples. TDB1 tdbloader2 can be faster for multi-billion-triple datasets.


Query-wise they are comparable.

There are two possible effects:

If you did a straight bulk load of the two datasets then TDB1 is going to be a little faster because the dataset has a slightly better layout initially. However, if any updates happen, the difference will fade - it's also a reflection of the database size (470Mi) and OS caching.

It would be interesting to see the query figures if you loaded as triples - the triples table is smaller than the quads table with one graph.

Another effect is that TDB2 has a better internal design with more abstraction and classes. It may take more time for the JIT to fully optimize.

Likewise, this makes me wonder whether there has been a mild decrease in performance between Jena 3.8.0 and 3.16.0 - though I didn't look at intermediate versions to pinpoint the exact change (or several) that would be causing the slowdown. If there's interest, I can try other versions as well.


Yes please. That would be useful to know.

    Andy

PS 1991-27-59 is not a date :-)


On 30/11/2020 15:34, Osma Suominen wrote:

Hello,
We're in the process of replacing an old server that was still running Fuseki1 from Jena 3.8.0 with a TDB1 store. The new server has Fuseki2 from Jena 3.16.0 and a TDB2 store.
While testing the new server, I noticed that the new Fuseki is running a particular SPARQL query much slower than the old one. This is a query performed by Skosmos to find out all the letters of the alphabet for an alphabetical index by looking at all the skos:prefLabel values in a specific language. It's expected to be a bit slow (several seconds) since it needs to look at all the labels - but on the new server, the query is almost an order of magnitude slower, which is causing timeout issues.
To investigate this more closely, I decided to drop Fuseki out of the equation and just use Jena command line utilities. I wanted to compare the effect of Jena versions (3.8.0 vs 3.16.0), store type (TDB1 vs TDB2), and variations of the original SPARQL query. For the data, I used the newly published KANTO/finaf data set (an authority file of named entities, i.e. persons and organizations) which can be downloaded from finto.fi [1]. It has around 3M triples, 200k skos:Concept instances and the same number of skos:prefLabel values.
I loaded this into a TDB1 data set using Jena 3.7.0 (because of JENA-1575) like this:
apache-jena-3.7.0/bin/tdbloader --loc tdb1 --graph http://example.org/finaf finaf-skos.ttl
Likewise, I loaded the same data set into a TDB2 store using Jena 3.16.0:
apache-jena-3.16.0/bin/tdb2.tdbloader --loc tdb2 --graph http://example.org/finaf finaf-skos.ttl
This is the original SPARQL query:

     PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

     SELECT DISTINCT (ucase(str(substr(?label, 1, 1))) as ?l)
     FROM <http://example.org/finaf>
     WHERE {
       ?c a skos:Concept .
       ?c skos:prefLabel ?label .
       FILTER(langMatches(lang(?label), 'fi'))
     }
The query should return 68 results. There is no particular order since there is no ORDER BY, they are just a set of letters and special characters such as numbers and punctuation.
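As a rough illustration of what the query computes, here is a minimal Python sketch of the same logic outside SPARQL. The sample labels are hypothetical and not taken from the finaf data, and `lang_matches` only approximates the basic language-range matching that langMatches performs.

```python
def lang_matches(tag, lang_range):
    """Approximate langMatches(): basic language-range filtering."""
    tag, lang_range = tag.lower(), lang_range.lower()
    if lang_range == "*":
        return tag != ""  # "*" matches any non-empty language tag
    # Either an exact match or a prefix match on a subtag boundary.
    return tag == lang_range or tag.startswith(lang_range + "-")


def first_letters(labels, lang_range):
    """Distinct upper-cased first characters of labels in a language."""
    return {
        text[:1].upper()
        for text, tag in labels
        if lang_matches(tag, lang_range)
    }


# Hypothetical (label, language tag) pairs for illustration only.
labels = [
    ("Aalto, Alvar", "fi"),
    ("aalto-yliopisto", "fi"),
    ("Åbo Akademi", "sv"),
    ("1. divisioona", "fi-FI"),
]

print(first_letters(labels, "fi"))
```

Like the SELECT DISTINCT result, the returned set has no defined order.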
I ran it using tdbquery / tdb2.tdbquery, separately on both Jena 3.8.0 and 3.16.0, using the options --time --repeat 2,10 (benchmark; two rounds of warming up, ten rounds of benchmarking) and wrote down the average query time, rounded to one decimal place. I'm doing the benchmarks on an i5-7200U laptop with a pretty fast SSD.
Jena 3.8.0 / TDB1: 2.1s
Jena 3.16.0 / TDB1: 2.4s
Jena 3.8.0 / TDB2: 11.8s
Jena 3.16.0 / TDB2: 12.0s
The difference between Jena versions is not very significant, but TDB2 is 5-6 times slower than TDB1. Here is how tdbquery -v explains the query on the TDB level:
17:06:07 INFO  exec            :: TDB
   (distinct
     (project (?l)
       (extend ((?l (ucase (str (substr ?label 1 1)))))
         (filter (langMatches (lang ?label) "fi")
           (bgp
             (triple ?c <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept>)
             (triple ?c <http://www.w3.org/2004/02/skos/core#prefLabel> ?label)
           )))))

The explanation is identical for tdb2.tdbquery so I won't repeat it.
I then looked at ways of optimizing the query to make it perform better. After trying many variations (for example reordering the clauses and/or moving the substring expression to a BIND variable), the only change that seemed to have a significant effect was to remove the FROM clause and instead insert a GRAPH clause targeting the same graph, like this:
     PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

     SELECT DISTINCT (ucase(str(substr(?label, 1, 1))) as ?l)
     WHERE {
       GRAPH <http://example.org/finaf> {
         ?c a skos:Concept .
         ?c skos:prefLabel ?label .
         FILTER(langMatches(lang(?label), 'fi'))
       }
     }

Benchmark results for this GRAPH version of the query:

Jena 3.8.0 / TDB1: 0.9s
Jena 3.16.0 / TDB1: 1.3s
Jena 3.8.0 / TDB2: 1.4s
Jena 3.16.0 / TDB2: 1.9s
The results are much more even this time, though Jena 3.16.0 is about 40% slower than 3.8.0 and TDB2 is about 50% slower than TDB1. tdbquery -v (and tdb2.tdbquery -v) explains the query like this:
17:13:02 INFO  exec            :: TDB
   (distinct
     (project (?l)
       (extend ((?l (ucase (str (substr ?label 1 1)))))
         (filter (langMatches (lang ?label) "fi")
           (quadpattern
             (quad <http://example.org/finaf> ?c <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept>)
             (quad <http://example.org/finaf> ?c <http://www.w3.org/2004/02/skos/core#prefLabel> ?label)
           )))))
The difference I see compared to the previous query is the use of "quad" instead of "triple". My understanding of operations on the TDB level is pretty naive, but it seems to me this is now targeting the correct graph directly, instead of indirectly, as in the first case. This is a bit surprising to me since the "FROM <http://example.org/finaf>" clause in the first query is, to me, saying the same thing as the GRAPH clause: just target triples in this particular graph. Is there a missed opportunity for some optimization here? Why is FROM (much) worse than GRAPH?
I also wonder why TDB2 is so much slower than TDB1, especially for the first version of the query. It should be an improvement, right? Of course there are trade-offs in implementing any complex system. But it makes me wonder whether we should stick to TDB1 for the time being, as there are no obvious benefits in using TDB2 for our current use.
Likewise, this makes me wonder whether there has been a mild decrease in performance between Jena 3.8.0 and 3.16.0 - though I didn't look at intermediate versions to pinpoint the exact change (or several) that would be causing the slowdown. If there's interest, I can try other versions as well.
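For reference, the slowdown ratios behind these comparisons can be rechecked from the averages reported above. This is just a small arithmetic sketch; the timings are the ones listed in this message, while the dictionary layout and the `ratio` helper are illustrative choices.

```python
# Average query times in seconds, as reported in the two benchmark runs.
timings = {
    "FROM": {
        ("3.8.0", "TDB1"): 2.1,
        ("3.16.0", "TDB1"): 2.4,
        ("3.8.0", "TDB2"): 11.8,
        ("3.16.0", "TDB2"): 12.0,
    },
    "GRAPH": {
        ("3.8.0", "TDB1"): 0.9,
        ("3.16.0", "TDB1"): 1.3,
        ("3.8.0", "TDB2"): 1.4,
        ("3.16.0", "TDB2"): 1.9,
    },
}


def ratio(times, slow, fast):
    """How many times slower the 'slow' configuration is than 'fast'."""
    return times[slow] / times[fast]


for query, times in timings.items():
    for version in ("3.8.0", "3.16.0"):
        r = ratio(times, (version, "TDB2"), (version, "TDB1"))
        print(f"{query}: TDB2 vs TDB1 on Jena {version}: {r:.2f}x")
    for store in ("TDB1", "TDB2"):
        r = ratio(times, ("3.16.0", store), ("3.8.0", store))
        print(f"{query}: Jena 3.16.0 vs 3.8.0 on {store}: {r:.2f}x")
```

Since the inputs were already rounded to one decimal place, the ratios for the sub-second and one-second timings are only approximate.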
For now we will probably just change Skosmos to use the GRAPH variant of the query, which should fix the immediate problems with timeouts. Unfortunately I don't have the skills to work directly on the ARQ optimizer or TDB2 code bases. But I'd be happy to test other variations and potential fixes to these performance problems.
Cheers,
Osma

[1] https://finto.fi/rest/v1/finaf/data?format=text/turtle
