Hi Osma,
This is a misuse of FROM.
Using FROM is not as simple as just selecting one graph from the
dataset. It creates a completely new dataset made up of graphs picked
out of the pool of graphs in the TDB database.
The app should use GRAPH if that is what it means!
FROM is more.
FROM can name several graphs:
SELECT *
FROM <g1>
FROM <g2>
WHERE ..
meaning that the default graph is the union view of g1 and g2. And also
that there are no named graphs (no use of FROM NAMED) so using GRAPH in
the query will see nothing.
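A toy model of those two rules, as a minimal Python sketch. This is not how Jena or TDB implements it; build_dataset, pool and the triples are made up for illustration:

```python
# Toy model of a SPARQL dataset description (FROM / FROM NAMED).
# Not Jena code: all names and data here are hypothetical.

def build_dataset(pool, from_graphs=(), from_named=()):
    """Return (default_graph, named_graphs) as seen by the query."""
    # The default graph is the union (merge) of every graph listed with FROM.
    default_graph = set()
    for name in from_graphs:
        default_graph |= pool.get(name, set())
    # Only graphs listed with FROM NAMED are visible to GRAPH patterns.
    named_graphs = {name: pool.get(name, set()) for name in from_named}
    return default_graph, named_graphs

# The "pool of graphs" stored in the database.
pool = {
    "g1": {("s1", "p", "o1")},
    "g2": {("s2", "p", "o2")},
    "g3": {("s3", "p", "o3")},
}

# SELECT * FROM <g1> FROM <g2> WHERE ..
default_graph, named_graphs = build_dataset(pool, from_graphs=["g1", "g2"])
print(sorted(default_graph))  # triples of g1 and g2 together
print(named_graphs)           # empty: GRAPH has nothing to match
```

Adding from_named=["g1"] to the call would make g1 visible to GRAPH again, which corresponds to adding FROM NAMED <g1> to the query.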
As to why there is a big difference - I don't know but the right thing
is not to use FROM unless you want the union and hiding effects. It's
about features not performance (the low level optimizer is being
partially blocked - no quads).
TDB2 is not about being faster than TDB1 - it's about transactions that
scale better, being more robust, and better handling of inline datatypes.
The TDB2 bulk loader is faster ... with enough hardware much faster
(--loader=parallel) at hundreds of millions. TDB1 tdbloader2 can be
faster for multi-billion datasets.
Query-wise they are comparable.
There are two possible effects:
If you did a straight bulk load of the two datasets then TDB1 is going
to be a little faster because the dataset has a slightly better layout
initially. However, if any updates happen, the difference will fade -
it's also a reflection of the database size (470Mi) and OS caching.
It would be interesting to see the query figures if you loaded as
triples - the triples table is smaller than the quads table with one graph.
Another effect is that TDB2 has a better internal design with more
abstraction and classes. It may take more time for the JIT to fully
optimize.
> Likewise, this makes me wonder whether there has been a mild decrease in
> performance between Jena 3.8.0 and 3.16.0 - though I didn't look at
> intermediate versions to pinpoint the exact change (or several) that
> would be causing the slowdown. If there's interest, I can try other
> versions as well.
Yes please. That would be useful to know.
Andy
PS 1991-27-59 is not a date :-)
On 30/11/2020 15:34, Osma Suominen wrote:
Hello,
We're in the process of replacing an old server that was still running
Fuseki1 from Jena 3.8.0 with a TDB1 store. The new server has Fuseki2
from Jena 3.16.0 and a TDB2 store.
While testing the new server, I noticed that the new Fuseki is running a
particular SPARQL query much slower than the old one. This is a query
performed by Skosmos to find out all the letters of the alphabet for an
alphabetical index by looking at all the skos:prefLabel values in a
specific language. It's expected to be a bit slow (several seconds)
since it needs to look at all the labels - but on the new server, the
query is almost an order of magnitude slower, which is causing timeout
issues.
To investigate this more closely, I decided to drop Fuseki out of the
equation and just use Jena command line utilities. I wanted to compare
the effect of Jena versions (3.8.0 vs 3.16.0), store type (TDB1 vs
TDB2), and variations of the original SPARQL query. For the data, I used
the newly published KANTO/finaf data set (an authority file of named
entities, i.e. persons and organizations) which can be downloaded from
finto.fi [1]. It has around 3M triples, 200k skos:Concept instances and
the same number of skos:prefLabel values.
I loaded this into a TDB1 data set using Jena 3.7.0 (because of
JENA-1575) like this:
apache-jena-3.7.0/bin/tdbloader --loc tdb1 --graph http://example.org/finaf finaf-skos.ttl
Likewise, I loaded the same data set into a TDB2 store using Jena 3.16.0:
apache-jena-3.16.0/bin/tdb2.tdbloader --loc tdb2 --graph http://example.org/finaf finaf-skos.ttl
This is the original SPARQL query:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT (ucase(str(substr(?label, 1, 1))) as ?l)
FROM <http://example.org/finaf>
WHERE {
  ?c a skos:Concept .
  ?c skos:prefLabel ?label .
  FILTER(langMatches(lang(?label), 'fi'))
}
The query should return 68 results. There is no particular order since
there is no ORDER BY; the results are just a set of letters and special
characters such as numbers and punctuation.
I ran it using tdbquery / tdb2.tdbquery, separately on both Jena 3.8.0
and 3.16.0, using the options --time --repeat 2,10 (benchmark; two
rounds of warming up, ten rounds of benchmarking) and wrote down the
average query time, rounded to one decimal place. I'm doing the
benchmarks on an i5-7200U laptop with a pretty fast SSD.
Jena 3.8.0 / TDB1: 2.1s
Jena 3.16.0 / TDB1: 2.4s
Jena 3.8.0 / TDB2: 11.8s
Jena 3.16.0 / TDB2: 12.0s
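For reference, the TDB2-to-TDB1 ratios implied by these averages can be recomputed with a few lines of Python. The dictionary layout and the slowdown helper are just for illustration; the numbers are the ones listed above:

```python
# Average query times in seconds for the FROM version of the query,
# copied from the figures listed above.
from_times = {
    ("3.8.0", "TDB1"): 2.1,
    ("3.16.0", "TDB1"): 2.4,
    ("3.8.0", "TDB2"): 11.8,
    ("3.16.0", "TDB2"): 12.0,
}

def slowdown(times, version):
    """How many times longer TDB2 takes than TDB1 for one Jena version."""
    return times[(version, "TDB2")] / times[(version, "TDB1")]

for version in ("3.8.0", "3.16.0"):
    ratio = slowdown(from_times, version)
    print(f"Jena {version}: TDB2 takes {ratio:.1f}x as long as TDB1")
```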
The difference between Jena versions is not very significant, but TDB2
is 5-6 times slower than TDB1. Here is how tdbquery -v explains the
query on the TDB level:
17:06:07 INFO exec :: TDB
(distinct
  (project (?l)
    (extend ((?l (ucase (str (substr ?label 1 1)))))
      (filter (langMatches (lang ?label) "fi")
        (bgp
          (triple ?c <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept>)
          (triple ?c <http://www.w3.org/2004/02/skos/core#prefLabel> ?label)
        )))))
The explanation is identical for tdb2.tdbquery so I won't repeat it.
I then looked at ways of optimizing the query to make it perform better.
After trying many variations (for example reordering the clauses and/or
moving the substring expression to a BIND variable), the only change
that seemed to have a significant effect was to remove the FROM clause
and instead insert a GRAPH clause targeting the same graph, like this:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT (ucase(str(substr(?label, 1, 1))) as ?l)
WHERE {
  GRAPH <http://example.org/finaf> {
    ?c a skos:Concept .
    ?c skos:prefLabel ?label .
    FILTER(langMatches(lang(?label), 'fi'))
  }
}
Benchmark results for this GRAPH version of the query:
Jena 3.8.0 / TDB1: 0.9s
Jena 3.16.0 / TDB1: 1.3s
Jena 3.8.0 / TDB2: 1.4s
Jena 3.16.0 / TDB2: 1.9s
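The relative differences between these four figures can be worked out as percentages. A small Python sketch, where the helper name percent_slower and the dictionary layout are purely illustrative:

```python
# Average query times in seconds for the GRAPH version of the query,
# copied from the figures listed above.
graph_times = {
    ("3.8.0", "TDB1"): 0.9,
    ("3.16.0", "TDB1"): 1.3,
    ("3.8.0", "TDB2"): 1.4,
    ("3.16.0", "TDB2"): 1.9,
}

def percent_slower(slow, fast):
    """Percentage by which `slow` exceeds `fast`."""
    return (slow / fast - 1) * 100

# Effect of the Jena version, with the store type held fixed.
for store in ("TDB1", "TDB2"):
    pct = percent_slower(graph_times[("3.16.0", store)], graph_times[("3.8.0", store)])
    print(f"{store}: Jena 3.16.0 is {pct:.0f}% slower than 3.8.0")

# Effect of the store type, with the Jena version held fixed.
for version in ("3.8.0", "3.16.0"):
    pct = percent_slower(graph_times[(version, "TDB2")], graph_times[(version, "TDB1")])
    print(f"Jena {version}: TDB2 is {pct:.0f}% slower than TDB1")
```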
The results are much more even this time, though Jena 3.16.0 is about
40% slower than 3.8.0 and TDB2 is about 50% slower than TDB1. tdbquery
-v (and tdb2.tdbquery -v) explains the query like this:
17:13:02 INFO exec :: TDB
(distinct
  (project (?l)
    (extend ((?l (ucase (str (substr ?label 1 1)))))
      (filter (langMatches (lang ?label) "fi")
        (quadpattern
          (quad <http://example.org/finaf> ?c <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept>)
          (quad <http://example.org/finaf> ?c <http://www.w3.org/2004/02/skos/core#prefLabel> ?label)
        )))))
The difference I see compared to the previous query is the use of "quad"
instead of "triple". My understanding of operations on the TDB level is
pretty naive, but it seems to me this is now targeting the correct graph
directly, instead of indirectly, as in the first case. This is a bit
surprising to me since the "FROM <http://example.org/finaf>" clause in
the first query is, to me, saying the same thing as the GRAPH clause:
just target triples in this particular graph. Is there a missed
opportunity for some optimization here? Why is FROM (much) worse than
GRAPH?
I also wonder why TDB2 is so much slower than TDB1, especially for the
first version of the query. It should be an improvement, right? Of
course there are trade-offs in implementing any complex system. But it
makes me wonder whether we should stick to TDB1 for the time being, as
there are no obvious benefits in using TDB2 for our current use.
Likewise, this makes me wonder whether there has been a mild decrease in
performance between Jena 3.8.0 and 3.16.0 - though I didn't look at
intermediate versions to pinpoint the exact change (or several) that
would be causing the slowdown. If there's interest, I can try other
versions as well.
For now we will probably just change Skosmos to use the GRAPH variant of
the query, which should fix the immediate problems with timeouts.
Unfortunately I don't have the skills to work directly on the ARQ
optimizer or TDB2 code bases. But I'd be happy to test other variations
and potential fixes to these performance problems.
Cheers,
Osma
[1] https://finto.fi/rest/v1/finaf/data?format=text/turtle