Re: Performance regressions in Jena and TDB2

Andy Seaborne Tue, 01 Dec 2020 13:15:52 -0800



On 01/12/2020 14:50, Osma Suominen wrote:

Hi Andy!

Thanks again for your insight on this.

Andy Seaborne wrote on 30.11.2020 at 22.49:
This is misuse of FROM.
Using FROM is not as simple as just selecting one graph from the dataset. It is creating a complete new dataset comprised of graphs picked out of a pool of graphs in the TDB database.
The app should use GRAPH if that is what it means!
It's a bit harsh to call this "misuse" of FROM, but I see your point. I've been using SPARQL for over 10 years and I still get tripped up by things that work (or not) in surprising ways...
Anyway I created a Skosmos PR [1] that switches this query to use GRAPH, so the performance issue goes away. We may have to look at other queries as well to see if they can be optimized by avoiding FROM - especially now that we use TDB2 which seems to be more sensitive about this.
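For illustration only, a single-graph query written with GRAPH might look like the sketch below. The graph IRI <g1> and the triple pattern are placeholders, not the actual Skosmos query changed in the PR.

```sparql
# Sketch: match only inside the named graph <g1> of the stored dataset.
# <g1> and the triple pattern are placeholders for illustration.
SELECT *
WHERE {
  GRAPH <g1> { ?s ?p ?o }
}
```

Here the pattern is matched against the named graph <g1> as it sits in the store, rather than against a new dataset assembled by FROM.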

There is no reason I can see why the special case of exactly one "FROM" can't be handled specially. It masks all named graphs but is a rewrite from triples; that will be fine.

FROM is more.
FROM can be several graphs

SELECT *
FROM <g1>
FROM <g2>
WHERE ..
meaning that the default graph is the union view of g1 and g2. And also that there are no named graphs (no use of FROM NAMED) so using GRAPH in the query will see nothing.
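A small sketch of that hiding effect, reusing the placeholder graph names <g1> and <g2> from the query above (the triple pattern is an assumption added for illustration). Because the query has FROM but no FROM NAMED, the dataset it runs against has no named graphs, so the GRAPH pattern has nothing to match.

```sparql
# Sketch: FROM without FROM NAMED leaves the query dataset with no named graphs.
# The default graph is the union of <g1> and <g2>, but GRAPH ?g looks only at
# named graphs, so this pattern is expected to match nothing.
SELECT *
FROM <g1>
FROM <g2>
WHERE {
  GRAPH ?g { ?s ?p ?o }
}
```

Dropping the GRAPH wrapper, so the pattern is just { ?s ?p ?o }, would instead match against the union default graph.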
From a naive user perspective, here are a few reasons to prefer FROM over GRAPH:
1. The syntax is more convenient and minimal. To target a specific graph, add a FROM clause; to target the default graph, just remove it (or comment out). GRAPH is more cumbersome since it adds another nesting layer.
2. It's more versatile, as you say (though in this case the query didn't really make use of the additional flexibility).
3. It looks more familiar if you know SQL.
As to why there is a big difference - I don't know but the right thing is not to use FROM unless you want the union and hiding effects. It's about features not performance (the low level optimizer is being partially blocked - no quads).
I see, that's good to know - but IMHO not obvious just reading the SPARQL query spec or e.g. the Jena tutorial on SPARQL datasets [2].
Just out of curiosity, I loaded the same data set into an RDF4J native store (default settings) and tested both the FROM and GRAPH variants of the query. There was no measurable difference in execution time: both queries took 3.3 seconds. So this is slower than TDB1, and TDB2 with GRAPH, but a lot faster than TDB2 for the FROM variant. Apparently RDF4J doesn't have such a big difference between FROM and GRAPH (but this observation is based only on a single data point!).

RDF4J has a different graph storage. In a way, it is always "FROM"-like. I don't know if it is the same full FROM/FROM NAMED feature.

The spec is carefully worded by design to allow the different systems, RDF4J included, to operate in a compliant manner. Systems already had different behaviours.

TDB2 is not about being faster than TDB1 - it's about better scale of transactions, being more robust and better handling of inline datatypes. The TDB2 bulk loader is faster ... with enough hardware much faster (--loader=parallel) at 100s of millions. TDB1 tdbloader2 can be faster for multi-billion datasets.
Query-wise they are comparable.

There are two possible effects:
If you did a straight bulk load of the two datasets then TDB1 is going to be a little faster because the dataset has a slightly better layout initially. However, if any updates happen, the difference will fade - it's also a reflection of the database size (470Mi) and OS caching.
Understood. I also agree about TDB2 having a better design and especially the improved transaction support is valuable. I don't think we want to go back to TDB1.
It would be interesting to see the query figures if you loaded as triples - the triples table is smaller than the quads table with one graph.
Tested this briefly using Jena 3.16.0 only. I loaded the file into the default graph of both a TDB1 and a TDB2 store, and timed the query which of course had neither FROM nor GRAPH. I noticed that there was quite a bit of variance in the timings I did yesterday so I went for another measurement approach this time - I used 10 repetitions (without warmup) and wrote down the minimum time from the 10 query executions.
TDB1: 0.825s
TDB2: 1.127s

No real difference to the quad version using GRAPH (see below).


For me:
About 300-400ms is going in the SELECT expression.

Another effect is that TDB2 has a better internal design with more abstraction and classes. It may take more time for the JIT to fully optimize.
When using --repeat 10, the first query takes much longer but after that there is no obvious pattern for the next queries. Some are faster than others, but the second one is already as fast as any of the others.

The query timing has a bug - it does not preload the result set formatting during the warmup so classloading (and JIT but that is a lesser issue) happen during the main execution. Approximately, junk the first timed run.


The rest go up and down because:
(1) other stuff on the machine going tick
(2) bits of successive JIT happening

Likewise, this makes me wonder whether there has been a mild decrease in performance between Jena 3.8.0 and 3.16.0 - though I didn't look at intermediate versions to pinpoint the exact change (or several) that would be causing the slowdown. If there's interest, I can try other versions as well.
Yes please. That would be useful to know.
OK, I did some additional measurements. I downloaded all releases from 3.8.0 up to 3.16.0 and benchmarked the GRAPH version of the query on both TDB1 and TDB2 stores. Again this is the minimum time for 10 repetitions.
Sidenote: It's quite difficult to do accurate benchmarking on a laptop.


Yes, isn't it :-)

There are other applications both competing for RAM and also waking up in the background (Chrome! Thunderbird!), and the OS doing file system cache maintenance.

I tried to eliminate interference from other applications, but there are still aspects that are hard to control such as CPU frequency, which may vary e.g. by temperature.
Here are the results:

TDB1

3.8.0: 0.904s
3.9.0: 0.901s
3.10.0: 0.874s
3.11.0: 0.906s
3.12.0: 0.871s
3.13.0: 0.903s
3.14.0: 0.917s
3.15.0: 0.916s
3.16.0: 0.892s


TDB2

3.8.0: 1.300s
3.9.0: 1.169s
3.10.0: 1.319s
3.11.0: 1.147s
3.12.0: 1.264s
3.13.0: 1.108s
3.14.0: 1.158s
3.15.0: 1.323s
3.16.0: 1.285s
There is no clear difference between versions this time (though TDB2 is a bit slower overall than TDB1). It looks like the 40% decrease in performance between 3.8.0 and 3.16.0 that I reported yesterday was a measurement error; though it's conceivable that the average query time has increased even if the minimum times have not. Nevertheless I don't think this requires further study.
PS 1991-27-59 is not a date :-)
I know :) This is what happens when you convert 200k MARC records from a legacy system into RDF - errors in the data that nobody noticed before will suddenly crop up. There is also a bad URI value starting with a colon, though Jena doesn't complain loudly about it (RDF4J chokes though). These have already been fixed in the original records, so they will be corrected soon in the RDF data set as well when we regenerate it in the next few days.
-Osma

[1] https://github.com/NatLibFi/Skosmos/pull/1098

[2] https://jena.apache.org/tutorials/sparql_datasets.html
