On 01/12/2020 14:50, Osma Suominen wrote:
Hi Andy!

Thanks again for your insight on this.

Andy Seaborne kirjoitti 30.11.2020 klo 22.49:
This is misused of FROM.

Using FROM is not as simple as just selection one graph from the dataset. It is creating a complete new dataset comprised of graph picked out of a pool of graph in the TDB database.

The app should use GRAPH if that is what it means!

It's a bit harsh to call this "misuse" of FROM, but I see your point. I've been using SPARQL for over 10 years and I still get tripped up by things that work (or not) in surprising ways...

Anyway I created a Skosmos PR [1] that switches this query to use GRAPH, so the performance issue goes away. We may have to look at other queries as well to see if they can be optimized by avoiding FROM - especially now that we use TDB2 which seems to be more sensitive about this.

There is no reason I can see why the special case of exactly one "FROM" can't be handled specially. It masks all named graphs but is a rewrite from triples, that will be fine.

FROM is more.
FROM can be several graphs

SELECT *
FROM <g1>
FROM <g2>
WHERE ..

meaning that the default graph is the union view of g1 and g2.  And also that there are no named graphs (no use of FROM NAMED) so using GRAPH in the query will see nothing.

From a naive user perspective, here are a few reasons to prefer FROM over GRAPH:

1. The syntax is more convenient and minimal. To target a specific graph, add a FROM clause; to target the default graph, just remove it (or comment out). GRAPH is more cumbersome since it adds another nesting layer.

2. It's more versatile, as you say (though in this case the query didn't really make use of the additional flexibility)

3. It looks more familiar if you know SQL.

As to why there is a big difference - I don't know but the right thing is not to use FROM unless you want the union and hiding effects. It's about features not performance (the low level optimizer is being partially blocked - no quads).

I see, that's good to know - but IMHO not obvious just reading the SPARQL query spec or e.g. the Jena tutorial on SPARQL datasets [2].

Just out of curiosity, I loaded the same data set into an RDF4J native store (default settings) and tested both the FROM and GRAPH variants of the query. There was no measurable difference in execution time: both queries took 3.3 seconds. So this is slower than TDB1, and TDB2 with GRAPH, but a lot faster than TDB2 for the FROM variant. Apparently RDF4J doesn't have such a big difference between FROM and GRAPH (but this observation is based only on a single data point!).

RDF4J has a different graph storage. In a way, it always "FROM" like. I don't know if it is the same full FROM/FROM NAMED feature.

The spec is carefully worded by design to allow the different systems, RDF4J included to operate in a compliant manner. Systems already had different behaviours.


TDB2 is not about being faster than TDB1 - it's about better scale of transactions, more robust and better handling of inline datatypes. The TDB2 bulk loader is faster ... with enough hardware much faster (--loader=parallel) at 100's millions. TDB1 tdbloader2 can be faster for multibillions datasets.

Query-wise they are comparable.

There are two possible effects:

If you did a straight bulk load of the two datasets then TDB1 is going to be a little faster because the dataset has a slightly better layout initially. However, if any updates happen, the difference will fade - it's also a reflection of the data base size (470Mi) and OS caching.

Understood. I also agree about TDB2 having a better design and especially the improved transaction support is valuable. I don't think we want to go back to TDB1.

It would be interesting to see the query figures if you loaded as triples - the triples table is smaller than the quads table with one graph.

Tested this briefly using Jena 3.16.0 only. I loaded the file into the default graph of both a TDB1 and a TDB2 store, and timed the query which of course had neither FROM nor GRAPH. I noticed that there was quite a bit of variance in the timings I did yesterday so I went for another measurement approach this time - I used 10 repetitions (without warmup) and wrote down the minimum time from the 10 query executions.

TDB1: 0.825s
TDB2: 1.127s

No real difference to the quad version using GRAPH (see below).

For me:
About 300-400ms is going in the SELECT expression.


Another effect is that DB2 has a better internal design with more abstraction and classes. It may take more time for the JIT to fully optimize.

When using --repeat 10, the first query takes much longer but after that there is no obvious pattern for the next queries. Some are faster than others, but the second one is already as fast as any of the others.

The query timing has a bug - it does not preload the result set formatting during the warmup so classloading (and JIT but that is a lesser issue) happen during the main execution. Approximately, junk the first timed run.

The rest go up and down because :
(1) other stuff on the machine going tick
(2) bits of successive JIT happening


Likewise, this makes me wonder whether there has been a mild decrease in performance between Jena 3.8.0 and 3.16.0 - though I didn't look at intermediate versions to pinpoint the exact change (or several) that would be causing the slowdown. If there's interest, I can try other versions as well.

Yes please. That would be useful to know.

OK, I did some additional measurements. I downloaded all releases from 3.8.0 up to 3.16.0 and benchmarked the GRAPH version of the query on both TDB1 and TDB2 stores. Again this is the minimum time for 10 repetitions.

Sidenote: It's quite difficult to do accurate benchmarking on a laptop.

Yes, isn't it :-)

There are other application both competign for RAM and also waking up in the background (Chrome! Thunderbird!), and the OS doing file system cache maintenance.

I tried to eliminate interference from other applications, but there are still aspects that are hard to control such as CPU frequency, which may vary e.g. by temperature.

Here are the results:

TDB1

3.8.0: 0.904s
3.9.0: 0.901s
3.10.0: 0.874s
3.11.0: 0.906s
3.12.0: 0.871s
3.13.0: 0.903s
3.14.0: 0.917s
3.15.0: 0.916s
3.16.0: 0.892s


TDB2

3.8.0: 1.300s
3.9.0: 1.169s
3.10.0: 1.319s
3.11.0: 1.147s
3.12.0: 1.264s
3.13.0: 1.108s
3.14.0: 1.158s
3.15.0: 1.323s
3.16.0: 1.285s

There is no clear difference between versions this time (though TDB2 is a bit slower overall than TDB1). It looks like the 40% decrease in performance between 3.8.0 and 3.16.0 that I reported yesterday was a measurement error; though it's conceivable that the average query time has increased even if the minimum times have not. Nevertheless I don't think this requires further study.

PS 1991-27-59 is not a date :-)

I know :) This is what happens when you convert 200k MARC records from a legacy system into RDF - errors in the data that nobody noticed before will suddenly crop up. There is also a bad URI value starting with a colon, though Jena doesn't complain loudly about it (RDF4J chokes though). These have already been fixed in the original records, so they will be corrected soon in the RDF data set as well when we regenerate it in the next few days.

-Osma

[1] https://github.com/NatLibFi/Skosmos/pull/1098

[2] https://jena.apache.org/tutorials/sparql_datasets.html



Reply via email to