TL;DR: It’s a workaround for this issue because it can force the optimiser to behave differently. However, it should be used sparingly, as overuse may prevent other optimisations that could yield more benefit than the workaround gains you.
See Andy’s recent email [1]: the offending optimisation will be disabled by default in future releases, so the workaround will not be needed longer term.

----

The long-winded details for those interested (although I’m still glossing over lots of low-level details)…

There are two levels of query optimisation in ARQ:

1. Logical optimisation (sometimes referred to as algebra optimisation)
2. Execution optimisation

The logical optimiser works at the SPARQL Algebra level and looks to make transformations to the algebra that are known to improve performance based on experience, past research etc. In doing so the optimiser has to ensure that those transformations are semantically safe, i.e., they MUST NOT change the overall semantics of the query and MUST result in the same answers as the original query. Therefore, many of these optimisations are applied quite conservatively, so if ARQ cannot determine that a given transformation would be semantically safe it won’t apply it.

Additionally, in some cases these optimisations are specifically intended to be chained, i.e., doing one optimisation may enable further optimisations, thus the ARQ optimiser applies the various transformations in a specific order. See https://github.com/apache/jena/blob/main/jena-arq/src/main/java/org/apache/jena/sparql/algebra/optimize/OptimizerStd.java if you want to see the raw details of this and some explanatory comments around the ordering.

Some of these logical transformations are also done to enable execution optimiser behaviour later in query evaluation, e.g., join linearisation.

The downside of the logical optimiser is that it works purely by static analysis of the algebra, i.e., without reference to the dataset against which the query will ultimately be evaluated. This means that sometimes it can make decisions that are good for the general case BUT bad for some datasets.

The execution optimiser is a whole bunch of things done during actual execution to improve performance.
This includes everything from Jena’s streaming-based iterator implementation of query execution (effectively a Volcano-based evaluation model [2]), ARQ’s join linearisation operators, TDB’s low-level Node IDs and direct expression evaluation over those, Node ID to RDF Term caching (and vice versa), memory-mapping of database indices etc.

Execution also includes BGP reordering: when the query evaluator gets a BGP to evaluate, it can choose to apply reordering to the triple patterns within that BGP. For TDB this is controlled by the presence, or lack thereof, of a stats.opt, fixed.opt or none.opt file in the database directory. Having a relevant file present should apply further execution-time BGP reordering that can be statistics aware and thus avoids the issue of the logical BGP reordering. Since, during execution of a single BGP, bindings from earlier triple patterns are used to restrict the searches made for subsequent triple patterns, the order of execution of the triple patterns can be important, especially if one triple pattern has many matches.

However, if you are querying an in-memory dataset instead of TDB, then you may not be getting any execution-time BGP reordering, so you’re left evaluating the triple patterns according to the logical BGP reordering, which may turn out to be sub-optimal depending on your dataset.

---

The specific problem discussed in this thread is due to a new optimisation that was introduced in Jena 4.5.0 (BGP reordering during logical optimisation). This was an optimisation that was shown to improve performance on some benchmarks, as it enables more aggressive application of another optimisation (filter placement). The reason it did the opposite on some users’ datasets is that it is done without any knowledge of the data (as it’s a logical optimisation) and can result in breaking up a BGP into separate BGPs, causing less specific triple patterns to be evaluated prior to more specific ones.
Or, where no execution-time BGP reordering occurs, it can leave the triple patterns in a sub-optimal order for evaluation even if BGPs are not split.

The short-term fix (again see Andy’s email [1]) is to disable this optimisation by default; users can opt back into it if they find it benefits their usage of Jena on their datasets. The long-term fix is probably to rearchitect the logical optimiser in some way to allow more data context to be visible to it, i.e., making the logical BGP reordering statistics aware and making ARQ’s overall optimisation strategy more hybrid.

If anyone is interested, I’d imagine there’ll be a thread on this on the dev list soon.

Hope this helps,

Rob

[1]: https://lists.apache.org/thread/37cloogcb3wzmkl0s33ttnxyg0kvq69p
[2]: http://daslab.seas.harvard.edu/reading-group/papers/volcano.pdf

From: Mikael Pesonen <[email protected]>
Date: Tuesday, 8 November 2022 at 11:04
To: [email protected] <[email protected]>
Subject: Re: Weird sparql problem

Both your suggestions for rewriting the query worked. I'm lost with the reasons, but for future cases, breaking problematic queries with {} is the way to go?

On 04/11/2022 11.25, [email protected] wrote:
> So yes as suspected the triple patterns are being reordered badly in the BGP:
>
>   (sequence
>     (table (vars ?sct_code)
>       (row [?sct_code "298314008"])
>     )
>     (bgp
>       (triple ?c skos:inScheme lsu:SNOMEDCT_US)
>       (triple ?c skosxl:prefLabel ??0)
>       (triple ??0 lsu:code ?sct_code)
>     )))
>
> The optimizer doesn’t take into account the fact that the ?sct_code variable
> is going to be bound by the VALUES clause (table in the algebra) so considers
> that the least specific triple pattern (as it has two variables) causing it
> to evaluate a much less specific triple pattern first.
>
> Lorenz’s suggestion of generating statistics for your dataset is a good one,
> statistics would likely guide the optimiser that the ?c skos:inScheme
> lsu:SNOMEDCT_US triple is actually very non-specific for your dataset.
>
> You could also try Andy’s suggestion else-thread i.e. --set
> arq:optReorderBGP=false passed to the CLI command in question, or if this is
> being called from code ARQ.getContext().set(ARQ.optReorderBGP, false);
>
> The other thing you can do is explicitly break up your query further i.e.
>
>   { VALUES ?sct_code { "298314008" }
>     { _:b0 lsu:code ?sct_code .
>       ?c skosxl:prefLabel _:b0 . }
>     { ?c skos:inScheme lsu:SNOMEDCT_US }
>   }
>
> Essentially forcing the engine to evaluate that very unspecific triple
> pattern last
>
> Another possibility would be to change that triple pattern to be in a FILTER
> EXISTS condition, so it’d only be evaluated for matches to your other triple
> patterns i.e.
>
>   { VALUES ?sct_code { "298314008" }
>     _:b0 lsu:code ?sct_code .
>     ?c skosxl:prefLabel _:b0 .
>     FILTER EXISTS { ?c skos:inScheme lsu:SNOMEDCT_US }
>   }
>
> Hope this helps,
>
> Rob
>
> From: Lorenz Buehmann <[email protected]>
> Date: Thursday, 3 November 2022 at 11:12
> To: [email protected] <[email protected]>
> Subject: Re: Re: Weird sparql problem
> tdbquery --explain --loc $TDB_LOC "query here"
>
> would also work to see the plan - maybe also increase log level to see
> more: https://jena.apache.org/documentation/tdb/optimizer.html
>
> Another question, did you generate the TDB stats such those could be
> used by the optimizer?
>
> for debugging purpose, you could also disable query optimization (put an
> empty none.opt file into $TDB_LOC/Data-0001 dir) and reorder your query
> manually, i.e.
>
>> WHERE
>>   { VALUES ?sct_code { "298314008" }
>>     _:b0 lsu:code ?sct_code .
>>     ?c skosxl:prefLabel _:b0 .
>>     ?c skos:inScheme lsu:SNOMEDCT_US
>>   }
> without stats and based on heuristics (e.g. number of variables in
> triple pattern), otherwise the last triple pattern might always be
> evaluated first
>
>
> On 03.11.22 11:11, Mikael Pesonen wrote:
>> Here's the parse, hope it helps:
>>
>> WHERE
>>   { VALUES ?sct_code { "298314008" }
>>     ?c skosxl:prefLabel _:b0 .
>>     _:b0 lsu:code ?sct_code .
>>     ?c skos:inScheme lsu:SNOMEDCT_US
>>   }
>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>> (prefix ((owl: <http://www.w3.org/2002/07/owl#>)
>>          (rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>)
>>          (skosxl: <http://www.w3.org/2008/05/skos-xl#>)
>>          (skos: <http://www.w3.org/2004/02/skos/core#>)
>>          (dcterms: <http://purl.org/dc/terms/>)
>>          (rdfs: <http://www.w3.org/2000/01/rdf-schema#>)
>>          (lsr: <https://resource.lingsoft.fi/>)
>>          (id: <http://snomed.info/id/>)
>>          (dcat: <http://www.w3.org/ns/dcat#>)
>>          (dc: <http://purl.org/dc/elements/1.1/>)
>>          (lsu: <https://www.lingsoft.fi/ns/umls/>))
>>   (sequence
>>     (table (vars ?sct_code)
>>       (row [?sct_code "298314008"])
>>     )
>>     (bgp
>>       (triple ?c skos:inScheme lsu:SNOMEDCT_US)
>>       (triple ?c skosxl:prefLabel ??0)
>>       (triple ??0 lsu:code ?sct_code)
>>     )))
>>
>>
>> On 02/11/2022 12.32, [email protected] wrote:
>>> For these kinds of performance issues it is useful to see the SPARQL
>>> algebra for the whole query, not just fragments of the query. You
>>> can use the qparse command for the version of Jena you are using to
>>> see how it is optimising your queries e.g.
>>>
>>> qparse --explain --query example.rq
>>>
>>> As Lorenz suggests this may be the optimiser making a bad guess at
>>> the appropriate order in which to evaluate the triple patterns within
>>> the BGP but without the larger query context or the algebra all we
>>> can do is guess.
>>>
>>> Rob
>>>
>>> From: Mikael Pesonen <[email protected]>
>>> Date: Tuesday, 1 November 2022 at 12:53
>>> To: [email protected] <[email protected]>
>>> Subject: Re: Weird sparql problem
>>> Different case, but again hanging makes no sense to user, whatever are
>>> the technical reasons.
>>>
>>> VALUES ?sct_code { "298314008" }
>>> ?c skosxl:prefLabel [ lsu:code ?sct_code ]
>>>
>>> returns one row immediately, but
>>>
>>> VALUES ?sct_code { "298314008" }
>>> ?c skosxl:prefLabel [ lsu:code ?sct_code ]; skos:inScheme
>>> lsu:SNOMEDCT_US
>>>
>>> hangs forever
>>>
>>>
>>> skos:inScheme lsu:SNOMEDCT_US;
>>>
>>> On 18/10/2022 9.08, Lorenz Buehmann wrote:
>>>> Hi,
>>>>
>>>> comments inline
>>>>
>>>> On 17.10.22 14:35, Mikael Pesonen wrote:
>>>>> This works as a separate query, but not in the middle, since ?s
>>>>> gets new values instead of binding to previous ?s.
>>>>>
>>>>> { select ?t where {
>>>>>   ?s a ?t .
>>>>> } limit 10}
>>>>> ?t skos:prefLabel ?l
>>>> In the middle of what? Subqueries will be evaluated first - if you
>>>> really want labels for classes, you should use a DISTINCT in the
>>>> subquery such that the intermediate result is small, there shouldn't
>>>> be that many classes, but many instances with the same class, thus,
>>>> the join would be more expensive than necessary.
>>>>
>>>>
>>>>> On 17/10/2022 14.56, Mikael Pesonen wrote:
>>>>>> ?s a ?t .
>>>>>> ?t skos:prefLabel ?l
>>>>>>
>>>>>> returns 3 million triples. Maybe it's related to this?
>>>> I don't see how this should be related to your initial query where ?s
>>>> was bound, which in my opinion should be an easy join. Is it possible
>>>> for you to share the dataset somehow? Also, what you can do is to
>>>> compute statistics for the TDB database with tdbstats tool [1] from
>>>> commandline and put it into the TDB folder.
>>>> But even without statistics, the query
>>>> plan should take the first triple pattern, use the spo index as s and
>>>> p are bound, then pass the bindings of ?o to the evaluation of the
>>>> second triple pattern
>>>>
>>>> [1]
>>>> https://jena.apache.org/documentation/tdb/optimizer.html#generating-a-statistics-file
>>>>
>>>>
>>>>>> On 21/09/2022 9.15, Lorenz Buehmann wrote:
>>>>>>> Weird, only 10M triples and each triple pattern returns only 1
>>>>>>> binding, thus, the size is tiny - honestly I can't think of
>>>>>>> anything except for open connections, but as you mentioned, running
>>>>>>> the queries with only one triple pattern works as expected, so
>>>>>>> too many open connections shouldn't be an issue most likely.
>>>>>>>
>>>>>>> Can you reproduce this behavior with newer Jena versions like 4.6.1?
>>>>>>>
>>>>>>> Or can you reproduce this on different servers as well?
>>>>>>>
>>>>>>> Is it also stuck if you run the query directly after you restart
>>>>>>> Fuseki?
>>>>>>>
>>>>>>>
>>>>>>> On 19.09.22 13:49, Mikael Pesonen wrote:
>>>>>>>> On 15/09/2022 17.48, Lorenz Buehmann wrote:
>>>>>>>>> Forgot:
>>>>>>>>>
>>>>>>>>> - size of result for each triple pattern? Might affect if hash
>>>>>>>>> join can be used.
>>>>>>>> It's one row for each.
>>>>>>>>> - your hardware?
>>>>>>>> Normal server with 16gigs mem.
>>>>>>>>> - is it just the first query after starting Fuseki? Connections
>>>>>>>>> have been closed? Note, there was also a bug in a recent Jena
>>>>>>>>> version, but only with TDB and too many open connections. It has
>>>>>>>>> been resolved with release 4.6.1.
>>>>>>>> Jena has been running quite a while.
>>>>>>>>> Might not be related, but I'm mentioning all things here
>>>>>>>>> nevertheless.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 15.09.22 11:16, Mikael Pesonen wrote:
>>>>>>>>>> This returns one row fast, say :C1
>>>>>>>>>>
>>>>>>>>>> SELECT *
>>>>>>>>>> FROM <https://a.b.c>
>>>>>>>>>> WHERE {
>>>>>>>>>> <https://x.y.z> a ?t .
>>>>>>>>>> #?t skos:prefLabel ?l
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> and this too:
>>>>>>>>>>
>>>>>>>>>> SELECT *
>>>>>>>>>> FROM <https://a.b.c>
>>>>>>>>>> WHERE {
>>>>>>>>>> #<https://x.y.z> a ?t .
>>>>>>>>>> :C1 skos:prefLabel ?l
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> But this always hangs until timeout
>>>>>>>>>>
>>>>>>>>>> SELECT *
>>>>>>>>>> FROM <https://a.b.c>
>>>>>>>>>> WHERE {
>>>>>>>>>> <https://x.y.z> a ?t .
>>>>>>>>>> ?t skos:prefLabel ?l
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> What am I missing here? I'm using Fuseki web GUI. Thanks!
>>> --
>>> Lingsoft - 30 years of Leading Language Management
>>>
>>> www.lingsoft.fi
>>>
>>> Speech Applications - Language Management - Translation - Reader's
>>> and Writer's Tools - Text Tools - E-books and M-books
>>>
>>> Mikael Pesonen
>>> System Engineer
>>>
>>> e-mail: [email protected]
>>> Tel. +358 2 279 3300
>>>
>>> Time zone: GMT+2
>>>
>>> Helsinki Office
>>> Eteläranta 10
>>> FI-00130 Helsinki
>>> FINLAND
>>>
>>> Turku Office
>>> Kauppiaskatu 5 A
>>> FI-20100 Turku
>>> FINLAND
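
To make the ordering effect discussed in this thread concrete, here is a toy sketch in Python. It is not ARQ's or TDB's actual algorithm: the in-memory store, the greedy variable-counting heuristic (greedy_order) and all helper names are made up for illustration, and the blank node _:b0 is modelled as the variable ?l. It evaluates the three triple patterns from the query so that bindings from earlier patterns restrict later ones, and counts the intermediate bindings produced by an ordering chosen without knowledge of the VALUES binding versus one chosen with it.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]


def is_var(term: str) -> bool:
    return term.startswith("?")


def build_store(n: int) -> List[Triple]:
    """Hypothetical dataset: n concepts, all in the same scheme."""
    store: List[Triple] = []
    for i in range(n):
        concept, label = f"c{i}", f"label{i}"
        store.append((concept, "skos:inScheme", "lsu:SNOMEDCT_US"))
        store.append((concept, "skosxl:prefLabel", label))
        store.append((label, "lsu:code", str(i)))
    return store


def match(pattern: Triple, binding: Binding, store: List[Triple]) -> List[Binding]:
    """Return every extension of `binding` that makes `pattern` match a triple."""
    results = []
    for triple in store:
        extended = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if is_var(term):
                # Bind the variable, or check it agrees with an earlier binding.
                if extended.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            results.append(extended)
    return results


def evaluate(patterns: Iterable[Triple], seed: Binding, store: List[Triple]):
    """Evaluate patterns left to right; also count intermediate bindings."""
    bindings = [dict(seed)]
    intermediate = 0
    for pattern in patterns:
        bindings = [b for cur in bindings for b in match(pattern, cur, store)]
        intermediate += len(bindings)
    return bindings, intermediate


def greedy_order(patterns: List[Triple], bound: Set[str]) -> List[Triple]:
    """Toy heuristic: repeatedly pick the pattern with the fewest unbound
    variables; ties are broken by query order."""
    known = set(bound)
    remaining = list(patterns)
    ordered = []
    while remaining:
        best = min(remaining,
                   key=lambda p: sum(1 for t in p if is_var(t) and t not in known))
        ordered.append(best)
        remaining.remove(best)
        known.update(t for t in best if is_var(t))
    return ordered


QUERY = [
    ("?c", "skosxl:prefLabel", "?l"),
    ("?l", "lsu:code", "?sct_code"),
    ("?c", "skos:inScheme", "lsu:SNOMEDCT_US"),
]
store = build_store(200)
seed = {"?sct_code": "42"}  # plays the role of the VALUES clause

blind = greedy_order(QUERY, bound=set())      # ignores the VALUES binding
aware = greedy_order(QUERY, bound=set(seed))  # knows ?sct_code is bound

rows_blind, work_blind = evaluate(blind, seed, store)
rows_aware, work_aware = evaluate(aware, seed, store)
print("blind order:", blind, "intermediate bindings:", work_blind)
print("aware order:", aware, "intermediate bindings:", work_aware)
```

With 200 concepts, the VALUES-blind ordering starts with the skos:inScheme pattern (as in the algebra above) and produces 401 intermediate bindings, while the VALUES-aware ordering produces 3, for the same single answer. Note that the aware ordering only wins here because ties are broken by query order; a pure variable-counting heuristic cannot tell that ?c skos:inScheme lsu:SNOMEDCT_US matches far more triples than the lsu:code pattern, which is where dataset statistics help.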
