Re: Weird sparql problem

[email protected] Wed, 09 Nov 2022 01:44:13 -0800

TL;DR

Breaking a query up with {} is a workaround for this issue because it can force 
the optimiser to behave differently.  However, it should be used sparingly, as 
overuse may prevent other optimisations that may yield more benefit than you 
lose elsewhere.

Andy’s recent email [1] says that the offending optimisation will be disabled 
by default in future releases, so the workaround will not be needed longer term.

----

The long-winded details for those interested (although I’m still glossing over 
lots of low level details)…

There are two levels of query optimisation in ARQ:

  1.  Logical optimisation (sometimes referred to as algebra optimisation)
  2.  Execution optimisation

The logical optimiser works at the SPARQL Algebra level and looks to make 
transformations to the algebra that are known to improve performance based on 
experience, past research etc.  In doing so the optimiser has to ensure that 
those transformations are semantically safe, i.e., they MUST NOT change the 
overall semantics of the query and MUST produce the same answers as the 
original query.  Therefore, many of these optimisations are applied quite 
conservatively: if ARQ cannot determine that a given transformation is 
semantically safe, it won’t apply it.

Additionally in some cases, these optimisations are specifically intended to be 
chained, i.e., doing one optimisation may enable further optimisations, thus 
the ARQ optimiser applies the various transformations in a specific order.  See 
https://github.com/apache/jena/blob/main/jena-arq/src/main/java/org/apache/jena/sparql/algebra/optimize/OptimizerStd.java
 if you want to see the raw details of this and some explanatory comments 
around the ordering.  Some of these logical transformations are also done to 
enable execution optimiser behaviour later in query evaluation e.g., join 
linearisation.

The downside of the logical optimiser is that it works purely by static 
analysis of the algebra, i.e., without reference to the dataset against which 
the query will ultimately be evaluated.  This means that sometimes it can make 
decisions that are good for the general case BUT bad for some datasets.
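
To make the "static analysis" point concrete, here is a toy sketch in Python.  
It is NOT ARQ's code: the rule "fewer variables means more specific" is an 
assumed stand-in for a data-blind heuristic, and the function names are made 
up for illustration.  It uses the triple patterns from the query in this 
thread (with ?b0 standing in for the blank node) and has no way of knowing 
that ?sct_code will already be bound by VALUES.

```python
# Toy model of a data-blind reordering heuristic. This is NOT ARQ's code;
# the weighting rule (fewer variables = more specific) is an assumption.

def is_var(term):
    # Variables are written SPARQL-style with a leading "?".
    return term.startswith("?")

def static_order(patterns):
    # Stable sort by number of variables: patterns with more constants go first.
    return sorted(patterns, key=lambda tp: sum(is_var(t) for t in tp))

bgp = [
    ("?c", "skosxl:prefLabel", "?b0"),
    ("?b0", "lsu:code", "?sct_code"),
    ("?c", "skos:inScheme", "lsu:SNOMEDCT_US"),
]

ordered = static_order(bgp)
for tp in ordered:
    print(tp)
```

Under this assumed rule the skos:inScheme pattern sorts first because it has 
only one variable, regardless of how many triples it matches in the data.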

The execution optimiser is a whole bunch of things done during actual execution 
to improve performance.  This includes everything from Jena’s streaming-based 
iterator implementation of query execution (effectively a Volcano based 
evaluation model [2]), ARQ’s join linearisation operators, TDB’s low level Node 
IDs and direct expression evaluation over those, Node ID to RDF Term caching 
(and vice-versa), memory-mapping of database indices etc.
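
For readers unfamiliar with the Volcano model, the idea can be sketched with 
Python generators.  This is only an illustration of the pull-based iterator 
style; the operator names (scan, select, limit) are hypothetical and do not 
correspond to Jena classes.

```python
# Volcano-style (iterator) evaluation sketched with Python generators.
# Each operator pulls rows from its child only when asked for the next row.

def scan(rows, log):
    for row in rows:
        log.append(row)  # record every row actually read from the source
        yield row

def select(pred, child):
    for row in child:
        if pred(row):
            yield row

def limit(n, child):
    if n <= 0:
        return
    for count, row in enumerate(child, start=1):
        yield row
        if count >= n:
            return  # stop pulling from the child once n rows were produced

pulled = []
plan = limit(2, select(lambda r: r % 2 == 0, scan(range(1_000_000), pulled)))
result = list(plan)
print(result, len(pulled))
```

Because every operator is lazy, the scan stops as soon as the limit is 
satisfied rather than reading the whole input.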

Execution also includes BGP reordering: when the query evaluator gets a BGP to 
evaluate, it can choose to reorder the triple patterns within that BGP.  For 
TDB this is controlled by the presence, or lack thereof, of a stats.opt, 
fixed.opt or none.opt file in the database directory.  Having a relevant file 
present should apply further execution-time BGP reordering that can be 
statistics-aware and thus avoid the problem with the logical BGP reordering.  
During execution of a single BGP, bindings from earlier triple patterns are 
used to restrict the searches made for subsequent triple patterns, so the order 
in which the triple patterns are executed can be important, especially if one 
triple pattern has many matches.
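
The effect of pattern order can be seen in a small simulation.  The sketch 
below is a naive evaluator written for illustration only (it is not how ARQ or 
TDB evaluate BGPs), run over made-up data shaped like the dataset in this 
thread: 200 concepts in one scheme, each with one label node carrying one code.  
It counts the intermediate bindings produced by each ordering.

```python
def match(pattern, triple, binding):
    """Try to extend `binding` so `pattern` matches `triple`; None on failure."""
    out = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if out.setdefault(p, t) != t:
                return None  # variable already bound to a different value
        elif p != t:
            return None      # constant does not match
    return out

def evaluate(patterns, triples, binding):
    """Evaluate patterns left to right; return (solutions, intermediate count)."""
    solutions, intermediate = [binding], 0
    for pattern in patterns:
        next_solutions = []
        for b in solutions:
            for triple in triples:
                extended = match(pattern, triple, b)
                if extended is not None:
                    next_solutions.append(extended)
        intermediate += len(next_solutions)
        solutions = next_solutions
    return solutions, intermediate

# Made-up data: every concept is in the same scheme, codes are unique.
triples = []
for i in range(200):
    triples.append((f"c{i}", "skos:inScheme", "lsu:SNOMEDCT_US"))
    triples.append((f"c{i}", "skosxl:prefLabel", f"l{i}"))
    triples.append((f"l{i}", "lsu:code", f"code{i}"))

in_scheme = ("?c", "skos:inScheme", "lsu:SNOMEDCT_US")
pref_label = ("?c", "skosxl:prefLabel", "?b0")
code = ("?b0", "lsu:code", "?sct_code")
start = {"?sct_code": "code7"}  # plays the role of the VALUES clause

bad_solutions, bad_cost = evaluate([in_scheme, pref_label, code], triples, start)
good_solutions, good_cost = evaluate([code, pref_label, in_scheme], triples, start)
print(bad_cost, good_cost)
```

Both orderings return the same single answer, but starting with the 
unselective skos:inScheme pattern drags every concept through the later 
patterns before the code finally filters them out.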

However, if you are querying an in-memory dataset instead of TDB, then you may 
not be getting any execution time BGP reordering so you’re left evaluating the 
triple patterns according to the logical BGP reordering that may turn out to be 
sub-optimal depending on your dataset.

---

The specific problem discussed in this thread is due to a new optimisation that 
was introduced in Jena 4.5.0 (BGP reordering during logical optimisation).  
This optimisation was shown to improve performance on some benchmarks as it 
enables more aggressive application of another optimisation (filter 
placement).

The reason it did the opposite on some users’ datasets is that it is done 
without any knowledge of the data (as it’s a logical optimisation) and can 
result in breaking up a BGP into separate BGPs, causing less specific triple 
patterns to be evaluated prior to more specific ones.  Where no execution-time 
BGP reordering occurs, it can also leave the triple patterns in a sub-optimal 
order for evaluation even if BGPs are not split.

The short-term fix (again see Andy’s email [1]) is to disable this optimisation 
by default; users can opt back into it if they find it benefits their usage of 
Jena on their datasets.

The long-term fix is probably to rearchitect the logical optimiser in some way 
to allow more data context to be visible to it, i.e., making the logical BGP 
reordering statistics-aware and making ARQ’s overall optimisation strategy 
more hybrid.  If anyone is interested, I’d imagine there’ll be a thread on this 
on the dev list soon.
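
As a rough idea of what "statistics-aware" reordering could look like, here is 
a hypothetical Python sketch.  Nothing in it reflects ARQ's or TDB's actual 
design: the statistics shape (total triples, distinct subjects, distinct 
objects per predicate), the cost formula and the greedy strategy are all 
assumptions chosen for illustration, and the numbers are made up.

```python
def estimate(pattern, bound, stats):
    """Estimated matches for a pattern, given the variables already bound.
    Assumes the predicate is always a constant that appears in `stats`."""
    s, p, o = pattern
    total, subjects, objects = stats[p]
    s_fixed = not s.startswith("?") or s in bound
    o_fixed = not o.startswith("?") or o in bound
    if s_fixed and o_fixed:
        return 1
    if s_fixed:
        return total / subjects  # average triples per subject
    if o_fixed:
        return total / objects   # average triples per object
    return total

def stats_order(patterns, bound, stats):
    """Greedily pick the cheapest remaining pattern, then bind its variables."""
    remaining, bound, ordered = list(patterns), set(bound), []
    while remaining:
        best = min(remaining, key=lambda tp: estimate(tp, bound, stats))
        remaining.remove(best)
        ordered.append(best)
        bound.update(t for t in best if t.startswith("?"))
    return ordered

# Hypothetical statistics: predicate -> (triples, distinct subjects, distinct objects)
stats = {
    "skos:inScheme": (200, 200, 1),
    "skosxl:prefLabel": (200, 200, 200),
    "lsu:code": (200, 200, 200),
}
bgp = [
    ("?c", "skos:inScheme", "lsu:SNOMEDCT_US"),
    ("?c", "skosxl:prefLabel", "?b0"),
    ("?b0", "lsu:code", "?sct_code"),
]

# ?sct_code is treated as bound, as it would be after the VALUES clause.
ordered = stats_order(bgp, {"?sct_code"}, stats)
for tp in ordered:
    print(tp)
```

With these assumed statistics the lsu:code pattern is chosen first and the 
skos:inScheme pattern last, which is the order the manual rewrites in this 
thread force by hand.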

Hope this helps,

Rob

[1]: https://lists.apache.org/thread/37cloogcb3wzmkl0s33ttnxyg0kvq69p
[2]: http://daslab.seas.harvard.edu/reading-group/papers/volcano.pdf

From: Mikael Pesonen <[email protected]>
Date: Tuesday, 8 November 2022 at 11:04
To: [email protected] <[email protected]>
Subject: Re: Weird sparql problem
Both your suggestions for rewriting the query worked. I'm lost with the
reasons, but for future cases, is breaking up problematic queries with {}
the way to go?

On 04/11/2022 11.25, [email protected] wrote:
> So yes as suspected the triple patterns are being reordered badly in the BGP:
>
>    (sequence
>      (table (vars ?sct_code)
>        (row [?sct_code "298314008"])
>      )
>      (bgp
>        (triple ?c skos:inScheme lsu:SNOMEDCT_US)
>        (triple ?c skosxl:prefLabel ??0)
>        (triple ??0 lsu:code ?sct_code)
>      )))
>
> The optimizer doesn’t take into account the fact that the ?sct_code variable 
> is going to be bound by the VALUES clause (table in the algebra), so it treats 
> the lsu:code triple pattern as the least specific (as it has two variables) 
> and places it last, causing a much less specific triple pattern to be 
> evaluated first.
>
> Lorenz’s suggestion of generating statistics for your dataset is a good one, 
> statistics would likely guide the optimiser that the ?c skos:inScheme 
> lsu:SNOMEDCT_US triple is actually very non-specific for your dataset.
>
> You could also try Andy’s suggestion else-thread i.e. --set 
> arq:optReorderBGP=false passed to the CLI command in question, or if this is 
> being called from code ARQ.getContext().set(ARQ.optReorderBGP, false);
>
> The other thing you can do is explicitly break up your query further i.e.
>
> { VALUES ?sct_code { "298314008" }
>    {  _:b0  lsu:code          ?sct_code .
>      ?c    skosxl:prefLabel  _:b0 . }
>    {  ?c    skos:inScheme     lsu:SNOMEDCT_US }
>    }
>
> Essentially forcing the engine to evaluate that very unspecific triple 
> pattern last
>
> Another possibility would be to change that triple pattern to be in a FILTER 
> EXISTS condition, so it’d only be evaluated for matches to your other triple 
> patterns i.e.
>
> { VALUES ?sct_code { "298314008" }
>      _:b0  lsu:code          ?sct_code .
>      ?c    skosxl:prefLabel  _:b0 .
>     FILTER EXISTS {  ?c    skos:inScheme     lsu:SNOMEDCT_US }
>    }
>
> Hope this helps,
>
> Rob
>
> From: Lorenz Buehmann <[email protected]>
> Date: Thursday, 3 November 2022 at 11:12
> To: [email protected] <[email protected]>
> Subject: Re: Re: Weird sparql problem
> tdbquery --explain --loc  $TDB_LOC  "query here"
>
> would also work to see the plan - maybe also increase log level to see
> more: https://jena.apache.org/documentation/tdb/optimizer.html
>
> Another question, did you generate the TDB stats so that they could be
> used by the optimizer?
>
> for debugging purpose, you could also disable query optimization (put an
> empty none.opt file into $TDB_LOC/Data-0001 dir)  and reorder your query
> manually, i.e.
>
>> WHERE
>>    { VALUES ?sct_code { "298314008" }
>>    _:b0  lsu:code          ?sct_code .
>>      ?c    skosxl:prefLabel  _:b0 .
>>      ?c    skos:inScheme     lsu:SNOMEDCT_US
>>    }
> without stats and based on heuristics (e.g. number of variables in
> triple pattern), otherwise the last triple pattern might always be
> evaluated first
>
>
> On 03.11.22 11:11, Mikael Pesonen wrote:
>> Here's the parse, hope it helps:
>>
>> WHERE
>>    { VALUES ?sct_code { "298314008" }
>>      ?c    skosxl:prefLabel  _:b0 .
>>      _:b0  lsu:code          ?sct_code .
>>      ?c    skos:inScheme     lsu:SNOMEDCT_US
>>    }
>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>> (prefix ((owl: <http://www.w3.org/2002/07/owl#>)
>>           (rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>)
>>           (skosxl: <http://www.w3.org/2008/05/skos-xl#>)
>>           (skos: <http://www.w3.org/2004/02/skos/core#>)
>>           (dcterms: <http://purl.org/dc/terms/>)
>>           (rdfs: <http://www.w3.org/2000/01/rdf-schema#>)
>>           (lsr: <https://resource.lingsoft.fi/>)
>>           (id: <http://snomed.info/id/>)
>>           (dcat: <http://www.w3.org/ns/dcat#>)
>>           (dc: <http://purl.org/dc/elements/1.1/>)
>>           (lsu: <https://www.lingsoft.fi/ns/umls/>))
>>    (sequence
>>      (table (vars ?sct_code)
>>        (row [?sct_code "298314008"])
>>      )
>>      (bgp
>>        (triple ?c skos:inScheme lsu:SNOMEDCT_US)
>>        (triple ?c skosxl:prefLabel ??0)
>>        (triple ??0 lsu:code ?sct_code)
>>      )))
>>
>>
>> On 02/11/2022 12.32, [email protected] wrote:
>>> For these kind of performance issues it is useful to see the SPARQL
>>> algebra for the whole query, not just fragments of the query.  You
>>> can use the qparse command for the version of Jena you are using to
>>> see how it is optimising your queries e.g.
>>>
>>> qparse --explain --query example.rq
>>>
>>> As Lorenz suggests this may be the optimiser making a bad guess at
>>> the appropriate order in which to evaluate the triple patterns within
>>> the BGP but without the larger query context or the algebra all we
>>> can do is guess.
>>>
>>> Rob
>>>
>>> From: Mikael Pesonen <[email protected]>
>>> Date: Tuesday, 1 November 2022 at 12:53
>>> To: [email protected] <[email protected]>
>>> Subject: Re: Weird sparql problem
>>> Different case, but again hanging makes no sense to the user, whatever
>>> the technical reasons are.
>>>
>>>     VALUES ?sct_code { "298314008" }
>>>       ?c skosxl:prefLabel [ lsu:code ?sct_code ]
>>>
>>> returns one row immediately, but
>>>
>>>     VALUES ?sct_code { "298314008" }
>>>       ?c skosxl:prefLabel [ lsu:code ?sct_code ]; skos:inScheme lsu:SNOMEDCT_US
>>>
>>> hangs forever
>>>
>>>
>>>     skos:inScheme lsu:SNOMEDCT_US;
>>>
>>> On 18/10/2022 9.08, Lorenz Buehmann wrote:
>>>> Hi,
>>>>
>>>> comments inline
>>>>
>>>> On 17.10.22 14:35, Mikael Pesonen wrote:
>>>>> This works as a separate query, but not in the middle, since ?s
>>>>> gets new values instead of binding to previous ?s.
>>>>>
>>>>> { select ?t where {
>>>>> ?s a ?t .
>>>>>    } limit 10}
>>>>>     ?t skos:prefLabel ?l
>>>> In the middle of what? Subqueries will be evaluated first - if you
>>>> really want labels for classes, you should use a DISTINCT in the
>>>> subquery such that the intermediate result is small, there shouldn't
>>>> be that many classes, but many instances with the same class, thus,
>>>> the join would be more expensive than necessary.
>>>>
>>>>
>>>>> On 17/10/2022 14.56, Mikael Pesonen wrote:
>>>>>> ?s a ?t .
>>>>>>     ?t skos:prefLabel ?l
>>>>>>
>>>>>> returns 3 million triples. Maybe it's related to this?
>>>> I don't see how this should be related to  your initial query where ?s
>>>> was bound, which in my opinion should be an easy join. Is it possible
>>>> for you to share the dataset somehow? Also, what you can do is
>>>> compute statistics for the TDB database with the tdbstats tool [1] from
>>>> the command line and put the result into the TDB folder. But even without
>>>> statistics, the query plan should take the first triple pattern, use the
>>>> spo index as s and p are bound, then pass the bindings of ?o to the
>>>> evaluation of the second triple pattern.
>>>>
>>>> [1]
>>>> https://jena.apache.org/documentation/tdb/optimizer.html#generating-a-statistics-file
>>>>
>>>>
>>>>
>>>>>> On 21/09/2022 9.15, Lorenz Buehmann wrote:
>>>>>>> Weird, only 10M triples and each triple pattern returns only 1
>>>>>>> binding, thus, the size is tiny - honestly I can't think of
>>>>>>> anything except for open connections, but as you mentioned, running
>>>>>>> the queries with only one triple pattern works as expected, so that
>>>>>>> too many open connections shouldn't be an issue most likely.
>>>>>>>
>>>>>>> Can you reproduce this behavior with newer Jena versions like 4.6.1?
>>>>>>>
>>>>>>> Or can you reproduce this on different servers as well?
>>>>>>>
>>>>>>> Is it also stuck if you run the query directly after you restart
>>>>>>> Fuseki?
>>>>>>>
>>>>>>>
>>>>>>> On 19.09.22 13:49, Mikael Pesonen wrote:
>>>>>>>> On 15/09/2022 17.48, Lorenz Buehmann wrote:
>>>>>>>>> Forgot:
>>>>>>>>>
>>>>>>>>> - size of result for each triple pattern? Might affect if hash
>>>>>>>>> join can be used.
>>>>>>>> It's one row for each.
>>>>>>>>> - your hardware?
>>>>>>>> Normal server with 16gigs mem.
>>>>>>>>> - is it just the first query after starting Fuseki? Connections
>>>>>>>>> have been closed? Note, there was also a bug in a recent Jena
>>>>>>>>> version, but only with TDB and too many open connections. It has
>>>>>>>>> been resolved with release 4.6.1.
>>>>>>>> Jena has been running quite a while.
>>>>>>>>> Might not be related, but I'm mentioning all things here
>>>>>>>>> nevertheless.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 15.09.22 11:16, Mikael Pesonen wrote:
>>>>>>>>>> This returns one row fast, say :C1
>>>>>>>>>>
>>>>>>>>>> SELECT *
>>>>>>>>>> FROM <https://a.b.c>
>>>>>>>>>> WHERE {
>>>>>>>>>>     <https://x.y.z> a ?t .
>>>>>>>>>>     #?t skos:prefLabel ?l
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> and this too:
>>>>>>>>>>
>>>>>>>>>> SELECT *
>>>>>>>>>> FROM <https://a.b.c>
>>>>>>>>>> WHERE {
>>>>>>>>>>     #<https://x.y.z> a ?t .
>>>>>>>>>>     :C1 skos:prefLabel ?l
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> But this always hangs until timeout
>>>>>>>>>>
>>>>>>>>>> SELECT *
>>>>>>>>>> FROM <https://a.b.c>
>>>>>>>>>> WHERE {
>>>>>>>>>>     <https://x.y.z> a ?t .
>>>>>>>>>>     ?t skos:prefLabel ?l
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> What am I missing here? I'm using Fuseki web GUI. Thanks!
>>> --
>>> Lingsoft - 30 years of Leading Language Management
>>>
>>> www.lingsoft.fi<http://www.lingsoft.fi>
>>>
>>> Speech Applications - Language Management - Translation - Reader's
>>> and Writer's Tools - Text Tools - E-books and M-books
>>>
>>> Mikael Pesonen
>>> System Engineer
>>>
>>> e-mail: [email protected]
>>> Tel. +358 2 279 3300
>>>
>>> Time zone: GMT+2
>>>
>>> Helsinki Office
>>> Eteläranta 10
>>> FI-00130 Helsinki
>>> FINLAND
>>>
>>> Turku Office
>>> Kauppiaskatu 5 A
>>> FI-20100 Turku
>>> FINLAND
>>>
