Re: Weird sparql problem

Mikael Pesonen Tue, 15 Nov 2022 03:13:30 -0800

I'm now finding many similar cases where basic queries just won't work. Maybe we have reached the max size of the db (77 GB on disk)? Or should any serious SPARQL/triple store user just learn how to optimize queries? That is different from MySQL, for example.


For example, this also hangs (or is slow):

(?lit) text:query (skosxl:literalForm  "\"fever\"" "lang:en" ) .
?c skosxl:prefLabel|skosxl:altLabel [skosxl:literalForm ?lit]

and this works:
(?lit) text:query (skosxl:literalForm  "\"fever\"" "lang:en" ) .
{ ?c skosxl:prefLabel [skosxl:literalForm ?lit] }
UNION
{ ?c skosxl:altLabel [skosxl:literalForm ?lit] }

On 09/11/2022 11.43, [email protected] wrote:

TL;DR

It’s a workaround for this issue because it can force the optimiser to behave 
differently, however it should be used sparingly as overuse may prevent other 
optimisations that may yield more benefit than you lose elsewhere.

See Andy’s recent email [1] that the offending optimisation will be disabled by 
default in future releases so the workaround will not be needed longer term.

----

The long-winded details for those interested (although I’m still glossing over 
lots of low level details)…

There are two levels of query optimisation in ARQ:


   1.  Logical optimisation (sometimes referred to as algebra optimisation)
   2.  Execution optimisation

The logical optimiser works at the SPARQL Algebra level and looks to make 
transformations to the algebra that are known to improve performance based on 
experience, past research etc.  In doing so the optimiser has to ensure that 
those transformations are semantically safe, i.e., they MUST NOT change the 
overall semantics of the query and MUST produce the same answers as the 
original query.  Therefore, many of these optimisations are applied quite 
conservatively: if ARQ cannot determine that a given transformation would be 
semantically safe, it won’t apply it.

Additionally in some cases, these optimisations are specifically intended to be 
chained, i.e., doing one optimisation may enable further optimisations, thus 
the ARQ optimiser applies the various transformations in a specific order.  See 
https://github.com/apache/jena/blob/main/jena-arq/src/main/java/org/apache/jena/sparql/algebra/optimize/OptimizerStd.java
 if you want to see the raw details of this and some explanatory comments 
around the ordering.  Some of these logical transformations are also done to 
enable execution optimiser behaviour later in query evaluation e.g., join 
linearisation.

The downside of the logical optimiser is that it works purely by static 
analysis of the algebra, i.e., without reference to the dataset against which 
the query will ultimately be evaluated.  This means that sometimes it can make 
decisions that are good for the general case BUT bad for some datasets.

The execution optimiser is a whole bunch of things done during actual execution 
to improve performance.  This includes everything from Jena’s streaming-based 
iterator implementation of query execution (effectively a Volcano based 
evaluation model [2]), ARQ’s join linearisation operators, TDB’s low level Node 
IDs and direct expression evaluation over those, Node ID to RDF Term caching 
(and vice-versa), memory-mapping of database indices etc.
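
As a rough illustration of the streaming, Volcano-style evaluation model 
mentioned above, here is a minimal Python sketch.  It is not Jena's code: the 
operator names (scan, filter_rows, limit) are hypothetical, and Python 
generators merely stand in for ARQ's iterators.  The point is that each 
operator pulls rows from its input on demand, so a LIMIT near the top of the 
plan stops the scan early.

```python
def scan(rows, stats):
    """Leaf operator: yields rows one at a time and counts how many were pulled."""
    for row in rows:
        stats["scanned"] += 1
        yield row

def filter_rows(source, predicate):
    """Passes through only the rows for which the predicate holds."""
    for row in source:
        if predicate(row):
            yield row

def limit(source, n):
    """Stops pulling from its input once n rows have been produced."""
    if n <= 0:
        return
    for i, row in enumerate(source, start=1):
        yield row
        if i >= n:
            return

stats = {"scanned": 0}
data = range(1_000_000)  # hypothetical input, far larger than what is needed

# Operators are composed top-down; nothing runs until the result is consumed.
pipeline = limit(filter_rows(scan(data, stats), lambda x: x % 2 == 0), 3)
result = list(pipeline)

print(result)            # [0, 2, 4]
print(stats["scanned"])  # 5 rows pulled from the scan, not 1,000,000
```

Only five rows are ever read from the scan, because the limit operator stops 
asking for more once it has its three results.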

Execution also includes BGP reordering: when the query evaluator gets a BGP to 
evaluate, it can choose to reorder the triple patterns within that BGP.  For 
TDB this is controlled by the presence, or lack thereof, of a stats.opt, 
fixed.opt or none.opt file in the database directory.  Having a relevant file 
present should apply further execution-time BGP reordering that can be 
statistics aware and thus avoids the issue of the logical BGP reordering.  
During execution of a single BGP, bindings from earlier triple patterns are 
used to restrict the searches made for subsequent triple patterns, so the order 
in which the triple patterns are executed can be important, especially if one 
triple pattern has many matches.
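
To make the effect of triple pattern order concrete, here is a small Python 
sketch under stated assumptions.  It is a toy evaluator, not TDB: the data, the 
unprefixed predicate names and the helpers match and evaluate are all 
hypothetical, and patterns are matched by a plain scan rather than via indexes.  
It evaluates the same three patterns in two orders and counts the intermediate 
bindings each order produces.

```python
def match(triples, pattern, binding):
    """Yield extensions of `binding` that make `pattern` match some triple."""
    for triple in triples:
        new = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Variable: bind it, or check it agrees with an earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                # Constant: must match the triple exactly.
                ok = False
                break
        if ok:
            yield new

def evaluate(triples, patterns):
    """Evaluate patterns left to right; return (solutions, intermediate bindings)."""
    bindings = [{}]
    intermediate = 0
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(triples, pattern, b)]
        intermediate += len(bindings)
    return bindings, intermediate

# Hypothetical dataset: 200 concepts in one scheme, each with a coded label.
triples = []
for i in range(200):
    triples.append((f"c{i}", "inScheme", "S"))
    triples.append((f"c{i}", "prefLabel", f"l{i}"))
    triples.append((f"l{i}", "code", f"code{i}"))

unspecific_first = [("?c", "inScheme", "S"), ("?c", "prefLabel", "?l"), ("?l", "code", "code7")]
specific_first = [("?l", "code", "code7"), ("?c", "prefLabel", "?l"), ("?c", "inScheme", "S")]

solutions_bad, cost_bad = evaluate(triples, unspecific_first)
solutions_good, cost_good = evaluate(triples, specific_first)

print(cost_bad, cost_good)  # 401 3
```

Both orders return the same single solution, but starting with the pattern 
that matches every concept carries 200 bindings through the join before the 
selective pattern finally discards all but one of them.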

However, if you are querying an in-memory dataset instead of TDB, then you may 
not be getting any execution time BGP reordering so you’re left evaluating the 
triple patterns according to the logical BGP reordering that may turn out to be 
sub-optimal depending on your dataset.

---

The specific problem discussed in this thread is due to a new optimisation that 
was introduced in Jena 4.5.0 (BGP reordering during logical optimisation).  It 
was shown to improve performance on some benchmarks, as it enables more 
aggressive application of another optimisation (filter placement).

The reason it did the opposite on some users’ datasets is that it is done 
without any knowledge of the data (as it’s a logical optimisation) and can 
result in breaking up a BGP into separate BGPs, causing less specific triple 
patterns to be evaluated prior to more specific ones.  Or, where no 
execution-time BGP reordering occurs, it can leave the triple patterns in a 
sub-optimal order for evaluation even if BGPs are not split.

The short-term fix (again see Andy’s email [1]) is to disable this optimisation 
by default; users can opt back into it if they find it benefits their usage of 
Jena on their datasets.

The long-term fix is probably to rearchitect the logical optimiser in some way 
to allow more data context to be visible to it, i.e., making the logical BGP 
reordering statistics aware and making ARQ’s overall optimisation strategy more 
hybrid.  If anyone is interested, I’d imagine there’ll be a thread on this on 
the dev list soon.

Hope this helps,

Rob

[1]: https://lists.apache.org/thread/37cloogcb3wzmkl0s33ttnxyg0kvq69p
[2]: http://daslab.seas.harvard.edu/reading-group/papers/volcano.pdf


From: Mikael Pesonen <[email protected]>
Date: Tuesday, 8 November 2022 at 11:04
To: [email protected] <[email protected]>
Subject: Re: Weird sparql problem
Both your suggestions for rewriting the query worked. I'm lost as to the
reasons, but for future cases, is breaking up problematic queries with {}
the way to go?

On 04/11/2022 11.25, [email protected] wrote:

So yes as suspected the triple patterns are being reordered badly in the BGP:

    (sequence
      (table (vars ?sct_code)
        (row [?sct_code "298314008"])
      )
      (bgp
        (triple ?c skos:inScheme lsu:SNOMEDCT_US)
        (triple ?c skosxl:prefLabel ??0)
        (triple ??0 lsu:code ?sct_code)
      )))

The optimizer doesn’t take into account the fact that the ?sct_code variable is 
going to be bound by the VALUES clause (table in the algebra), so it considers 
the lsu:code triple pattern the least specific one (as it has two variables) 
and places it last, causing a much less specific triple pattern to be 
evaluated first.
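
A toy Python sketch of this effect follows.  It is an assumption-laden 
illustration, not ARQ's actual reordering algorithm: greedy_order is a 
hypothetical helper that repeatedly picks the pattern with the fewest unbound 
variables, and ?b0 stands in for the blank node ??0.  Run without knowledge of 
the VALUES binding it happens to reproduce the order seen in the algebra above; 
told that ?sct_code is already bound, it puts the lsu:code pattern first.

```python
def variables(pattern):
    """Return the set of variable terms (those starting with '?') in a pattern."""
    return {term for term in pattern if term.startswith("?")}

def greedy_order(patterns, bound=()):
    """Repeatedly pick the pattern with the fewest variables not yet bound.

    Ties are broken by the original position in `patterns`.
    """
    bound = set(bound)
    remaining = list(patterns)
    ordered = []
    while remaining:
        best = min(remaining, key=lambda p: len(variables(p) - bound))
        remaining.remove(best)
        ordered.append(best)
        bound |= variables(best)  # later patterns can reuse these bindings
    return ordered

# The three patterns in the order they appear in the original query.
patterns = [
    ("?c", "skosxl:prefLabel", "?b0"),
    ("?b0", "lsu:code", "?sct_code"),
    ("?c", "skos:inScheme", "lsu:SNOMEDCT_US"),
]

unaware = greedy_order(patterns)                   # VALUES binding ignored
aware = greedy_order(patterns, bound={"?sct_code"})  # VALUES binding known

print([p[1] for p in unaware])  # ['skos:inScheme', 'skosxl:prefLabel', 'lsu:code']
print([p[1] for p in aware])    # ['lsu:code', 'skosxl:prefLabel', 'skos:inScheme']
```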

Lorenz’s suggestion of generating statistics for your dataset is a good one, 
statistics would likely guide the optimiser that the ?c skos:inScheme 
lsu:SNOMEDCT_US triple is actually very non-specific for your dataset.

You could also try Andy’s suggestion from elsewhere in this thread, i.e. --set 
arq:optReorderBGP=false passed to the CLI command in question, or, if this is 
being called from code, ARQ.getContext().set(ARQ.optReorderBGP, false);

The other thing you can do is explicitly break up your query further i.e.

{ VALUES ?sct_code { "298314008" }
    {  _:b0  lsu:code          ?sct_code .
      ?c    skosxl:prefLabel  _:b0 . }
    {  ?c    skos:inScheme     lsu:SNOMEDCT_US }
    }

This essentially forces the engine to evaluate that very unspecific triple 
pattern last.

Another possibility would be to change that triple pattern to be in a FILTER 
EXISTS condition, so it’d only be evaluated for matches to your other triple 
patterns i.e.

{ VALUES ?sct_code { "298314008" }
      _:b0  lsu:code          ?sct_code .
      ?c    skosxl:prefLabel  _:b0 .
     FILTER EXISTS {  ?c    skos:inScheme     lsu:SNOMEDCT_US }
    }

Hope this helps,

Rob

From: Lorenz Buehmann <[email protected]>
Date: Thursday, 3 November 2022 at 11:12
To: [email protected] <[email protected]>
Subject: Re: Re: Weird sparql problem
tdbquery --explain --loc  $TDB_LOC  "query here"

would also work to see the plan - maybe also increase log level to see
more: https://jena.apache.org/documentation/tdb/optimizer.html

Another question: did you generate the TDB stats so that they could be
used by the optimizer?

For debugging purposes, you could also disable query optimization (put an
empty none.opt file into the $TDB_LOC/Data-0001 dir) and reorder your query
manually, i.e.

WHERE
    { VALUES ?sct_code { "298314008" }
    _:b0  lsu:code          ?sct_code .
      ?c    skosxl:prefLabel  _:b0 .
      ?c    skos:inScheme     lsu:SNOMEDCT_US
    }

Otherwise, without stats and based only on heuristics (e.g. the number of
variables in a triple pattern), the last triple pattern might always be
evaluated first.


On 03.11.22 11:11, Mikael Pesonen wrote:

Here's the parse, hope it helps:

WHERE
    { VALUES ?sct_code { "298314008" }
      ?c    skosxl:prefLabel  _:b0 .
      _:b0  lsu:code          ?sct_code .
      ?c    skos:inScheme     lsu:SNOMEDCT_US
    }
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(prefix ((owl: <http://www.w3.org/2002/07/owl#>)
           (rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>)
           (skosxl: <http://www.w3.org/2008/05/skos-xl#>)
           (skos: <http://www.w3.org/2004/02/skos/core#>)
           (dcterms: <http://purl.org/dc/terms/>)
           (rdfs: <http://www.w3.org/2000/01/rdf-schema#>)
           (lsr: <https://resource.lingsoft.fi/>)
           (id: <http://snomed.info/id/>)
           (dcat: <http://www.w3.org/ns/dcat#>)
           (dc: <http://purl.org/dc/elements/1.1/>)
           (lsu: <https://www.lingsoft.fi/ns/umls/>))
    (sequence
      (table (vars ?sct_code)
        (row [?sct_code "298314008"])
      )
      (bgp
        (triple ?c skos:inScheme lsu:SNOMEDCT_US)
        (triple ?c skosxl:prefLabel ??0)
        (triple ??0 lsu:code ?sct_code)
      )))


On 02/11/2022 12.32, [email protected] wrote:

For these kinds of performance issues it is useful to see the SPARQL
algebra for the whole query, not just fragments of the query.  You
can use the qparse command for the version of Jena you are using to
see how it is optimising your queries, e.g.

qparse --explain --query example.rq

As Lorenz suggests this may be the optimiser making a bad guess at
the appropriate order in which to evaluate the triple patterns within
the BGP but without the larger query context or the algebra all we
can do is guess.

Rob

From: Mikael Pesonen <[email protected]>
Date: Tuesday, 1 November 2022 at 12:53
To: [email protected] <[email protected]>
Subject: Re: Weird sparql problem
Different case, but again the hanging makes no sense to the user, whatever
the technical reasons are.

     VALUES ?sct_code { "298314008" }
       ?c skosxl:prefLabel [ lsu:code ?sct_code ]

returns one row immediately, but

     VALUES ?sct_code { "298314008" }
       ?c skosxl:prefLabel [ lsu:code ?sct_code ]; skos:inScheme lsu:SNOMEDCT_US

hangs forever


     skos:inScheme lsu:SNOMEDCT_US;

On 18/10/2022 9.08, Lorenz Buehmann wrote:

Hi,

comments inline

On 17.10.22 14:35, Mikael Pesonen wrote:

This works as a separate query, but not in the middle, since ?s
gets new values instead of binding to the previous ?s.

{ select ?t where {
?s a ?t .
    } limit 10}
     ?t skos:prefLabel ?l

In the middle of what? Subqueries will be evaluated first - if you
really want labels for classes, you should use a DISTINCT in the
subquery such that the intermediate result is small, there shouldn't
be that many classes, but many instances with the same class, thus,
the join would be more expensive than necessary.
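
A minimal Python sketch of why DISTINCT in the subquery helps, using made-up 
data (9000 hypothetical instances spread over 3 classes) and a hypothetical 
helper join_with_labels.  It only counts rows and makes no claim about how 
Jena executes the join.

```python
# Hypothetical data: (instance, class) pairs and one label per class.
type_triples = [(f"s{i}", f"Class{i % 3}") for i in range(9000)]
labels = {"Class0": "Zero", "Class1": "One", "Class2": "Two"}

def join_with_labels(classes):
    """Join a sequence of class names against the label table."""
    return [(c, labels[c]) for c in classes if c in labels]

# Without DISTINCT: the subquery hands over one row per instance.
rows_plain = join_with_labels(t for _, t in type_triples)

# With DISTINCT: duplicates are removed before the join with the labels.
distinct_classes = sorted({t for _, t in type_triples})
rows_distinct = join_with_labels(distinct_classes)

print(len(rows_plain), len(rows_distinct))  # 9000 3
```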

On 17/10/2022 14.56, Mikael Pesonen wrote:

?s a ?t .
     ?t skos:prefLabel ?l

returns 3 million triples. Maybe it's related to this?

I don't see how this should be related to your initial query where ?s
was bound, which in my opinion should be an easy join. Is it possible
for you to share the dataset somehow? Also, what you can do is
compute statistics for the TDB database with the tdbstats tool [1] from
the command line and put the result into the TDB folder. But even without
statistics, the query plan should take the first triple pattern, use the
spo index as s and p are bound, then pass the bindings of ?o to the
evaluation of the second triple pattern.

[1]
https://jena.apache.org/documentation/tdb/optimizer.html#generating-a-statistics-file

On 21/09/2022 9.15, Lorenz Buehmann wrote:

Weird, only 10M triples and each triple pattern returns only 1
binding, thus, the size is tiny - honestly I can't think of
anything except for open connections, but as you mentioned, running
the queries with only one triple pattern works as expected, so
too many open connections most likely isn't the issue.

Can you reproduce this behavior with newer Jena versions like 4.6.1?

Or can you reproduce this on different servers as well?

Is it also stuck if you run the query directly after you restart
Fuseki?


On 19.09.22 13:49, Mikael Pesonen wrote:

On 15/09/2022 17.48, Lorenz Buehmann wrote:

Forgot:

- size of result for each triple pattern? Might affect if hash
join can be used.

It's one row for each.

- your hardware?

Normal server with 16gigs mem.

- is it just the first query after starting Fuseki? Connections
have been closed? Note, there was also a bug in a recent Jena
version, but only with TDB and too many open connections. It has
been resolved with release 4.6.1.

Jena has been running quite a while.

Might not be related, but I'm mentioning all things here
nevertheless.


On 15.09.22 11:16, Mikael Pesonen wrote:

This returns one row fast, say :C1

SELECT *
FROM <https://a.b.c>
WHERE {
     <https://x.y.z> a ?t .
     #?t skos:prefLabel ?l
}


and this too:

SELECT *
FROM <https://a.b.c>
WHERE {
     #<https://x.y.z> a ?t .
     :C1 skos:prefLabel ?l
}


But this always hangs until timeout

SELECT *
FROM <https://a.b.c>
WHERE {
     <https://x.y.z> a ?t .
     ?t skos:prefLabel ?l
}

What am I missing here? I'm using Fuseki web GUI. Thanks!
