SPARQL performance (new to the tech)

Lorenz Buehmann Thu, 20 May 2021 04:56:50 -0700

wouldn't you use a nested OPTIONAL here anyways?

If it is just one triple in the optional, it is less likely to be bad, but
if the query later uses a variable that was left unbound, there will be a
very large number of results, many of them duplicates and not actually
related to the ?paper. I am guessing, but I would not be surprised if your
query has variants of this and it is hidden by the "distinct".


This is the problem at:

>> ---
>>       OPTIONAL {
>>           ?author iospress:contributorAffiliation ?affiliation.
>>           ?affiliation rdfs:label ?university;
>>       }
>>        OPTIONAL {
>>         ?affiliation iospress:geocodingOutput ?geocoded.
>>         ?geocoded iospress-geocode:country ?country
>>       }
>> ---

If there is no ?affiliation, then the second OPTIONAL runs over the whole
database, which I'm guessing is many results.
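
The nested OPTIONAL suggested at the top of this message could look roughly like the sketch below. It is a minimal fragment, not the full query: it reuses the prefixes and predicates from the original query, and the SELECT list is only an assumption for illustration. Because the geocoding pattern sits inside the affiliation OPTIONAL, it is only evaluated for an ?affiliation that was actually bound.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>

# Illustrative projection only; the real query projects more variables.
SELECT ?paper ?author ?university ?country
WHERE {
  ?paper iospress:publicationAuthorList [ ?idx ?author ] .

  OPTIONAL {
    ?author iospress:contributorAffiliation ?affiliation .
    ?affiliation rdfs:label ?university .

    # Nested: this part is tried only when the outer OPTIONAL matched,
    # so ?affiliation is always bound here.
    OPTIONAL {
      ?affiliation iospress:geocodingOutput ?geocoded .
      ?geocoded iospress-geocode:country ?country .
    }
  }
}
```

Papers whose authors have no affiliation still appear in the results, with ?university and ?country left unbound.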


    Andy

this will de facto remove all papers without access from the set. This is
something I don't want (I want them in the list, just with an empty value
there) - and my understanding is that the way to manage this is an
OPTIONAL. Is this correct? Is there a "better" way? If this ends up being
costly, I could also make sure every paper actually has a value (those
without a value are technically "Closed").

Something I was wondering also is whether it makes sense to split the
fields I need for search/filtering vs the ones I want to see in the
result. I've a feeling that in theory I could play with two queries - one
with only the params I need for the filtering, then play something similar
to DESCRIBE on each record of the filtered set - but I've no idea if this
would be more performant than keeping it together as it is now.
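
One way to express the "filter first, then fetch details" idea in a single request is a subquery. The sketch below is an assumption-laden illustration rather than a tested rewrite: it keeps only the keyword condition (the date test is left out for brevity), fetches only author names as the "display" field, and reuses the predicates from the original query. Whether it is actually faster on this data set is untested.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>

SELECT ?paper ?pubDate
       (GROUP_CONCAT(DISTINCT ?authorName; separator=", ") AS ?authors)
WHERE {
  # Inner query: only the patterns needed for filtering, sorting and paging.
  {
    SELECT DISTINCT ?paper ?pubDate
    WHERE {
      ?paper iospress:publicationDate ?pubDate ;
             iospress:publicationIncludesKeyword ?keyword .
      FILTER(regex(?keyword, "sickness", "i"))
    }
    ORDER BY ?pubDate ?paper
    LIMIT 50
  }

  # Outer part: display fields, looked up for at most 50 papers.
  OPTIONAL {
    ?paper iospress:publicationAuthorList [ ?idx ?author ] .
    ?author rdfs:label ?authorName .
  }
}
GROUP BY ?paper ?pubDate
ORDER BY ?pubDate ?paper
```

The point of this shape is that the OPTIONAL patterns and the grouping only see the papers that survived the filter and the LIMIT, instead of every paper in the store.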

Anyway, the exchanges here are much appreciated!

On Tue, 18 May 2021 at 19:18, Andy Seaborne <[email protected]> wrote:

Martin,

That's a complicated query and I haven't got my head around it completely
yet.

There are some useful points to understand:

A::

What is the time and outcome of these queries that focus on the main
data location part:

1/

SELECT (count(*) AS ?C) {
   ?paper  iospress:publicationDate ?pubDate
   FILTER(...date test...)
}

2/
SELECT (count(*) AS ?C) {
   ?paper  iospress:publicationDate ?pubDate ;
           iospress:publicationIncludesKeyword ?keyword .
   FILTER (...date... && regex(?keyword, "sickness", "i"))
}

3/
SELECT (count(*) AS ?C) {
    {?paper rdf:type iospress:Chapter.}
              union
    {?paper rdf:type iospress:Article.}
    ?paper  iospress:publicationDate ?pubDate
    FILTER(...date test...)
}

4/
SELECT (count(*) AS ?C) {
   ?paper  iospress:publicationDate ?pubDate
   FILTER(...date test...)
    {?paper rdf:type iospress:Chapter.}
              union
    {?paper rdf:type iospress:Article.}
}

B::

then is it the case that some optionals have more effect than others?
Some are "high risk":

---
      OPTIONAL {
          ?author iospress:contributorAffiliation ?affiliation.
          ?affiliation rdfs:label ?university;
      }
       OPTIONAL {
        ?affiliation iospress:geocodingOutput ?geocoded.
        ?geocoded iospress-geocode:country ?country
      }
---
Suppose the first does not match; then the second produces a lot of
results unrelated to ?paper.

C::

distinct

it might be worth trying without distinct because distinct can cause a
lot of results to be reduced to just a few, hiding redundant work.
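
A rough way to see how much work "distinct" is hiding is to count raw solution rows next to distinct papers. The sketch below is only an illustration under assumptions: it uses a reduced pattern (keyword condition, authors, and the two affiliation OPTIONALs from section B) rather than the full query, and the variable names ?rows and ?papers are made up here.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>

SELECT (COUNT(*) AS ?rows) (COUNT(DISTINCT ?paper) AS ?papers)
WHERE {
  ?paper iospress:publicationIncludesKeyword ?keyword ;
         iospress:publicationAuthorList [ ?idx ?author ] .
  FILTER(regex(?keyword, "sickness", "i"))

  OPTIONAL {
    ?author iospress:contributorAffiliation ?affiliation .
    ?affiliation rdfs:label ?university .
  }
  # Same shape as the "high risk" OPTIONAL: unconstrained when
  # ?affiliation is unbound.
  OPTIONAL {
    ?affiliation iospress:geocodingOutput ?geocoded .
    ?geocoded iospress-geocode:country ?country
  }
}
```

If ?rows is far larger than the number of authors and keywords per paper would explain, the extra rows are the redundant work that distinct later throws away. Note that this count can itself be slow if the pattern blows up.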

      Andy

On 18/05/2021 13:31, Martin Van Aken wrote:

Hello again,
After some more days of me trying to get better performance & the
approval of my company, here is what we try to run (query at the bottom of
the mail).

For some context:
- This is a search for academic papers. Papers have multiple authors, and
authors are part of multiple universities. Papers also have multiple
keywords and are generally part of a set (an issue), itself part of a set
(a volume), itself part of a set (a journal).
- Our goal is to have a multicriteria search front end, so the query is
generated from a form with clauses selected by the user. The structure is
always the same; this example uses a single condition on the "keyword".
- The set of data is relatively small - around 150k papers (so probably 1M
triples there), probably around 500k authors
- We use group/concat as we want to give as results one line per paper (vs
having one per paper per keyword for example)
- I've read OPTIONALs are pretty bad - but I've no real alternative here
that I know of when some fields can be present or not and I don't want to
throw away all that miss at least one
For our current results, all but the most precise queries (getting into a
super limited set of papers, like <10) get extremely slow (easily dozens
of seconds, sometimes more). I feel that there is something obvious that
I'm missing, either in the query or my Jena config. The server is on an
old version but I make my tests locally on a 4.0.0 "out of the box" (0
configuration).

What I've tried:

- Removing the ORDER does not impact much
- Removing most optionals works... but removes the point of the query
- Using contains instead of regex does not impact much (my goal is to
use the Jena/Lucene integration for everything text related)

I'd really welcome an opinion: coming from my RDBMS background, this is the
equivalent of fewer than 3M records split across around 8 tables -
something that should be queryable mostly in sub-second times.

Any feedback is most welcome !

Martin

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
      PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>
      PREFIX iospress-dt: <http://ld.iospress.nl/rdf/datatype/>
      PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

      SELECT ?type ?pubDate ?paper ?doi ?title ?abstract ?access
          (group_concat(distinct ?authorName;separator=", ") as ?Authors)
          (group_concat(distinct ?keyword;separator=", ") as ?keywords)
          (group_concat(distinct ?university;separator=", ") as ?universities)
          (group_concat(distinct ?country;separator=", ") as ?countries)

      WHERE {
          {?paper rdf:type iospress:Chapter.}
              union
          {?paper rdf:type iospress:Article.}

          ?paper rdfs:label ?title;
                   rdf:type ?type;

                   iospress:publicationDate ?pubDate;
                   iospress:publicationAbstract ?abstract;

                   iospress:publicationIncludesKeyword ?keyword;
                   iospress:publicationAuthorList [?idx ?author].

          ?issueOrBook iospress:partOf ?volumeOrSerie.
          ?paper iospress:partOf ?issueOrBook.


      OPTIONAL {
          ?issueOrBook iospress:isbn ?bookIsbn.
      }
      OPTIONAL {
          ?paper iospress:publicationDoiUrl ?doi.
      }
      OPTIONAL {
          ?author rdfs:label ?authorName.
      }
      OPTIONAL {
          ?author iospress:contributorAffiliation ?affiliation.
          ?affiliation rdfs:label ?university;
      }
       OPTIONAL {
        ?affiliation iospress:geocodingOutput ?geocoded.
        ?geocoded iospress-geocode:country ?country
      }
      OPTIONAL {
          ?paper iospress:publicationAccessibility ?access.
      }
      OPTIONAL {
          ?volumeOrSerie iospress:partOf ?journal;
      }
      FILTER(
          (

              (datatype(?pubDate) = xsd:date && xsd:dateTime(?pubDate) >
"1999-12-31T23:00:00.000Z"^^xsd:dateTime && xsd:dateTime(?pubDate) <
"2021-05-18T12:16:58.841Z"^^xsd:dateTime ) ||
              (datatype(?pubDate) = xsd:gYear && ?pubDate >=
"2000"^^xsd:gYear && ?pubDate <= "2021"^^xsd:gYear)
          )

          && (regex (?keyword, "sickness", "i"))
          )
      }
      GROUP BY ?type ?abstract ?pubDate ?paper ?doi ?title ?access

      ORDER BY ?pubDate ?paper
      LIMIT 50


On Thu, 6 May 2021 at 20:10, Andy Seaborne <[email protected]> wrote:

Hi there,

Showing the query would be helpful but some general remarks:
1/ If the query or the setup for Fuseki needs more than the default
heap size, then it might be that the JVM is getting into a state of
heap exhaustion. This manifests as the CPU load getting very high. It
will seem like nothing is happening (waiting for response).

2/ The query may be expensive.

Things to look for
* cross products - two parts of the query pattern that are not
connected.

{ ?s ?p ?o . ?a ?b ?c } is N-squared in the size of the database.

* sort, spilling to disk, or combined with a cross product in the query.

3/ If no results are coming back, then the query is of a form that does not
stream back - sort, or CONSTRUCT maybe.

There was a useful presentation recently that talks about the principles
of query efficiency.

SPARQL Query Optimization with Pavel Klinov
https://www.youtube.com/watch?v=16eMswT2x2Y

More inline:

On 06/05/2021 09:54, Martin Van Aken wrote:
Hi!
I'm Martin, I'm a software developer new to the Triples/SPARQL world. I'm
currently building queries against a Fuseki/TDB backend (that I can work
on too) and I'm getting into significant performance problems (including
never ending queries).
Are updates also happening at the same time?
Despite what I thought was a good search on the Apache Jena website I
could not find a lot of insight about performance investigation so I'm
trying it here.
Most of my data experience comes from the relational world (ex: PG) so I'm
sometimes drawing comparisons there.
To give some context, my data set is around 15 linked concepts, with the
number of triples for each ranging from some hundreds to 500K - total less
than 2 million (documents/authors/publication kind of data).

On to questions:

      - When I'm facing a slow query, what are my investigation options? Is
      there an equivalent of an "explain plan" in SQL pointing to the
      query's specific slow points? What's the advised way for performance
      checks in SPARQL?
qparse --print=opt --file query.rq
      - Are there any performance setups to be aware of on the server side?
      Like ways to check indexes are correctly built (outside of text search,
      which I'm not working with for the moment)
      - We're currently using TDB1. I've seen the transactional benefits of
      TDB2 - are there performance improvements too that would warrant a
      migration there?
Not on the query side.

       Andy
Thanks a lot already!

Martin

Re: Re: Jena / Fuseki / SPARQL performance (new to the tech)
