SPARQL performance (new to the tech)

Martin Van Aken Tue, 18 May 2021 05:32:36 -0700

Hello again,
After a few more days of trying to get better performance, and with my
company's approval, here is what we are trying to run (query at the bottom
of this mail).


For some context:

- This is a search for academic papers. Papers have multiple authors, and
authors belong to multiple universities. Papers also have multiple
keywords and are generally part of a set (an issue), itself part of a set
(a volume), itself part of a set (a journal).
- Our goal is a multi-criteria search front end, so the query is
generated from a form with clauses selected by the user. The structure is
always the same; this example uses a single condition on the keyword.
- The data set is relatively small: around 150k papers (so probably 1M
triples there) and probably around 500k authors.
- We use GROUP BY / GROUP_CONCAT because we want one result line per paper
(vs one per paper per keyword, for example).
- I've read that OPTIONALs are pretty bad, but I know of no real
alternative here when some fields may be absent and I don't want to
discard every paper that is missing at least one of them.
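
On the OPTIONAL point above, one thing that may be worth checking is
whether every OPTIONAL starts from a variable that is already bound. In the
query at the bottom of this mail, the geocoding OPTIONAL is a sibling of the
affiliation OPTIONAL, so for an author without an affiliation ?affiliation
is still unbound and the geocoding pattern can match every geocoded
affiliation in the data set. A minimal sketch of nesting it instead, reusing
the property names from that query (the reduced SELECT list and the LIMIT
are only there for illustration):

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>

SELECT ?author ?authorName ?university ?country
WHERE {
  ?paper iospress:publicationAuthorList [ ?idx ?author ] .

  OPTIONAL { ?author rdfs:label ?authorName . }

  OPTIONAL {
    ?author iospress:contributorAffiliation ?affiliation .
    ?affiliation rdfs:label ?university .

    # Nested: only evaluated for an ?affiliation bound just above,
    # never for an unbound one.
    OPTIONAL {
      ?affiliation iospress:geocodingOutput ?geocoded .
      ?geocoded iospress-geocode:country ?country .
    }
  }
}
LIMIT 50
```

With this shape a country is only returned for an affiliation that also has
a label, which is a small change in meaning compared to the sibling form.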

For our current results, all but the most precise queries (those that
narrow down to a very small set of papers, like <10) are extremely slow
(easily dozens of seconds, sometimes more). I feel that there is something
obvious that I'm missing, either in the query or in my Jena config. The
server runs an older version, but I run my tests locally on a 4.0.0 "out of
the box" install (zero configuration).

What I've tried:

- Removing the ORDER BY does not change much
- Removing most OPTIONALs works... but defeats the point of the query
- Using CONTAINS instead of REGEX does not change much (my goal is to
use the Jena/Lucene integration for everything text-related)
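
For the Lucene route mentioned in the last point, a keyword lookup through
the Jena text index might look like the sketch below. It assumes things
that are not set up in this thread: a text-indexed dataset configured in
Fuseki, an entity map that indexes iospress:publicationIncludesKeyword, and
keywords stored as plain literals on the paper.

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>

SELECT DISTINCT ?paper
WHERE {
  # Assumption: the text index covers iospress:publicationIncludesKeyword.
  # The index is consulted first and yields candidate papers directly,
  # instead of filtering every keyword literal with REGEX.
  ?paper text:query (iospress:publicationIncludesKeyword "sickness") .
}
LIMIT 50
```

Note that a Lucene query usually matches analyzed tokens rather than
arbitrary substrings, so the results may differ from the REGEX version.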

I'd really welcome an opinion: from my RDBMS background, this is the
equivalent of fewer than 3M records split over around 8 tables, something
that should be queryable mostly in sub-second times.

Any feedback is most welcome!

Martin

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
    PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>
    PREFIX iospress-dt: <http://ld.iospress.nl/rdf/datatype/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    SELECT ?type ?pubDate ?paper ?doi ?title ?abstract ?access
        (group_concat(distinct ?authorName;separator=", ") as ?Authors)
        (group_concat(distinct ?keyword;separator=", ") as ?keywords)
        (group_concat(distinct ?university;separator=", ") as ?universities)
        (group_concat(distinct ?country;separator=", ") as ?countries)
    WHERE {
        {?paper rdf:type iospress:Chapter.}
            union
        {?paper rdf:type iospress:Article.}

        ?paper rdfs:label ?title;
                 rdf:type ?type;

                 iospress:publicationDate ?pubDate;
                 iospress:publicationAbstract ?abstract;

                 iospress:publicationIncludesKeyword ?keyword;
                 iospress:publicationAuthorList [?idx ?author].

        ?issueOrBook iospress:partOf ?volumeOrSerie.
        ?paper iospress:partOf ?issueOrBook.


    OPTIONAL {
        ?issueOrBook iospress:isbn ?bookIsbn.
    }
    OPTIONAL {
        ?paper iospress:publicationDoiUrl ?doi.
    }
    OPTIONAL {
        ?author rdfs:label ?authorName.
    }
    OPTIONAL {
        ?author iospress:contributorAffiliation ?affiliation.
        ?affiliation rdfs:label ?university;
    }
     OPTIONAL {
      ?affiliation iospress:geocodingOutput ?geocoded.
      ?geocoded iospress-geocode:country ?country
    }
    OPTIONAL {
        ?paper iospress:publicationAccessibility ?access.
    }
    OPTIONAL {
        ?volumeOrSerie iospress:partOf ?journal;
    }
    FILTER(
        (
            (datatype(?pubDate) = xsd:date
                && xsd:dateTime(?pubDate) > "1999-12-31T23:00:00.000Z"^^xsd:dateTime
                && xsd:dateTime(?pubDate) < "2021-05-18T12:16:58.841Z"^^xsd:dateTime) ||
            (datatype(?pubDate) = xsd:gYear
                && ?pubDate >= "2000"^^xsd:gYear
                && ?pubDate <= "2021"^^xsd:gYear)
        )

        && (regex(?keyword, "sickness", "i"))
    )
    }
    GROUP BY ?type ?abstract ?pubDate ?paper ?doi ?title ?access

    ORDER BY ?pubDate ?paper
    LIMIT 50
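
One restructuring that may help with a query of this shape is to select
the page of matching papers first, in a sub-select that carries the keyword
condition and the LIMIT, and only then fetch the descriptive fields for
those papers. The following is a sketch under assumptions, not a tested
replacement: it drops the date filter and most OPTIONALs for brevity, uses
?matchedKeyword as a new variable name, and, unlike the query above, lists
all keywords of a matching paper rather than only the matching ones.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>

SELECT ?paper ?title ?pubDate
       (GROUP_CONCAT(DISTINCT ?keyword; separator=", ") AS ?keywords)
       (GROUP_CONCAT(DISTINCT ?authorName; separator=", ") AS ?authors)
WHERE {
  # Step 1: find at most 50 matching papers, touching only the
  # properties needed for the search condition and the ordering.
  {
    SELECT DISTINCT ?paper ?pubDate
    WHERE {
      VALUES ?type { iospress:Chapter iospress:Article }
      ?paper rdf:type ?type ;
             iospress:publicationIncludesKeyword ?matchedKeyword ;
             iospress:publicationDate ?pubDate .
      FILTER(CONTAINS(LCASE(STR(?matchedKeyword)), "sickness"))
    }
    ORDER BY ?pubDate ?paper
    LIMIT 50
  }

  # Step 2: decorate only those papers with the fields to display.
  ?paper rdfs:label ?title ;
         iospress:publicationIncludesKeyword ?keyword .
  OPTIONAL {
    ?paper iospress:publicationAuthorList [ ?idx ?author ] .
    ?author rdfs:label ?authorName .
  }
}
GROUP BY ?paper ?title ?pubDate
ORDER BY ?pubDate ?paper
```

The idea is that the grouping and the OPTIONAL joins then run over a few
dozen papers instead of every paper that passes the filter.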


On Thu, 6 May 2021 at 20:10, Andy Seaborne <[email protected]> wrote:

> Hi there,
>
> Showing the query would be helpful but some general remarks:
>
> 1/ If the query or the Fuseki setup needs more than the default
> heap size, then the JVM may be getting into a state of heap
> exhaustion. This manifests as very high CPU load. It will seem like
> nothing is happening (waiting for a response).
>
> 2/ The query may be expensive.
>
> Things to look for
> * cross products - two parts of the query pattern that are not
> connected.
>
> { ?s ?p ?o . ?a ?b ?c } is N-squared the size of the database.
>
> * sort - spilling to disk, or combined with a cross product in the query.
>
> 3/ If no results are coming back, then the query is of a form that does
> not stream results back - sort, or CONSTRUCT maybe.
>
> There was a useful presentation recently that talks about the principles
> of query efficiency.
>
> SPARQL Query Optimization with Pavel Klinov
> https://www.youtube.com/watch?v=16eMswT2x2Y
>
> More inline:
>
> On 06/05/2021 09:54, Martin Van Aken wrote:
> > Hi!
> > I'm Martin, I'm a software developer new to the Triples/SPARQL world. I'm
> > currently building queries against a Fuseki/TDB backend (that I can work
> > on too) and I'm getting into significant performance problems (including
> > never-ending queries).
>
> Are updates also happening at the same time?
>
> > Despite what I thought was a good search on the Apache Jena
> > website, I could not find much insight about performance
> > investigation, so I'm trying it here.
> >
> > Most of my data experience comes from the relational world (ex: PG) so
> > I'm sometimes drawing comparisons there.
> >
> > To give some context, my data set is around 15 linked concepts, with the
> > number of triples for each ranging from some hundreds to 500K - in total
> > fewer than 2 million (documents/authors/publication kind of data).
> >
> > On to the questions:
> >
> >     - When I'm facing a slow query, what are my investigation options? Is
> >     there an equivalent of SQL's "explain plan" pointing to the query's
> >     specific slow points? What's the advised way to check performance in
> >     SPARQL?
>
> qparse --print=opt --file query.rq
>
> >     - Are there any performance setups to be aware of on the server side?
> >     Like ways to check that indexes are correctly built (outside of text
> >     search, which I'm not working with for the moment)
> >     - We're currently using TDB1. I've seen the transactional benefits of
> >     TDB2 - are there performance improvements too that would warrant a
> >     migration there?
>
> Not on the query side.
>
>      Andy
>
> >
> > Thanks a lot already!
> >
> > Martin
> >
>
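
The cross-product warning in the quoted reply can be made concrete with the
vocabulary of the query above. The sketch below is only an illustration:
the disconnected form is kept in comments because running it on the full
data set could take a very long time, and the COUNT is just a convenient
way to see how many rows a pattern produces.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>

# Disconnected (avoid): the two triple patterns share no variable, so
# every (?paper, ?issueOrBook) pair is combined with every labelled
# ?author:
#
#   ?paper  iospress:partOf ?issueOrBook .
#   ?author rdfs:label      ?authorName .
#
# Connected: ?author is reached from ?paper, so the label lookup is
# driven by the papers already matched.
SELECT (COUNT(*) AS ?rows)
WHERE {
  ?paper  iospress:partOf ?issueOrBook ;
          iospress:publicationAuthorList [ ?idx ?author ] .
  ?author rdfs:label ?authorName .
}
```

Saving a query as query.rq and running the qparse command quoted above on
it prints the optimized algebra, which is one way to spot patterns that
ended up disconnected.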


-- 
*Martin Van Aken - **Freelance Enthusiast Developer*

Mobile : +32 486 899 652

Follow me on Twitter : @martinvanaken <http://twitter.com/martinvanaken>
Call me on Skype : vanakenm
Hang out with me : [email protected]
Contact me on LinkedIn : http://www.linkedin.com/in/martinvanaken
Company website : www.joyouscoding.com
