Re: Jena / Fuseki / SPARQL performance (new to the tech)

Martin Van Aken Tue, 25 May 2021 00:33:32 -0700

Great, thanks for the fast and precise answers, will try this.

Martin


On Tue, 25 May 2021 at 09:25, Martynas Jusevičius <[email protected]>
wrote:

> You get Turtle data because CONSTRUCT/DESCRIBE forms return a graph
> while ASK/SELECT return a tabular result set.
>
> You can try Accept: application/ld+json request header in order to get
> JSON-LD data: https://www.w3.org/TR/json-ld11/
>
> If you need a connected node in the description, you'll need to add a
> pattern that leads to it and then add the variable to DESCRIBE.
>
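
A minimal sketch of what that advice could look like, reusing the prefixes and predicates from the query quoted further down this thread (iospress:publicationAuthorList, iospress:contributorAffiliation). The keyword filter is only an illustrative way of selecting papers, and how much of each extra resource comes back depends on the DESCRIBE implementation:

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>

# Describe each matching paper plus the authors and affiliations reachable from it.
DESCRIBE ?paper ?author ?affiliation
WHERE {
  ?paper rdf:type iospress:Article ;
         iospress:publicationIncludesKeyword ?keyword ;
         iospress:publicationAuthorList [ ?idx ?author ] .
  FILTER(regex(?keyword, "sickness", "i"))
  # Authors without an affiliation are still described; ?affiliation is simply unbound.
  OPTIONAL { ?author iospress:contributorAffiliation ?affiliation }
}
```
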
> On Tue, May 25, 2021 at 9:11 AM Martin Van Aken <[email protected]>
> wrote:
> >
> > Hello again,
> > Thanks Steve & Lorenz - I'll have a look at nested optionals (did not
> > realize that was a thing).
> >
> > I've run tests with DESCRIBE and this seems to be the way to go - I got
> > the major performance improvement I needed (about 10x). This leaves me
> > with two more questions:
> >
> > - It seems that DESCRIBE always returns some kind of TTL format - is
> > there a hidden way to get JSON (like for a SELECT query) or is this by
> > design? It's not blocking, but it would mean some parsing of the results.
> > - It seems DESCRIBE (in Jena, as I understand this is implementation
> > dependent) is limited to the object itself (i.e. all objects linked to a
> > specific subject). This works for most of my needs, but I have some
> > related data I want to get too - what's the way there? Make a secondary
> > query to get those (e.g. I'll get papers back, but papers are linked to
> > authors who work in universities and I'd need those too)? If I do so and
> > want to avoid a "SELECT N+1" kind of problem (sending a secondary query
> > per record), is there some kind of "WHERE ?paper IN (..., ..., ...)" or
> > do I just play with OR clauses?
> >
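
On the "WHERE ?paper IN (...)" question: SPARQL 1.1 has a VALUES clause that binds a variable to a fixed list of terms, so one follow-up query can cover a whole batch of papers instead of one query per record. A sketch, where the two paper IRIs are made-up placeholders and the predicates are taken from the query quoted further down this thread:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>

SELECT ?paper ?author ?authorName ?university
WHERE {
  # Hypothetical IRIs; in practice, the papers returned by the first query.
  VALUES ?paper {
    <http://example.org/paper/1>
    <http://example.org/paper/2>
  }
  ?paper iospress:publicationAuthorList [ ?idx ?author ] .
  OPTIONAL { ?author rdfs:label ?authorName }
  OPTIONAL {
    ?author iospress:contributorAffiliation ?affiliation .
    ?affiliation rdfs:label ?university
  }
}
```

FILTER(?paper IN (..., ...)) is also valid SPARQL 1.1 and expresses the same restriction.
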
> > Thanks again, this ML is having a huge impact on my knowledge & the
> > linked data project I'm working on, this is much appreciated.
> >
> > Martin
> >
> > On Thu, 20 May 2021 at 15:34, Steve Vestal <[email protected]> wrote:
> >
> > > Andy pointed at sequential OPTIONALs.  One example I have seen had
> > > nested OPTIONAL clauses to address a performance issue.  Might that be
> > > helpful here?
> > >
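
A sketch of how the nesting could look for the affiliation and geocoding patterns discussed in this thread. With the second OPTIONAL moved inside the first, the geocoding patterns only apply when ?affiliation was actually bound. The surrounding SELECT is reduced to the minimum needed to run it and is not the full original query:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>

SELECT ?paper ?author ?university ?country
WHERE {
  ?paper iospress:publicationAuthorList [ ?idx ?author ] .
  OPTIONAL {
    ?author iospress:contributorAffiliation ?affiliation .
    ?affiliation rdfs:label ?university .
    # Nested: only extends solutions in which the outer OPTIONAL matched.
    OPTIONAL {
      ?affiliation iospress:geocodingOutput ?geocoded .
      ?geocoded iospress-geocode:country ?country
    }
  }
}
```
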
> > > On 5/20/2021 5:43 AM, Andy Seaborne wrote:
> > > >
> > > >
> > > > On 20/05/2021 09:36, Martin Van Aken wrote:
> > > >> Andy,
> > > >> A big thanks for this - it gives me some paths to explore. I think
> > > >> my biggest problems are indeed in the optional parts - I'll run the
> > > >> tests you advised and also look at the cases where I may be able to
> > > >> get rid of the optionals, to avoid the situations that could lead to
> > > >> a large number of results as you mentioned. I'm already looking at
> > > >> getting my filters closer to the definition - can this be done for
> > > >> things other than pure equality (for example, for the dates that are
> > > >> tested against a range)?
> > > >>
> > > >> Maybe one question about optionals - I use them in some cases to
> > > >> avoid empty results. An example is Access - some papers have an
> > > >> Access triple (Open or Closed) - but some have none. My
> > > >> understanding is that if I make a link without an optional, like:
> > > >>
> > > >> ?paper iospress:accessibility ?access
> > > >
> > > > If it is just one triple in the optional, it is less likely to be bad,
> > > > but if the query uses the variable unbound later on, there will be a
> > > > very large number of results, many duplicates and not actually related
> > > > to the ?paper. I am guessing, but I would not be surprised if your
> > > > query has variants of this and it is hidden by the "distinct".
> > > >
> > > > This is the problem at:
> > > >
> > > > >> ---
> > > > >>       OPTIONAL {
> > > > >>           ?author iospress:contributorAffiliation ?affiliation.
> > > > >>           ?affiliation rdfs:label ?university;
> > > > >>       }
> > > > >>        OPTIONAL {
> > > > >>         ?affiliation iospress:geocodingOutput ?geocoded.
> > > > >>         ?geocoded iospress-geocode:country ?country
> > > > >>       }
> > > > >> ---
> > > >
> > > > If there is no ?affiliation, then the second OPTIONAL is over the whole
> > > > database, which I'm guessing is many results.
> > > >
> > > >     Andy
> > > >
> > > >> this will de facto remove all papers without access from the set.
> > > >> This is something I don't want (I want them in the list, just with
> > > >> an empty value there) - and my understanding is that the way to
> > > >> manage this is an OPTIONAL. Is this correct? Is there a "better"
> > > >> way? If this ends up being costly, I could also make sure those
> > > >> papers actually have a value (those without a value are technically
> > > >> "Closed").
> > > >>
> > > >> Something I was also wondering is whether it makes sense to split
> > > >> the fields I need for search/filtering from the ones I want to see
> > > >> in the result. I have a feeling that in theory I could play with two
> > > >> queries - one with only the params I need for the filtering, then
> > > >> something similar to DESCRIBE on each record of the filtered set -
> > > >> but I have no idea if this would be more performant than keeping it
> > > >> together as it is now.
> > > >>
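
One way to try this split inside a single request is a subquery: the inner SELECT does only the filtering and the LIMIT, and the outer pattern fetches display fields for the rows that survive. This is a sketch of that idea, not something tested against this dataset. The predicates and the "sickness" keyword are taken from the query quoted further down, and whether it is actually faster would need measuring:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>

SELECT ?paper ?title ?pubDate
       (group_concat(distinct ?keyword; separator=", ") AS ?keywords)
WHERE {
  {
    # Step 1: search/filter fields only, cut down to one page of papers.
    SELECT DISTINCT ?paper ?pubDate
    WHERE {
      ?paper iospress:publicationDate ?pubDate ;
             iospress:publicationIncludesKeyword ?kw .
      FILTER(regex(?kw, "sickness", "i"))
    }
    ORDER BY ?pubDate ?paper
    LIMIT 50
  }
  # Step 2: display fields, looked up only for the papers selected in step 1.
  ?paper rdfs:label ?title ;
         iospress:publicationIncludesKeyword ?keyword .
}
GROUP BY ?paper ?title ?pubDate
ORDER BY ?pubDate ?paper
```
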
> > > >> Anyway, the exchanges here are much appreciated!
> > > >>
> > > >> On Tue, 18 May 2021 at 19:18, Andy Seaborne <[email protected]> wrote:
> > > >>
> > > >>> Martin,
> > > >>>
> > > >>> That's a complicated query and I haven't got my head around it
> > > >>> completely yet.
> > > >>>
> > > >>> There are some useful points to understand:
> > > >>>
> > > >>> A::
> > > >>>
> > > >>> What is the time and outcome of these queries that focus on the
> > > >>> main data location part:
> > > >>>
> > > >>> 1/
> > > >>>
> > > >>> SELECT (count(*) AS ?C) {
> > > >>>    ?paper  iospress:publicationDate ?pubDate
> > > >>>    FILTER(...date test...)
> > > >>> }
> > > >>>
> > > >>> 2/
> > > >>> SELECT (count(*) AS ?C) {
> > > >>>    ?paper  iospress:publicationDate ?pubDate ;
> > > >>>            iospress:publicationIncludesKeyword ?keyword .
> > > >>>    FILTER (...date... && regex(?keyword, "sickness", "i"))
> > > >>> }
> > > >>>
> > > >>> 3/
> > > >>> SELECT (count(*) AS ?C) {
> > > >>>     {?paper rdf:type iospress:Chapter.}
> > > >>>               union
> > > >>>     {?paper rdf:type iospress:Article.}
> > > >>>     ?paper  iospress:publicationDate ?pubDate
> > > >>>     FILTER(...date test...)
> > > >>> }
> > > >>>
> > > >>> 4/
> > > >>> SELECT (count(*) AS ?C) {
> > > >>>    ?paper  iospress:publicationDate ?pubDate
> > > >>>    FILTER(.. date test...)
> > > >>>     {?paper rdf:type iospress:Chapter.}
> > > >>>               union
> > > >>>     {?paper rdf:type iospress:Article.}
> > > >>> }
> > > >>>
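
For reference, a runnable version of query 1/ could look like the following. The date test left elided in the message is filled in here with the FILTER from the original query quoted further down; that substitution is an assumption about what was meant, not something stated in the message:

```sparql
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT (count(*) AS ?C) {
  ?paper iospress:publicationDate ?pubDate
  # Date range test copied from the original query: one branch per datatype.
  FILTER(
    (datatype(?pubDate) = xsd:date &&
     xsd:dateTime(?pubDate) > "1999-12-31T23:00:00.000Z"^^xsd:dateTime &&
     xsd:dateTime(?pubDate) < "2021-05-18T12:16:58.841Z"^^xsd:dateTime) ||
    (datatype(?pubDate) = xsd:gYear &&
     ?pubDate >= "2000"^^xsd:gYear && ?pubDate <= "2021"^^xsd:gYear)
  )
}
```
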
> > > >>> B::
> > > >>>
> > > >>> Then, is it the case that some optionals have more effect than
> > > >>> others? Some are "high risk":
> > > >>>
> > > >>> ---
> > > >>>       OPTIONAL {
> > > >>>           ?author iospress:contributorAffiliation ?affiliation.
> > > >>>           ?affiliation rdfs:label ?university;
> > > >>>       }
> > > >>>        OPTIONAL {
> > > >>>         ?affiliation iospress:geocodingOutput ?geocoded.
> > > >>>         ?geocoded iospress-geocode:country ?country
> > > >>>       }
> > > >>> ---
> > > >>> Suppose the first does not match; then the second is a lot of
> > > >>> results unrelated to ?paper.
> > > >>>
> > > >>> C::
> > > >>>
> > > >>> distinct
> > > >>>
> > > >>> It might be worth trying without distinct because distinct can
> > > >>> cause a lot of results to be reduced to just a few, hiding redundant
> > > >>> work.
> > > >>>
> > > >>>       Andy
> > > >>>
> > > >>> On 18/05/2021 13:31, Martin Van Aken wrote:
> > > >>>> Hello again,
> > > >>>> After some more days of me trying to get better performance & the
> > > >>>> approval of my company, here is what we try to run (query at the
> > > >>>> bottom of the mail).
> > > >>>>
> > > >>>> For some context:
> > > >>>>
> > > >>>> - This is a search for academic papers. Papers have multiple
> > > >>>> authors, and authors are part of multiple universities. Papers
> > > >>>> also have multiple keywords and are generally part of a set (an
> > > >>>> issue), itself part of a set (a volume), itself part of a set (a
> > > >>>> journal).
> > > >>>> - Our goal is to have a multicriteria search front end, so the
> > > >>>> query is generated from a form with clauses selected by the user.
> > > >>>> The structure is always the same; this example uses a single
> > > >>>> condition on the "keyword".
> > > >>>> - The set of data is relatively small - around 150k papers (so
> > > >>>> probably 1M triples there), probably around 500k authors.
> > > >>>> - We use group/concat as we want to give as results one line per
> > > >>>> paper (vs having one per paper per keyword, for example).
> > > >>>> - I've read OPTIONALs are pretty bad - but I have no real
> > > >>>> alternative here that I know of when some fields can be present or
> > > >>>> not and I don't want to throw away all papers that miss at least
> > > >>>> one.
> > > >>>>
> > > >>>> For our current results, all but the most precise queries (getting
> > > >>>> down to a very limited set of papers, like <10) get extremely slow
> > > >>>> (easily dozens of seconds, sometimes more). I feel that there is
> > > >>>> something obvious that I'm missing, either in the query or my Jena
> > > >>>> config. The server is on an old version but I run my tests locally
> > > >>>> on a 4.0.0 "out of the box" (zero configuration).
> > > >>>>
> > > >>>> What I've tried:
> > > >>>>
> > > >>>> - Removing the ORDER does not have much impact
> > > >>>> - Removing most optionals works... but removes the point of the
> > > >>>> query
> > > >>>> - Using contains instead of regex does not have much impact (my
> > > >>>> goal is to use the Jena/Lucene integration for everything text
> > > >>>> related)
> > > >>>>
> > > >>>> I'm really looking for an opinion because, coming from my RDBMS
> > > >>>> background, this is the equivalent of less than 3M records split
> > > >>>> over around 8 tables - something that should be queryable mostly
> > > >>>> in sub-second times.
> > > >>>>
> > > >>>> Any feedback is most welcome !
> > > >>>>
> > > >>>> Martin
> > > >>>>
> > > >>>>       PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> > > >>>>       PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
> > > >>>>       PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
> > > >>>>       PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>
> > > >>>>       PREFIX iospress-dt: <http://ld.iospress.nl/rdf/datatype/>
> > > >>>>       PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
> > > >>>>
> > > >>>>       SELECT ?type ?pubDate ?paper ?doi ?title ?abstract ?access
> > > >>>>           (group_concat(distinct ?authorName;separator=", ") as ?Authors)
> > > >>>>           (group_concat(distinct ?keyword;separator=", ") as ?keywords)
> > > >>>>           (group_concat(distinct ?university;separator=", ") as ?universities)
> > > >>>>           (group_concat(distinct ?country;separator=", ") as ?countries)
> > > >>>>       WHERE {
> > > >>>>           {?paper rdf:type iospress:Chapter.}
> > > >>>>               union
> > > >>>>           {?paper rdf:type iospress:Article.}
> > > >>>>
> > > >>>>           ?paper rdfs:label ?title;
> > > >>>>                    rdf:type ?type;
> > > >>>>
> > > >>>>                    iospress:publicationDate ?pubDate;
> > > >>>>                    iospress:publicationAbstract ?abstract;
> > > >>>>
> > > >>>>                    iospress:publicationIncludesKeyword ?keyword;
> > > >>>>                    iospress:publicationAuthorList [?idx ?author].
> > > >>>>
> > > >>>>           ?issueOrBook iospress:partOf ?volumeOrSerie.
> > > >>>>           ?paper iospress:partOf ?issueOrBook.
> > > >>>>
> > > >>>>
> > > >>>>       OPTIONAL {
> > > >>>>           ?issueOrBook iospress:isbn ?bookIsbn.
> > > >>>>       }
> > > >>>>       OPTIONAL {
> > > >>>>           ?paper iospress:publicationDoiUrl ?doi.
> > > >>>>       }
> > > >>>>       OPTIONAL {
> > > >>>>           ?author rdfs:label ?authorName.
> > > >>>>       }
> > > >>>>       OPTIONAL {
> > > >>>>           ?author iospress:contributorAffiliation ?affiliation.
> > > >>>>           ?affiliation rdfs:label ?university;
> > > >>>>       }
> > > >>>>        OPTIONAL {
> > > >>>>         ?affiliation iospress:geocodingOutput ?geocoded.
> > > >>>>         ?geocoded iospress-geocode:country ?country
> > > >>>>       }
> > > >>>>       OPTIONAL {
> > > >>>>           ?paper iospress:publicationAccessibility ?access.
> > > >>>>       }
> > > >>>>       OPTIONAL {
> > > >>>>           ?volumeOrSerie iospress:partOf ?journal;
> > > >>>>       }
> > > >>>>       FILTER(
> > > >>>>           (
> > > >>>>               (datatype(?pubDate) = xsd:date &&
> > > >>>>                xsd:dateTime(?pubDate) > "1999-12-31T23:00:00.000Z"^^xsd:dateTime &&
> > > >>>>                xsd:dateTime(?pubDate) < "2021-05-18T12:16:58.841Z"^^xsd:dateTime ) ||
> > > >>>>               (datatype(?pubDate) = xsd:gYear &&
> > > >>>>                ?pubDate >= "2000"^^xsd:gYear && ?pubDate <= "2021"^^xsd:gYear)
> > > >>>>           )
> > > >>>>
> > > >>>>           && (regex (?keyword, "sickness", "i"))
> > > >>>>           )
> > > >>>>       }
> > > >>>>       GROUP BY ?type ?abstract ?pubDate ?paper ?doi ?title ?access
> > > >>>>
> > > >>>>       ORDER BY ?pubDate ?paper
> > > >>>>       LIMIT 50
> > > >>>>
> > > >>>>
> > > >>>> On Thu, 6 May 2021 at 20:10, Andy Seaborne <[email protected]> wrote:
> > > >>>>
> > > >>>>> Hi there,
> > > >>>>>
> > > >>>>> Showing the query would be helpful but some general remarks:
> > > >>>>>
> > > >>>>> 1/ If the query or the Fuseki setup needs more than the default
> > > >>>>> heap size, then it might be that the JVM is getting into a state
> > > >>>>> of heap exhaustion. This manifests as the CPU load getting very
> > > >>>>> high. It will seem like nothing is happening (waiting for a
> > > >>>>> response).
> > > >>>>>
> > > >>>>> 2/ The query may be expensive.
> > > >>>>>
> > > >>>>> Things to look for:
> > > >>>>> * cross products - two parts of the query pattern that are not
> > > >>>>> connected.
> > > >>>>>
> > > >>>>> { ?s ?p ?o . ?a ?b ?c } is N-squared in the size of the database.
> > > >>>>>
> > > >>>>> * sort, spilling to disk or combined with a cross product in the
> > > >>>>> query.
> > > >>>>>
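
To illustrate the cross product point with concrete patterns, here is a small sketch. The predicates are borrowed from the query posted later in this thread; the query itself is only an illustration and counts rows rather than returning them:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>

SELECT (count(*) AS ?C)
WHERE {
  ?paper iospress:publicationDate ?pubDate .
  ?author rdfs:label ?authorName .
  # This pattern shares ?paper and ?author with the two above and so connects them.
  # Without it, every dated paper would be paired with every labelled resource.
  ?paper iospress:publicationAuthorList [ ?idx ?author ] .
}
```

Comparing the count with and without the connecting pattern on a small test dataset gives a feel for how quickly a disconnected pattern blows up.
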
> > > >>>>> 3/ If no results are coming back, then the query is of a form
> > > >>>>> that does not stream back - sort, or CONSTRUCT maybe.
> > > >>>>>
> > > >>>>> There was a useful presentation recently that talks about the
> > > >>>>> principles of query efficiency.
> > > >>>>>
> > > >>>>> SPARQL Query Optimization with Pavel Klinov
> > > >>>>> https://www.youtube.com/watch?v=16eMswT2x2Y
> > > >>>>>
> > > >>>>> More inline:
> > > >>>>>
> > > >>>>> On 06/05/2021 09:54, Martin Van Aken wrote:
> > > >>>>>> Hi!
> > > >>>>>> I'm Martin, a software developer new to the Triples/SPARQL
> > > >>>>>> world. I'm currently building queries against a Fuseki/TDB
> > > >>>>>> backend (that I can work on too) and I'm running into significant
> > > >>>>>> performance problems (including never-ending queries).
> > > >>>>>
> > > >>>>> Are updates also happening at the same time?
> > > >>>>>
> > > >>>>>> Despite what I thought was a good search on the Apache
> > > >>>>>> Jena website, I could not find a lot of insight about performance
> > > >>>>>> investigation, so I'm trying here.
> > > >>>>>>
> > > >>>>>> Most of my data experience comes from the relational world (e.g.
> > > >>>>>> PG) so I'm sometimes drawing comparisons there.
> > > >>>>>>
> > > >>>>>> To give some context, my data set is around 15 linked concepts,
> > > >>>>>> with the number of triples for each ranging from some hundreds to
> > > >>>>>> 500K - total less than 2 million (documents/authors/publication
> > > >>>>>> kind of data).
> > > >>>>>>
> > > >>>>>> On to the questions:
> > > >>>>>>
> > > >>>>>>       - When I'm facing a slow query, what are my investigation
> > > >>>>>>       options? Is there an equivalent of an "explain plan" in SQL
> > > >>>>>>       pointing to the query-specific slow points? What's the
> > > >>>>>>       advised way to do performance checks in SPARQL?
> > > >>>>>
> > > >>>>> qparse --print=opt --file query.rq
> > > >>>>>
> > > >>>>>>       - Are there any performance setups to be aware of on the
> > > >>>>>>       server side? Like ways to check indexes are correctly built
> > > >>>>>>       (outside of text search, which I'm not working with for the
> > > >>>>>>       moment)
> > > >>>>>>       - We're currently using TDB1. I've seen the transactional
> > > >>>>>>       benefits of TDB2 - are there performance improvements too
> > > >>>>>>       that would warrant a migration there?
> > > >>>>>
> > > >>>>> Not on the query side.
> > > >>>>>
> > > >>>>>        Andy
> > > >>>>>
> > > >>>>>>
> > > >>>>>> Thanks a lot already!
> > > >>>>>>
> > > >>>>>> Martin
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>>
> > > >>>
> > > >>
> > > >>
> > >
> > >
> >
> > --
> > *Martin Van Aken - **Freelance Enthusiast Developer*
> >
> > Mobile : +32 486 899 652
> >
> > Follow me on Twitter : @martinvanaken <http://twitter.com/martinvanaken>
> > Call me on Skype : vanakenm
> > Hang out with me : [email protected]
> > Contact me on LinkedIn : http://www.linkedin.com/in/martinvanaken
> > Company website : www.joyouscoding.com
>


