Hello again,

After a few more days of trying to get better performance, and with my company's approval, here is what we are trying to run (query at the bottom of the mail).

For some context:

- This is a search for academic papers. Papers have multiple authors, and authors are part of multiple universities. Papers also have multiple keywords and are generally part of a set (an issue), itself part of a set (a volume), itself part of a set (a journal).
- Our goal is to have a multi-criteria search front end, so the query is generated from a form with clauses selected by the user. The structure is always the same; this example uses a single condition on the "keyword".
- The data set is relatively small: around 150k papers (so probably 1M triples there) and probably around 500k authors.
- We use group_concat as we want to return one line per paper (versus one per paper per keyword, for example).
- I've read that OPTIONALs are pretty bad, but I know of no real alternative here when some fields may or may not be present and I don't want to throw away every paper that misses at least one.

With our current setup, all but the most precise queries (those narrowing down to a very limited set of papers, like <10) get extremely slow (easily tens of seconds, sometimes more). I feel that there is something obvious that I'm missing, either in the query or in my Jena config. The server is on an old version, but I run my tests locally on a 4.0.0 "out of the box" (zero configuration).

What I've tried:

- Removing the ORDER BY does not change much
- Removing most OPTIONALs works... but removes the point of the query
- Using contains instead of regex does not change much (my goal is to use the Jena/Lucene integration for everything text related)

I'm really looking for an opinion: with my RDBMS background, this is the equivalent of fewer than 3M records split over around 8 tables, something that should be queryable in mostly sub-second times.

Any feedback is most welcome!

Martin

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>
PREFIX iospress-dt: <http://ld.iospress.nl/rdf/datatype/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?type ?pubDate ?paper ?doi ?title ?abstract ?access
  (group_concat(distinct ?authorName;separator=", ") as ?Authors)
  (group_concat(distinct ?keyword;separator=", ") as ?keywords)
  (group_concat(distinct ?university;separator=", ") as ?universities)
  (group_concat(distinct ?country;separator=", ") as ?countries)
WHERE {
  {?paper rdf:type iospress:Chapter.} union {?paper rdf:type iospress:Article.}
  ?paper rdfs:label ?title;
    rdf:type ?type;
    iospress:publicationDate ?pubDate;
    iospress:publicationAbstract ?abstract;
    iospress:publicationIncludesKeyword ?keyword;
    iospress:publicationAuthorList [?idx ?author].
  ?issueOrBook iospress:partOf ?volumeOrSerie.
  ?paper iospress:partOf ?issueOrBook.
  OPTIONAL { ?issueOrBook iospress:isbn ?bookIsbn. }
  OPTIONAL { ?paper iospress:publicationDoiUrl ?doi. }
  OPTIONAL { ?author rdfs:label ?authorName. }
  OPTIONAL { ?author iospress:contributorAffiliation ?affiliation. ?affiliation rdfs:label ?university; }
  OPTIONAL { ?affiliation iospress:geocodingOutput ?geocoded. ?geocoded iospress-geocode:country ?country }
  OPTIONAL { ?paper iospress:publicationAccessibility ?access. }
  OPTIONAL { ?volumeOrSerie iospress:partOf ?journal; }
  FILTER(
    (
      (datatype(?pubDate) = xsd:date
        && xsd:dateTime(?pubDate) > "1999-12-31T23:00:00.000Z"^^xsd:dateTime
        && xsd:dateTime(?pubDate) < "2021-05-18T12:16:58.841Z"^^xsd:dateTime)
      || (datatype(?pubDate) = xsd:gYear
        && ?pubDate >= "2000"^^xsd:gYear
        && ?pubDate <= "2021"^^xsd:gYear)
    )
    && (regex (?keyword, "sickness", "i"))
  )
}
GROUP BY ?type ?abstract ?pubDate ?paper ?doi ?title ?access
ORDER BY ?pubDate ?paper
LIMIT 50

On Thu, 6 May 2021 at 20:10, Andy Seaborne <[email protected]> wrote:

> Hi there,
>
> Showing the query would be helpful but some general remarks:
>
> 1/ If the query or the setup for Fuseki is needing more than the default
> heap size, then it might be that the Java JVM is getting into a state of
> heap exhaustion. This manifests as the CPU loading getting very high. It
> will seem like nothing is happening (waiting for response).
>
> 2/ The query may be expensive.
>
> Things to look for
> * cross products - two parts of the query pattern that are not
>   connected.
>
>   { ?s ?p ?o . ?a ?b ?c } is N-squared the size of the database.
>
> * sort, spilling to disk or combined with a cross product the query.
>
> 3/ If no results are coming back, then the query is form that does not
> stream back - sort, or CONSTRUCT maybe.
>
> There was a useful presentation recently that talks about the principles
> of query efficiency.
>
> SPARQL Query Optimization with Pavel Klinov
> https://www.youtube.com/watch?v=16eMswT2x2Y
>
> More inline:
>
> On 06/05/2021 09:54, Martin Van Aken wrote:
> > Hi!
> > I'm Martin, I'm a software developer new to the Triples/SPARQL world. I'm
> > currently building queries against a Fuseki/TDB backend (that I can work on
> > too) and I'm getting into significant performance problems (including never
> > ending queries).
>
> Are updates also happening at the same time?
>
> > Despite what I thought was a good search on the apache
> > jena website I could not find a lot of insight about performance
> > investigation so I'm trying it here.
> >
> > Most of my data experience comes from the relational world (ex: PG) so I'm
> > sometimes drawing comparisons there.
> >
> > To give some context my data set is around 15 linked concepts, with the
> > number of triples for each ranging from some hundreds to 500K - total less
> > than 2 millions (documents/authors/publication kind of data).
> >
> > Unto questions:
> >
> > - When I'm facing a slow query, what are my investigation options. Is
> >   there an equivalent of an "explain plan" in SQL pointing to the query
> >   specific slow points? What's the advised way for performance checks in
> >   SPARQL?
>
> qparse --print=opt --file query.rq
>
> > - Are there any performance setups to be aware of on the server side?
> >   Like ways to check indexes are correctly built (outside of text search that
> >   I'm not working with for the moment)
> > - We're currently using TDB1. I've seen the transactional benefits of
> >   TDB2 - are there performance improvements too that would warrant a
> >   migration there ?
>
> Not on the query side.
>
> Andy
>
> >
> > Thanks a lot already!
> >
> > Martin
> >

--
*Martin Van Aken - **Freelance Enthusiast Developer*
Mobile : +32 486 899 652
Follow me on Twitter : @martinvanaken <http://twitter.com/martinvanaken>
Call me on Skype : vanakenm
Hang out with me : [email protected]
Contact me on LinkedIn : http://www.linkedin.com/in/martinvanaken
Company website : www.joyouscoding.com
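
To make the cross-product remark quoted in this thread concrete, here is a minimal Python sketch on made-up data. It is not how Jena evaluates a query: the toy triples and the function names (unconnected_pairs, rows_per_paper, group_concat_distinct) are hypothetical and only illustrate the arithmetic. Two patterns with no shared variable pair every triple with every triple, while patterns joined on ?paper still produce one row per keyword/author combination before any grouping.

```python
from itertools import product

# Toy triple set (subject, predicate, object). All values are made up.
triples = [
    ("paper1", "keyword", "motion sickness"),
    ("paper1", "keyword", "virtual reality"),
    ("paper1", "author", "author1"),
    ("paper1", "author", "author2"),
    ("paper1", "author", "author3"),
    ("paper2", "keyword", "sea sickness"),
    ("paper2", "author", "author4"),
]


def unconnected_pairs(data):
    """Mimic { ?s ?p ?o . ?a ?b ?c }: no shared variable, so N x N solutions."""
    return list(product(data, data))


def rows_per_paper(data):
    """Mimic { ?paper keyword ?k . ?paper author ?a }: joined on ?paper.

    Each paper still yields (number of keywords) x (number of authors) rows.
    """
    rows = {}
    for (s1, p1, o1), (s2, p2, o2) in product(data, data):
        if p1 == "keyword" and p2 == "author" and s1 == s2:
            rows.setdefault(s1, []).append((o1, o2))
    return rows


def group_concat_distinct(rows, index):
    """Collapse the multiplied rows back to one sorted, de-duplicated string per paper."""
    return {
        paper: ", ".join(sorted({row[index] for row in paper_rows}))
        for paper, paper_rows in rows.items()
    }


if __name__ == "__main__":
    rows = rows_per_paper(triples)
    print(len(unconnected_pairs(triples)))       # 7 triples -> 49 pairs
    print({p: len(r) for p, r in rows.items()})  # rows produced per paper
    print(group_concat_distinct(rows, 0))        # one keyword string per paper
```

In this toy data, paper1 has 2 keywords and 3 authors, so it contributes 6 intermediate rows that the grouping step then collapses to one line. A query that also joins universities and countries per author may multiply the intermediate rows further in the same way, which is one thing worth checking when looking at where the time goes.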
