On 5/20/2021 5:43 AM, Andy Seaborne wrote:
On 20/05/2021 09:36, Martin Van Aken wrote:
> Andy,
>
> A big thanks for this - it gives me some paths to explore. I think
> indeed my biggest problems are in the optional parts - I'll run the
> tests you advised and also look at the cases where I may be able to
> get rid of the optionals, to avoid those situations that could lead
> to a big amount of results as you mentioned. I'm already looking at
> getting my filters closer to the definition - can this be done for
> things other than pure equality (for example for the dates that are
> tested for a range)?
>
> Maybe one question about optional - I use them in some cases to avoid
> empty results. An example is Access - some papers have an Access
> triple (Open or Closed) - but some have none. My understanding is
> that if I make a link without optional like:
>
> ?paper iospress:accessibility ?access

If it is just one triple in the OPTIONAL, it is less likely to be bad,
but if the query uses the variable later on while it is unbound, there
will be a very large number of results, many of them duplicates and not
actually related to the ?paper.

I am guessing, but I would not be surprised if your query has variants
of this and it is hidden by the "distinct".

This is the problem at:

>> ---
>> OPTIONAL {
>>   ?author iospress:contributorAffiliation ?affiliation.
>>   ?affiliation rdfs:label ?university;
>> }
>> OPTIONAL {
>>   ?affiliation iospress:geocodingOutput ?geocoded.
>>   ?geocoded iospress-geocode:country ?country
>> }
>> ---

If there is no ?affiliation, then the second OPTIONAL is over the whole
database, which I'm guessing is many results.

    Andy

> this will de facto remove all papers without access from the set.
> This is something I don't want (I want them in the list, just with an
> empty value there) - and my understanding is that the way to manage
> this is an Optional. Is this correct? Is there a "better" way? If
> this ends up being costly, I could also check to actually have a
> value for those (those without value are technically "Closed").
>
> Something I was wondering also is whether it makes sense to split the
> fields I need for search/filtering vs the ones I want to see in the
> result. I've a feeling that in theory I could play with two queries -
> one with only the params I need for the filtering, then play
> something similar to DESCRIBE on each record of the filtered set -
> but I've no idea if this would be more performant than keeping it
> together as it is now.
>
> Anyway, the exchanges here are much appreciated!
>
> On Tue, 18 May 2021 at 19:18, Andy Seaborne wrote:
>> Martin,
>>
>> That's a complicated query and I haven't got my head around it
>> completely yet.
>>
>> There are some useful points to understand:
>>
>> A:: What is the time and outcome of these queries that focus on the
>> main data location part:
>>
>> 1/
>> SELECT (count(*) AS ?C) {
>>   ?paper iospress:publicationDate ?pubDate
>>   FILTER(...date test...)
>> }
>>
>> 2/
>> SELECT (count(*) AS ?C) {
>>   ?paper iospress:publicationDate ?pubDate ;
>>          iospress:publicationIncludesKeyword ?keyword .
>>   FILTER(...date... && regex(?keyword, "sickness", "i"))
>> }
>>
>> 3/
>> SELECT (count(*) AS ?C) {
>>   {?paper rdf:type iospress:Chapter.}
>>   union
>>   {?paper rdf:type iospress:Article.}
>>   ?paper iospress:publicationDate ?pubDate
>>   FILTER(...date test...)
>> }
>>
>> 4/
>> SELECT (count(*) AS ?C) {
>>   ?paper iospress:publicationDate ?pubDate
>>   FILTER(...date test...)
>>   {?paper rdf:type iospress:Chapter.}
>>   union
>>   {?paper rdf:type iospress:Article.}
>> }
>>
>> B:: Then, is it the case that some optionals have more effect than
>> others?
>>
>> Some are "high risk":
>>
>> ---
>> OPTIONAL {
>>   ?author iospress:contributorAffiliation ?affiliation.
>>   ?affiliation rdfs:label ?university;
>> }
>> OPTIONAL {
>>   ?affiliation iospress:geocodingOutput ?geocoded.
>>   ?geocoded iospress-geocode:country ?country
>> }
>> ---
>>
>> Suppose the first does not match: then the second produces a lot of
>> results unrelated to ?paper.
>>
>> C:: distinct
>>
>> It might be worth trying without distinct, because distinct can
>> cause a lot of results to be reduced to just a few, hiding redundant
>> work.
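
One possible way to keep the second OPTIONAL tied to the first, sketched here as an assumption rather than something spelled out in the thread, is to nest it inside the first. The geocoding pattern is then only tried for solutions where ?affiliation has actually been bound. The prefixes and property names are copied from the query in this thread; the SELECT list and the single mandatory triple are a cut-down, hypothetical wrapper so the fragment is a complete query, and it has not been run against this data.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>

SELECT ?paper ?author ?university ?country
WHERE {
  # Mandatory part (cut down): binds ?paper and ?author.
  ?paper iospress:publicationAuthorList [ ?idx ?author ] .

  OPTIONAL {
    ?author iospress:contributorAffiliation ?affiliation .
    ?affiliation rdfs:label ?university .

    # Nested OPTIONAL: only evaluated together with the outer OPTIONAL,
    # so ?affiliation is always bound when this pattern is matched.
    OPTIONAL {
      ?affiliation iospress:geocodingOutput ?geocoded .
      ?geocoded iospress-geocode:country ?country
    }
  }
}
```

With this shape, an author without an affiliation yields one row with ?university and ?country unbound, instead of being joined against every geocoded affiliation in the database.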
>>
>>     Andy
>>
>> On 18/05/2021 13:31, Martin Van Aken wrote:
>>> Hello again,
>>>
>>> After some more days of me trying to get better performance & the
>>> approval of my company, here is what we try to run (query at the
>>> bottom of the mail). For some context:
>>>
>>> - This is a search for academic papers. Papers have multiple
>>>   authors, and authors are part of multiple universities. Papers
>>>   also have multiple keywords and are generally part of a set (an
>>>   issue), itself part of a set (a volume), itself part of a set (a
>>>   journal).
>>> - Our goal is to have a multicriteria search front end, so the
>>>   query is generated from a form with clauses selected by the user.
>>>   The structure is always the same; this example uses a single
>>>   condition on the "keyword".
>>> - The set of data is relatively small - around 150k papers (so
>>>   probably 1M triples there), probably around 500k authors.
>>> - We use group/concat as we want to give as results one line per
>>>   paper (vs having one per paper per keyword, for example).
>>> - I've read OPTIONALs are pretty bad - but I've no real alternative
>>>   here that I know of when some fields can be present or not and I
>>>   don't want to throw away all papers that miss at least one.
>>>
>>> For our current results, all but the most precise queries (getting
>>> into a super limited set of papers, like <10) get extremely slow
>>> (easily dozens of seconds, sometimes more). I feel that there is
>>> something obvious that I'm missing, either in the query or my Jena
>>> config. The server is on an old version but I make my tests locally
>>> on a 4.0.0 "out of the box" (0 configuration).
>>>
>>> What I've tried:
>>>
>>> - Removing the ORDER does not impact much.
>>> - Removing most optionals works... but removes the point of the
>>>   query.
>>> - Using contains instead of regex does not impact much (I've the
>>>   goal to use the Jena/Lucene integration for everything text
>>>   related).
>>>
>>> I'm really in for an opinion as, taking my RDBMS background, this
>>> is the equivalent of less than 3M records split over around 8
>>> tables - something that should be queryable mostly in sub-second
>>> times.
>>>
>>> Any feedback is most welcome!
>>>
>>> Martin
>>>
>>> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
>>> PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
>>> PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>
>>> PREFIX iospress-dt: <http://ld.iospress.nl/rdf/datatype/>
>>> PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
>>>
>>> SELECT ?type ?pubDate ?paper ?doi ?title ?abstract ?access
>>>   (group_concat(distinct ?authorName;separator=", ") as ?Authors)
>>>   (group_concat(distinct ?keyword;separator=", ") as ?keywords)
>>>   (group_concat(distinct ?university;separator=", ") as ?universities)
>>>   (group_concat(distinct ?country;separator=", ") as ?countries)
>>> WHERE {
>>>   {?paper rdf:type iospress:Chapter.}
>>>   union
>>>   {?paper rdf:type iospress:Article.}
>>>   ?paper rdfs:label ?title;
>>>     rdf:type ?type;
>>>     iospress:publicationDate ?pubDate;
>>>     iospress:publicationAbstract ?abstract;
>>>     iospress:publicationIncludesKeyword ?keyword;
>>>     iospress:publicationAuthorList [?idx ?author].
>>>   ?issueOrBook iospress:partOf ?volumeOrSerie.
>>>   ?paper iospress:partOf ?issueOrBook.
>>>   OPTIONAL { ?issueOrBook iospress:isbn ?bookIsbn. }
>>>   OPTIONAL { ?paper iospress:publicationDoiUrl ?doi. }
>>>   OPTIONAL { ?author rdfs:label ?authorName. }
>>>   OPTIONAL {
>>>     ?author iospress:contributorAffiliation ?affiliation.
>>>     ?affiliation rdfs:label ?university;
>>>   }
>>>   OPTIONAL {
>>>     ?affiliation iospress:geocodingOutput ?geocoded.
>>>     ?geocoded iospress-geocode:country ?country
>>>   }
>>>   OPTIONAL { ?paper iospress:publicationAccessibility ?access. }
>>>   OPTIONAL { ?volumeOrSerie iospress:partOf ?journal; }
>>>   FILTER(
>>>     (
>>>       (datatype(?pubDate) = xsd:date
>>>         && xsd:dateTime(?pubDate) > "1999-12-31T23:00:00.000Z"^^xsd:dateTime
>>>         && xsd:dateTime(?pubDate) < "2021-05-18T12:16:58.841Z"^^xsd:dateTime)
>>>       ||
>>>       (datatype(?pubDate) = xsd:gYear
>>>         && ?pubDate >= "2000"^^xsd:gYear
>>>         && ?pubDate <= "2021"^^xsd:gYear)
>>>     )
>>>     && (regex (?keyword, "sickness", "i"))
>>>   )
>>> }
>>> GROUP BY ?type ?abstract ?pubDate ?paper ?doi ?title ?access
>>> ORDER BY ?pubDate ?paper
>>> LIMIT 50
>>>
>>> On Thu, 6 May 2021 at 20:10, Andy Seaborne wrote:
>>>> Hi there,
>>>>
>>>> Showing the query would be helpful, but some general remarks:
>>>>
>>>> 1/ If the query or the setup for Fuseki needs more than the
>>>> default heap size, then it might be that the Java JVM is getting
>>>> into a state of heap exhaustion. This manifests as the CPU load
>>>> getting very high. It will seem like nothing is happening (waiting
>>>> for a response).
>>>>
>>>> 2/ The query may be expensive. Things to look for:
>>>>
>>>> * cross products - two parts of the query pattern that are not
>>>>   connected.
>>>>   { ?s ?p ?o . ?a ?b ?c }
>>>>   is N-squared in the size of the database.
>>>>
>>>> * sort, spilling to disk or combined with a cross product in the
>>>>   query.
>>>>
>>>> 3/ If no results are coming back, then the query is of a form that
>>>> does not stream back - sort, or CONSTRUCT maybe.
>>>>
>>>> There was a useful presentation recently that talks about the
>>>> principles of query efficiency.
>>>>
>>>> SPARQL Query Optimization with Pavel Klinov
>>>> https://www.youtube.com/watch?v=16eMswT2x2Y
>>>>
>>>> More inline:
>>>>
>>>> On 06/05/2021 09:54, Martin Van Aken wrote:
>>>>> Hi!
>>>>> I'm Martin, I'm a software developer new to the Triples/SPARQL
>>>>> world. I'm currently building queries against a Fuseki/TDB
>>>>> backend (that I can work on too) and I'm getting into significant
>>>>> performance problems (including never-ending queries).
>>>>
>>>> Are updates also happening at the same time?
>>>>
>>>>> Despite what I thought was a good search on the Apache Jena
>>>>> website I could not find a lot of insight about performance
>>>>> investigation, so I'm trying it here.
>>>>> Most of my data experience comes from the relational world (ex:
>>>>> PG) so I'm sometimes drawing comparisons there.
>>>>> To give some context, my data set is around 15 linked concepts,
>>>>> with the number of triples for each ranging from some hundreds to
>>>>> 500K - in total less than 2 million (documents/authors/publication
>>>>> kind of data).
>>>>> On to the questions:
>>>>> - When I'm facing a slow query, what are my investigation
>>>>>   options? Is there an equivalent of an "explain plan" in SQL
>>>>>   pointing to the query-specific slow points? What's the advised
>>>>>   way for performance checks in SPARQL?
>>>>
>>>> qparse --print=opt --file query.rq
>>>>
>>>>> - Are there any performance setups to be aware of on the server
>>>>>   side? Like ways to check indexes are correctly built (outside
>>>>>   of text search that I'm not working with for the moment)
>>>>> - We're currently using TDB1. I've seen the transactional
>>>>>   benefits of TDB2 - are there performance improvements too that
>>>>>   would warrant a migration there?
>>>>
>>>> Not on the query side.
>>>>
>>>>     Andy
>>>>
>>>>> Thanks a lot already!
>>>>> Martin
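
Martin's idea of separating the fields used for filtering from the fields shown in the result could, as one option, be written as a single query with a subquery instead of two round trips. The following is only a sketch under stated assumptions: it reuses the prefixes, properties, and filter values from the query in this thread, keeps just the xsd:gYear branch of the date test and two of the OPTIONALs for brevity, and has not been run against this data or suggested by Andy. Whether it is actually faster would need to be measured.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?paper ?pubDate ?title ?doi ?access
WHERE {
  # Inner query: pick the page of matching papers using only the
  # properties that the search criteria need.
  {
    SELECT DISTINCT ?paper ?pubDate
    WHERE {
      ?paper iospress:publicationDate ?pubDate ;
             iospress:publicationIncludesKeyword ?keyword .
      FILTER( datatype(?pubDate) = xsd:gYear
              && ?pubDate >= "2000"^^xsd:gYear
              && ?pubDate <= "2021"^^xsd:gYear
              && regex(?keyword, "sickness", "i") )
    }
    ORDER BY ?pubDate ?paper
    LIMIT 50
  }

  # Outer part: fetch display fields for those papers only, so each
  # OPTIONAL starts from at most 50 bound values of ?paper.
  ?paper rdfs:label ?title .
  OPTIONAL { ?paper iospress:publicationDoiUrl ?doi . }
  OPTIONAL { ?paper iospress:publicationAccessibility ?access . }
}
ORDER BY ?pubDate ?paper
```

The DISTINCT in the inner query is there because a paper with several matching keywords would otherwise take up more than one of the 50 slots.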
