Great, thanks for the fast and precise answers, will try this.

Martin
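
A minimal sketch of the suggestion quoted below (add a pattern that leads to the connected node, then add its variable to DESCRIBE), combined with a VALUES block so that a batch of papers goes into one request instead of one query per record. The property names come from the query later in this thread; the two paper IRIs are hypothetical placeholders, and this has not been run against the real data.

```sparql
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>

# Describe a batch of papers together with their authors and affiliations.
# The IRIs in VALUES are hypothetical; substitute the papers returned by
# the filtering query.
DESCRIBE ?paper ?author ?affiliation
WHERE {
  VALUES ?paper {
    <http://example.org/paper/1>
    <http://example.org/paper/2>
  }
  OPTIONAL {
    ?paper iospress:publicationAuthorList [ ?idx ?author ] .
    OPTIONAL { ?author iospress:contributorAffiliation ?affiliation }
  }
}
```

SPARQL 1.1 also has an IN operator, so FILTER(?paper IN (<...>, <...>)) is another way to write the restriction; VALUES is used here because it binds ?paper directly.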
On Tue, 25 May 2021 at 09:25, Martynas Jusevičius <marty...@atomgraph.com> wrote:

> You get Turtle data because CONSTRUCT/DESCRIBE forms return a graph
> while ASK/SELECT return a tabular result set.
>
> You can try the Accept: application/ld+json request header in order to get
> JSON-LD data: https://www.w3.org/TR/json-ld11/
>
> If you need a connected node in the description, you'll need to add a
> pattern that leads to it and then add the variable to DESCRIBE.
>
> On Tue, May 25, 2021 at 9:11 AM Martin Van Aken <mar...@joyouscoding.com>
> wrote:
> >
> > Hello again,
> > Thanks Steve & Lorenz - I'll have a look at nested optionals (did not
> > realize that was a thing).
> >
> > I've made tests with DESCRIBE and this seems to be the way to go - I got
> > the major performance improvement I needed (like 10x). This leaves me
> > with two more questions:
> >
> > - It seems that DESCRIBE always returns some kind of TTL format - is
> > there a hidden way to get JSON (like for a SELECT query) or is this by
> > design? It's not blocking but would mean some parsing of the results
> > - It seems DESCRIBE (in Jena, as I understood this is implementation
> > dependent) is limited to the object itself (i.e. all objects linked to a
> > specific subject). This works for most of my needs, but I've some related
> > data I want to get too - what's the way there? Make a secondary query to
> > get those (ex: I'll get papers back, but papers are linked to authors
> > that are working in universities and I'd need those too)? If I do so and
> > want to avoid a "SELECT N+1" kind of problem (sending a secondary query
> > per record) is there some kind of "WHERE ?paper IN (..., ..., ...)" or do
> > I just play with OR clauses?
> >
> > Thanks again, this ML is having a huge impact on my knowledge & the
> > linked data project I'm working on, this is much appreciated.
> >
> > Martin
> >
> > On Thu, 20 May 2021 at 15:34, Steve Vestal <steve.ves...@adventiumlabs.com>
> > wrote:
> >
> > > Andy pointed at sequential OPTIONALs. One example I have seen had
> > > nested OPTIONAL clauses to address a performance issue. Might that be
> > > helpful here?
> > >
> > > On 5/20/2021 5:43 AM, Andy Seaborne wrote:
> > > >
> > > >
> > > > On 20/05/2021 09:36, Martin Van Aken wrote:
> > > >> Andy,
> > > >> A big thanks for this - it gives me some paths to explore. I think
> > > >> indeed my biggest problems are in the optional parts - I'll run the
> > > >> test you advised and also look in which case I may be able to get rid
> > > >> of the optionals to avoid those situations that could lead to a big
> > > >> amount of results as you mentioned. I'm already looking at getting my
> > > >> filters closer to definition - can this be done for things other than
> > > >> pure equality (for example for the dates that are tested against a
> > > >> range)?
> > > >>
> > > >> Maybe one question about optional - I use them in some cases to avoid
> > > >> empty results. An example is Access - some papers have an Access
> > > >> triple (Open or Closed) - but some have none. My understanding is
> > > >> that if I make a link without optional like:
> > > >>
> > > >> ?paper iospress:accessibility ?access
> > > >
> > > > If it is just one triple in the optional it is less likely to be bad
> > > > but if the query uses the variable unbound later on, there will be a
> > > > very large number of results, many duplicates and not actually related
> > > > to the ?paper. I am guessing but I would be surprised if your query has
> > > > variants of this and it is hidden by the "distinct".
> > > >
> > > > This is the problem at:
> > > >
> > > > >> ---
> > > > >> OPTIONAL {
> > > > >>   ?author iospress:contributorAffiliation ?affiliation.
> > > > >>   ?affiliation rdfs:label ?university;
> > > > >> }
> > > > >> OPTIONAL {
> > > > >>   ?affiliation iospress:geocodingOutput ?geocoded.
> > > > >>   ?geocoded iospress-geocode:country ?country
> > > > >> }
> > > > >> ---
> > > >
> > > > If no ?affiliation, then the second OPTIONAL is over the whole
> > > > database which I'm guessing is many results.
> > > >
> > > > Andy
> > > >
> > > >> this will de facto remove all papers without access from the set.
> > > >> This is something I don't want (I want them in the list, just with an
> > > >> empty value there) - and my understanding is that the way to manage
> > > >> this is an Optional. Is this correct? Is there a "better" way? If this
> > > >> ends up being costly, I could also check to actually have a value for
> > > >> those (those without value are technically "Closed").
> > > >>
> > > >> Something I was wondering also is whether it makes sense to split the
> > > >> fields I need for search/filtering vs the ones I want to see on the
> > > >> result. I've a feeling that in theory I could play with two queries -
> > > >> one with only the params I need for the filtering, then play something
> > > >> similar to DESCRIBE on each record on the filtered set - but I've no
> > > >> idea if this would be more performant than keeping it together as it
> > > >> is now.
> > > >>
> > > >> Anyway, the exchanges here are much appreciated!
> > > >>
> > > >> On Tue, 18 May 2021 at 19:18, Andy Seaborne <a...@apache.org> wrote:
> > > >>
> > > >>> Martin,
> > > >>>
> > > >>> That's a complicated query and I haven't got my head around it
> > > >>> completely yet.
> > > >>>
> > > >>> There are some useful points to understand:
> > > >>>
> > > >>> A::
> > > >>>
> > > >>> What is the time and outcome of these queries that focus on the main
> > > >>> data location part:
> > > >>>
> > > >>> 1/
> > > >>>
> > > >>> SELECT (count(*) AS ?C) {
> > > >>>   ?paper iospress:publicationDate ?pubDate
> > > >>>   FILTER(...date test...)
> > > >>> }
> > > >>>
> > > >>> 2/
> > > >>> SELECT (count(*) AS ?C) {
> > > >>>   ?paper iospress:publicationDate ?pubDate ;
> > > >>>          iospress:publicationIncludesKeyword ?keyword .
> > > >>>   FILTER (...date... && regex(?keyword, "sickness", "i"))
> > > >>> }
> > > >>>
> > > >>> 3/
> > > >>> SELECT (count(*) AS ?C) {
> > > >>>   {?paper rdf:type iospress:Chapter.}
> > > >>>   union
> > > >>>   {?paper rdf:type iospress:Article.}
> > > >>>   ?paper iospress:publicationDate ?pubDate
> > > >>>   FILTER(...date test...)
> > > >>> }
> > > >>>
> > > >>> 4/
> > > >>> SELECT (count(*) AS ?C) {
> > > >>>   ?paper iospress:publicationDate ?pubDate
> > > >>>   FILTER(...date test...)
> > > >>>   {?paper rdf:type iospress:Chapter.}
> > > >>>   union
> > > >>>   {?paper rdf:type iospress:Article.}
> > > >>> }
> > > >>>
> > > >>> B::
> > > >>>
> > > >>> then is it the case that some optionals have more effect than others?
> > > >>> Some are "high risk":
> > > >>>
> > > >>> ---
> > > >>> OPTIONAL {
> > > >>>   ?author iospress:contributorAffiliation ?affiliation.
> > > >>>   ?affiliation rdfs:label ?university;
> > > >>> }
> > > >>> OPTIONAL {
> > > >>>   ?affiliation iospress:geocodingOutput ?geocoded.
> > > >>>   ?geocoded iospress-geocode:country ?country
> > > >>> }
> > > >>> ---
> > > >>> Suppose the first does not match then the second is a lot of results
> > > >>> unrelated to ?paper.
> > > >>>
> > > >>> C::
> > > >>>
> > > >>> distinct
> > > >>>
> > > >>> it might be worth trying without distinct because distinct can cause
> > > >>> a lot of results to be reduced to just a few, hiding redundant work.
> > > >>>
> > > >>> Andy
> > > >>>
> > > >>> On 18/05/2021 13:31, Martin Van Aken wrote:
> > > >>>> Hello again,
> > > >>>> After some more days of me trying to get a better performance & the
> > > >>>> approval of my company, here is what we try to run (query at the
> > > >>>> bottom of the mail).
> > > >>>>
> > > >>>> For some context:
> > > >>>>
> > > >>>> - This is a search for academic papers. Papers have multiple
> > > >>>> authors, and authors are part of multiple universities. Papers also
> > > >>>> have multiple keywords and are generally part of a set (an issue)
> > > >>>> itself part of a set (a volume) itself part of a set (a journal).
> > > >>>> - Our goal is to have a multicriteria search front end, so the query
> > > >>>> is generated from a form with clauses selected by the user. The
> > > >>>> structure is always the same, this example uses a single condition
> > > >>>> on the "keyword"
> > > >>>> - The set of data is relatively small - around 150k papers (so
> > > >>>> probably 1M triples there), probably around 500k authors
> > > >>>> - We use group/concat as we want to give as results one line per
> > > >>>> paper (vs having one per paper per keyword for example)
> > > >>>> - I've read OPTIONALS are pretty bad - but I've no real alternative
> > > >>>> here that I know of when some fields can be present or not and I
> > > >>>> don't want to throw away all that miss at least one
> > > >>>>
> > > >>>> For our current results, all but the most precise queries (getting
> > > >>>> into a super limited set of papers, like <10) get extremely slow
> > > >>>> (easily dozens of seconds, sometimes more). I feel that there is
> > > >>>> something obvious that I'm missing, either in the query or my Jena
> > > >>>> config.
> > > >>>> The server is on an old version but I make my tests locally on a
> > > >>>> 4.0.0 "out of the box" (0 configuration).
> > > >>>>
> > > >>>> What I've tried:
> > > >>>>
> > > >>>> - Removing the ORDER does not impact much
> > > >>>> - Removing most optionals works... but removes the point of the query
> > > >>>> - Using contains instead of regex does not impact much (I've the
> > > >>>> goal to use Jena/Lucene integration for everything text related)
> > > >>>>
> > > >>>> I'm really in for an opinion as, taking my RDBMS background, this is
> > > >>>> the equivalent of less than 3M records split on around 8 tables -
> > > >>>> something that should be queryable mostly in sub second times.
> > > >>>>
> > > >>>> Any feedback is most welcome!
> > > >>>>
> > > >>>> Martin
> > > >>>>
> > > >>>> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> > > >>>> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
> > > >>>> PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
> > > >>>> PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>
> > > >>>> PREFIX iospress-dt: <http://ld.iospress.nl/rdf/datatype/>
> > > >>>> PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
> > > >>>>
> > > >>>> SELECT ?type ?pubDate ?paper ?doi ?title ?abstract ?access
> > > >>>>   (group_concat(distinct ?authorName;separator=", ") as ?Authors)
> > > >>>>   (group_concat(distinct ?keyword;separator=", ") as ?keywords)
> > > >>>>   (group_concat(distinct ?university;separator=", ") as ?universities)
> > > >>>>   (group_concat(distinct ?country;separator=", ") as ?countries)
> > > >>>> WHERE {
> > > >>>>   {?paper rdf:type iospress:Chapter.}
> > > >>>>   union
> > > >>>>   {?paper rdf:type iospress:Article.}
> > > >>>>
> > > >>>>   ?paper rdfs:label ?title;
> > > >>>>     rdf:type ?type;
> > > >>>>
> > > >>>>     iospress:publicationDate ?pubDate;
> > > >>>>     iospress:publicationAbstract ?abstract;
> > > >>>>
> > > >>>>     iospress:publicationIncludesKeyword ?keyword;
> > > >>>>     iospress:publicationAuthorList [?idx ?author].
> > > >>>>
> > > >>>>   ?issueOrBook iospress:partOf ?volumeOrSerie.
> > > >>>>   ?paper iospress:partOf ?issueOrBook.
> > > >>>>
> > > >>>>
> > > >>>>   OPTIONAL {
> > > >>>>     ?issueOrBook iospress:isbn ?bookIsbn.
> > > >>>>   }
> > > >>>>   OPTIONAL {
> > > >>>>     ?paper iospress:publicationDoiUrl ?doi.
> > > >>>>   }
> > > >>>>   OPTIONAL {
> > > >>>>     ?author rdfs:label ?authorName.
> > > >>>>   }
> > > >>>>   OPTIONAL {
> > > >>>>     ?author iospress:contributorAffiliation ?affiliation.
> > > >>>>     ?affiliation rdfs:label ?university;
> > > >>>>   }
> > > >>>>   OPTIONAL {
> > > >>>>     ?affiliation iospress:geocodingOutput ?geocoded.
> > > >>>>     ?geocoded iospress-geocode:country ?country
> > > >>>>   }
> > > >>>>   OPTIONAL {
> > > >>>>     ?paper iospress:publicationAccessibility ?access.
> > > >>>>   }
> > > >>>>   OPTIONAL {
> > > >>>>     ?volumeOrSerie iospress:partOf ?journal;
> > > >>>>   }
> > > >>>>   FILTER(
> > > >>>>     (
> > > >>>>       (datatype(?pubDate) = xsd:date &&
> > > >>>>        xsd:dateTime(?pubDate) > "1999-12-31T23:00:00.000Z"^^xsd:dateTime &&
> > > >>>>        xsd:dateTime(?pubDate) < "2021-05-18T12:16:58.841Z"^^xsd:dateTime ) ||
> > > >>>>       (datatype(?pubDate) = xsd:gYear &&
> > > >>>>        ?pubDate >= "2000"^^xsd:gYear && ?pubDate <= "2021"^^xsd:gYear)
> > > >>>>     )
> > > >>>>
> > > >>>>     && (regex (?keyword, "sickness", "i"))
> > > >>>>   )
> > > >>>> }
> > > >>>> GROUP BY ?type ?abstract ?pubDate ?paper ?doi ?title ?access
> > > >>>>
> > > >>>> ORDER BY ?pubDate ?paper
> > > >>>> LIMIT 50
> > > >>>>
> > > >>>>
> > > >>>> On Thu, 6 May 2021 at 20:10, Andy Seaborne <a...@apache.org> wrote:
> > > >>>>
> > > >>>>> Hi there,
> > > >>>>>
> > > >>>>> Showing the query would be helpful but some general remarks:
> > > >>>>>
> > > >>>>> 1/ If the query or the setup for Fuseki is needing more than the
> > > >>>>> default heap size, then it might be that the Java JVM is getting
> > > >>>>> into a state of heap exhaustion. This manifests as the CPU load
> > > >>>>> getting very high. It will seem like nothing is happening (waiting
> > > >>>>> for response).
> > > >>>>>
> > > >>>>> 2/ The query may be expensive.
> > > >>>>>
> > > >>>>> Things to look for
> > > >>>>> * cross products - two parts of the query pattern that are not
> > > >>>>> connected.
> > > >>>>>
> > > >>>>> { ?s ?p ?o . ?a ?b ?c } is N-squared the size of the database.
> > > >>>>>
> > > >>>>> * sort, spilling to disk or combined with a cross product in the
> > > >>>>> query.
> > > >>>>>
> > > >>>>> 3/ If no results are coming back, then the query is of a form that
> > > >>>>> does not stream back - sort, or CONSTRUCT maybe.
> > > >>>>>
> > > >>>>> There was a useful presentation recently that talks about the
> > > >>>>> principles of query efficiency.
> > > >>>>>
> > > >>>>> SPARQL Query Optimization with Pavel Klinov
> > > >>>>> https://www.youtube.com/watch?v=16eMswT2x2Y
> > > >>>>>
> > > >>>>> More inline:
> > > >>>>>
> > > >>>>> On 06/05/2021 09:54, Martin Van Aken wrote:
> > > >>>>>> Hi!
> > > >>>>>> I'm Martin, I'm a software developer new to the Triples/SPARQL
> > > >>>>>> world. I'm currently building queries against a Fuseki/TDB backend
> > > >>>>>> (that I can work on too) and I'm getting into significant
> > > >>>>>> performance problems (including never ending queries).
> > > >>>>>
> > > >>>>> Are updates also happening at the same time?
> > > >>>>>
> > > >>>>>> Despite what I thought was a good search on the apache
> > > >>>>>> jena website I could not find a lot of insight about performance
> > > >>>>>> investigation so I'm trying it here.
> > > >>>>>>
> > > >>>>>> Most of my data experience comes from the relational world (ex:
> > > >>>>>> PG) so I'm sometimes drawing comparisons there.
> > > >>>>>>
> > > >>>>>> To give some context my data set is around 15 linked concepts,
> > > >>>>>> with the number of triples for each ranging from some hundreds to
> > > >>>>>> 500K - total less than 2 million (documents/authors/publication
> > > >>>>>> kind of data).
> > > >>>>>>
> > > >>>>>> On to questions:
> > > >>>>>>
> > > >>>>>> - When I'm facing a slow query, what are my investigation
> > > >>>>>> options? Is there an equivalent of an "explain plan" in SQL
> > > >>>>>> pointing to the query specific slow points? What's the advised way
> > > >>>>>> for performance checks in SPARQL?
> > > >>>>>
> > > >>>>> qparse --print=opt --file query.rq
> > > >>>>>
> > > >>>>>> - Are there any performance setups to be aware of on the server
> > > >>>>>> side? Like ways to check indexes are correctly built (outside of
> > > >>>>>> text search that I'm not working with for the moment)
> > > >>>>>> - We're currently using TDB1. I've seen the transactional
> > > >>>>>> benefits of TDB2 - are there performance improvements too that
> > > >>>>>> would warrant a migration there?
> > > >>>>>
> > > >>>>> Not on the query side.
> > > >>>>>
> > > >>>>> Andy
> > > >>>>>
> > > >>>>>>
> > > >>>>>> Thanks a lot already!
> > > >>>>>>
> > > >>>>>> Martin
> >
> > --
> > *Martin Van Aken - Freelance Enthusiast Developer*
> >
> > Mobile : +32 486 899 652
> >
> > Follow me on Twitter : @martinvanaken <http://twitter.com/martinvanaken>
> > Call me on Skype : vanakenm
> > Hang out with me : mar...@joyouscoding.com
> > Contact me on LinkedIn : http://www.linkedin.com/in/martinvanaken
> > Company website : www.joyouscoding.com
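
The "high risk" pair of sequential OPTIONALs discussed above can be rewritten with the nested form Steve mentions. This is a minimal sketch only, trimmed to the author/affiliation part of the query and reusing the property names from the thread; it has not been run against the real data, and the surrounding SELECT is an assumption made to keep the fragment self-contained.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>

SELECT ?author ?university ?country
WHERE {
  ?paper iospress:publicationAuthorList [ ?idx ?author ] .
  OPTIONAL {
    ?author iospress:contributorAffiliation ?affiliation .
    ?affiliation rdfs:label ?university .
    # Nested: the geocoding pattern only extends solutions in which the
    # enclosing OPTIONAL matched, so ?affiliation is bound when it runs.
    OPTIONAL {
      ?affiliation iospress:geocodingOutput ?geocoded .
      ?geocoded iospress-geocode:country ?country
    }
  }
}
```

With the nesting, an author without an affiliation simply leaves ?university and ?country unbound, and the geocoding pattern is never matched with ?affiliation unbound.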