Hi Nadav,

yours is quite a common use case when you want to drive a UI with a SPARQL endpoint in your backend.

The problem, as you said, is that the results need to be sorted and then sliced (i.e. LIMIT + OFFSET). When you have a large dataset, that means sorting or scanning through a lot of data. Moreover, at present, each page you hit (== each query you run, the same query with just a different OFFSET) has to repeat the same work.

We are well aware of this. You might be interested in this thread from the jena-dev mailing list:

  http://markmail.org/message/p5x334m7dy676oik

In particular point 3, which I report verbatim here:

  "3/ Paging.

  The idiom of a sequence of SELECT / ORDER BY / OFFSET / LIMIT calls with changes in OFFSET to get different slices of a result set happens in linked data apps (and others). We've been optimizing these in ARQ using "top N" queries but LD-Access can offer facilities at a different granularity. Catch that query, issue the full SELECT / ORDER BY query, cache the results. Then you can slice the results as pages without going back to the server.

  One side effect of this is paging without sorting, another is moving sorting away from the origin server. Sorting is expensive but it's needed to guarantee stability of the result set being sliced into pages. So issue the query as SELECT and either sort locally (you get to choose the resources available), to get the same sorted pageable results. Or if ordering is only for stability, just remove the ORDER by and replace with a promise to slice from an unchanging result set."

We tried to improve things as much as possible from the query engine optimization point of view:

  https://issues.apache.org/jira/browse/JENA-89
  https://issues.apache.org/jira/browse/JENA-90
  https://issues.apache.org/jira/browse/JENA-108
  https://issues.apache.org/jira/browse/JENA-109
  https://issues.apache.org/jira/browse/JENA-111
  https://issues.apache.org/jira/browse/JENA-114

Judging from the amount of effort I put into this, you can imagine I have problems very similar to yours.
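
The caching idea from the quoted jena-dev message (issue the full SELECT once, sort locally, then slice pages without going back to the server) can be sketched roughly as below. This is only an illustration under assumptions, not something Jena or LD-Access provides: the class `PagedResultCache`, its methods, and the `fullQuery` supplier are hypothetical names, and the supplier stands in for whatever code runs your SELECT without ORDER BY / LIMIT / OFFSET and collects the rows. It also assumes the full result fits in memory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of Jena): runs the expensive query once,
 * sorts the rows locally, and serves every page from the cached list.
 */
class PagedResultCache<T> {
    // Assumption: wraps the SELECT issued WITHOUT ORDER BY / LIMIT / OFFSET.
    private final Supplier<List<T>> fullQuery;
    private final Comparator<? super T> order;
    private List<T> sorted;      // null until the first page is requested
    private int queryRuns = 0;   // how many times the backend was hit

    PagedResultCache(Supplier<List<T>> fullQuery, Comparator<? super T> order) {
        this.fullQuery = fullQuery;
        this.order = order;
    }

    /** Returns rows [offset, offset + limit) of the locally sorted result. */
    synchronized List<T> page(int offset, int limit) {
        if (offset < 0 || limit < 0) {
            throw new IllegalArgumentException("offset and limit must be >= 0");
        }
        if (sorted == null) {
            // Pay the query and sort cost once, not once per page.
            List<T> rows = new ArrayList<>(fullQuery.get());
            rows.sort(order);
            sorted = Collections.unmodifiableList(rows);
            queryRuns++;
        }
        int from = Math.min(offset, sorted.size());
        int to = (int) Math.min((long) from + limit, sorted.size());
        return new ArrayList<>(sorted.subList(from, to));
    }

    /** Drops the cached result, e.g. after the underlying data has changed. */
    synchronized void invalidate() {
        sorted = null;
    }

    synchronized int queryRuns() {
        return queryRuns;
    }
}
```

With something like this in front of the endpoint, `page(0, 50)`, `page(50, 50)` and so on all read from the same cached list, so the result set being sliced stays stable between pages. When to call `invalidate()` is left open here and depends on how often your data changes.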
:-)

I don't think there is anything more we can do from the point of view of the query engine. If someone has good ideas on this, I am all ears.

Views are an option (== a sort of internal cache). But then something like LD-Access, as described by Andy, would be much better, since you can use it with remote SPARQL endpoints and with different implementations.

To conclude, I do not have a good answer to your question... other than caching. But, at the moment, we do not have anything you can use out-of-the-box.

Do you need this for a commercial product/service/project? Here is another option for you and your company:

  "The Epimorphics team has unparalleled expertise in the development of Apache Jena and includes many of the original developers. Epimorphics has 4 committers on the Apache Jena project. [...] We also carry out custom development of extensions to Jena and Jena based systems. For more information on any of these support packages, or other Jena-related services, please contact [email protected]."
  -- http://www.epimorphics.com/web/support

(disclaimer: I do not work for Epimorphics).

I am sure a lot of our Jena (and Fuseki) users would benefit immensely from a proper caching layer.

By the way, how big is your dataset (in terms of triples)? Are you using TDB? Joseki? Fuseki? What are you using to run your queries? How much RAM do you have on the machine? Is it a 64 bit OS and JVM?

There is a lot you can do to tune performance... give us more details.

Regards,
Paolo

Nadav Hoze wrote:
> hi,
> I have a medical ontology stored with Jena tdb.
> the object model is quite simple:
> 1. we have medical concepts that have the following fields: code, code system
> (explain it on #3) a unique id and text.
> 2. Medical relations between these concepts.
> 3. because medical concepts are produced from a certain code system we have
> an object for that which is the container for the concepts (details of it are
> not important).
>
> all of the data of course is stored as triples, where as for medical concepts
> the id is the triple identifier.
>
> when I query for all the concepts of a code system the result is huge and I
> would like to get it by paging.
> now I do support it by using limit offset and sorting but it's extremely
> slow because of the sorting every time I ask for the next bulk.
> is there a way to do so without sorting, maybe use index?
>
> thanks,
