Thanks Mr. Anderson,

If I understand your comments correctly, you are saying that it is more efficient to perform these types of joins on the server side, with an optimized query processor, than on the client side. I fully agree, and I would appreciate it if someone could show me how Jena performs that join on the server side and how this differs from my example.
Regarding server-side processing and "central principles for effective query processing", you hit the nail on the head with your statement. My understanding is that the Jena API operates directly on the underlying data, and my example is hence server-side (it runs in the same JVM as the TDB database, which is the only supported implementation of the API; see https://jena.apache.org/documentation/tdb/faqs.html#can-i-share-a-tdb-dataset-between-multiple-applications ). There is no such thing as a client version of the Jena API; only SPARQL is supported client-side. Do I understand this correctly?

It therefore looks like we have a disconnect: I believe that the Jena API is a server-side function, while you state that it is a client-side function. It would be great to get clarity on this disconnect.

Here is a basic assumption that I have. I may be incorrect, but my impression is that Jena is a set of APIs, not a traditional database: the RDF API is the core that allows interaction with triples, Jena TDB is the persistent file storage, Jena ARQ is the Jena implementation of SPARQL, Jena Fuseki is an implementation of a database, the Jena Ontology API provides a higher-level interface to OWL and other models, and the Jena Inferencing API provides reasoning over the data. Users of Jena can use Jena Fuseki or build their own database(s). Do I understand this correctly?

Andy answered my original question about joins: he stated that Jena ARQ uses the Jena API, Graph.find and listStatements (you included this in your response). Again, if I understand this correctly, Jena ARQ does not implement a join algorithm based on two sorted lists, so the join must be performed using lookups for each element returned from the first list (as I showed in my example). While this is OK for small datasets, it becomes problematic for large datasets. Do I understand this correctly?
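To make the distinction concrete, here is a small, self-contained Java sketch (not Jena code; the names and data are purely illustrative) contrasting the lookup-per-element approach with a merge join over two lists sorted on the join key:

```java
import java.util.ArrayList;
import java.util.List;

public class JoinSketch {

    // Lookup-per-element join: every row on the left triggers another probe of
    // the right side, analogous to calling listStatements once per result.
    static List<String> nestedLoopJoin(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        for (String[] l : left)
            for (String[] r : right)            // a full scan per left row
                if (l[1].equals(r[0]))
                    out.add(l[0] + " -> " + r[1]);
        return out;
    }

    // Merge join: both inputs sorted on the join key (right keys unique here),
    // so a single pass over each list suffices.
    static List<String> mergeJoin(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[1].compareTo(right.get(j)[0]);
            if (cmp < 0) i++;                   // advance whichever side is behind
            else if (cmp > 0) j++;
            else out.add(left.get(i++)[0] + " -> " + right.get(j)[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> left = List.of(new String[]{"childA", "k1"},
                                      new String[]{"childB", "k2"});
        List<String[]> right = List.of(new String[]{"k1", "labelA"},
                                       new String[]{"k2", "labelB"});
        // Both strategies produce [childA -> labelA, childB -> labelB]
        System.out.println(nestedLoopJoin(left, right));
        System.out.println(mergeJoin(left, right));
    }
}
```

Both produce the same result, but the nested-loop version touches the right side once per left row, while the merge version reads each input once; this is the asymptotic difference the question is about.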
Regarding my statement "TDB already has indexed lists of OSP, POS and SPO" and your response "were that the case, the api documentation would describe it. does it?": the documentation refers to the indexes here, https://jena.apache.org/documentation/tdb/store-parameters.html, and in passing in other places, but I am not aware of anywhere in the documentation that states how the API uses these indexes. Again, this goes back to the core of my question about effective joins, which are hard to do without indexed data.

My goal is to understand how to best use Jena. There may be places where Jena is not a good fit; that is OK, I just need to know where those places are so that we can work around them or avoid them. My gut feeling is that Jena is a great choice when the user needs to follow-her-nose into the data rather than return large datasets. If the queries are small and specific and return a small set of data, then Jena will provide good performance. If extensive joins are needed or large datasets are returned, then the user has to think about which API to use (core or ARQ); there will be situations where Jena does not provide the optimal solution and may not be the right choice.

Finally, regarding my list of concerns and questions, let's start with a specific one: is there a SPARQL equivalent to SQL views, functions and stored procedures? I believe that the answer is no, and if so, what is the best practice for providing this functionality?

Again, thanks for your help.

Best regards,
Niels

-----Original Message-----
From: james anderson [mailto:[email protected]]
Sent: Sunday, November 13, 2016 23:56
To: [email protected]
Subject: Re: How do I do a join between multiple model.listStatments calls?

good morning;

> On 2016-11-14, at 02:10, Niels Andersen <[email protected]> wrote:
>
> Good evening to you as well Mr. Anderson,
>
> We are building an application where we will end up with several hundreds of
> millions of triples.
> So the scope of the application could be considered large.
>
> As for the initial question about model.listStatements joins, here is a code
> snippet:
>
> [nested loop natural join implementation…]
>
> If the query above returned 10,000 children in iterator1, then iterator2 will
> be called 10,000 times. This does not seem to be very efficient.

there is no reason to expect that it would be. at an abstract level, it ignores two rather central principles for effective query processing:

- if you do not need data, do not touch it.
- if you do not need data, do not move it.

on one hand, there was this message, earlier in the thread,

>> On 2016-11-13, at 22:19, Andy Seaborne <[email protected]> wrote:
>>
>> ARQ is either as fast at joins as listStatements (because it is using the
>> underlying Graph.find that backs listStatement) or is faster because it
>> avoids churning a lot of unnecessary bytes.
>>
>> As many NoSQL applications have discovered, reinventing joins client side
>> results in a lot of data transfer from data storage to client.

which alludes to general experience in this regard. on the other, one could perform concrete timing experiments to determine the respective wild-subject match rate and the statement scan rate for your particular repository statistics, profile where in the stack the time is spent, and predict quantitatively that the approach would likely underperform one which left the join process to mechanisms which are closer to the store and move less data.

> To the best of my knowledge, TDB already has indexed lists of OSP, POS and
> SPO. I would have thought that there was a way to run the second query by
> just passing an ordered list of the objects returned in the first query. This
> provides for far better matching than having to run the same query many
> times.

were that the case, the api documentation would describe it. does it?
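For comparison, the two listStatements calls being joined in application code can instead be expressed as one SPARQL query and left to the query processor. A sketch, in which the property names ex:child and rdfs:label are assumptions standing in for whatever the real vocabulary uses:

```sparql
PREFIX ex:   <http://example.org/vocab#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?child ?label
WHERE {
  ?parent ex:child   ?child .   # the pattern behind iterator1
  ?child  rdfs:label ?label .   # the pattern behind the repeated iterator2 lookups
}
```

Executed via ARQ against a TDB-backed dataset, the join strategy and index choice are decided next to the store, and only the final bindings are materialized in application code.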
> The alternative approach that we are looking at is to run a second query
> where we return all the labels of all objects, store the results of each
> query in a HashSet indexed on ObjectResource and do a retainAll to join the
> two sets. The problem with this is that there are way too many labels in the
> system to do this effectively. I can also create a code snippet for this if
> it is necessary.
>
> So my question is: What is the correct way to join the results from two
> model.listStatements?

my question is, why is it necessary to do that on the client side?

> As for the initial question about model.listStatements filtering, here is a
> code snippet:
>
> StmtIterator iterator1 = model.listStatements(
>     new SimpleSelector(nodeResource, MY_VOCAB.value, (RDFNode) null)
>     {
>         public boolean selects(Statement s)
>         {
>             // return the object literals > 12345
>             return (s.getObject().asLiteral().getInt() > 12345);
>         }
>     });
>
> In the query above, for every value result, the selector has to do a
> comparison with the filter value. I would have thought that it was easier for
> TDB to do the filtering than to include it in a SimpleSelector.
>
> My question is: What is the correct way to implement filtering?

while “correct” depends much on the concrete case, the method above relies on the same problematic approach as your join implementation, yet it makes no case for a mechanism which performs the work on the client side rather than leaving it to a query processor.

> As for the long list in my email that I accidentally sent multiple times; I
> hope the concerns and questions are clear enough to be answered. Let me know
> if clarification is needed.

those are points which permit commiseration only. so long as they remain abstract complaints, it is difficult to bring experience to bear on them. my experience differs from yours in significant ways, but without concrete information, it is not possible to explore why.
your described case would appear to require a query with a single bgp, which contains two statement patterns and a filter. given that case, your complaints leave the impression that the sparql processor executed queries of that form less effectively than you expected and/or was not stable in the process. at the level of detail which you have supplied, i would not expect that to have been the case. you will need to say more.

best regards, from berlin,
---
james anderson | [email protected] | http://dydra.com
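The query shape described here, applied to the SimpleSelector example earlier in the thread, would look roughly as follows. This is a sketch only: ex:child and ex:value are assumed stand-ins for the real properties behind MY_VOCAB.

```sparql
PREFIX ex: <http://example.org/vocab#>

SELECT ?child ?v
WHERE {
  ?parent ex:child ?child .   # statement pattern 1
  ?child  ex:value ?v .       # statement pattern 2
  FILTER (?v > 12345)         # the selects() comparison, left to the processor
}
```

The FILTER replaces the per-statement comparison in the SimpleSelector, so the restriction is applied by the query engine rather than in application code.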
