Thanks Mr. Anderson,

If I understand your comments correctly, you are saying that it is more efficient 
to perform these types of joins on the server side, with an optimized query 
processor, than on the client side. I fully agree, and I would appreciate it if 
someone could show me how Jena performs that join on the server side and how 
that differs from my example.
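For concreteness, here is how I understand the join would be handed to the query processor: as a single SPARQL query with both triple patterns in one basic graph pattern, so ARQ can plan the join and the filter next to the store (the my: vocabulary below is a placeholder standing in for the predicates from my code snippet, not a real vocabulary):

```sparql
PREFIX my:   <http://example.org/my-vocab#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Both patterns share ?child, so the join (and the filter)
# is left to the query processor rather than done in my code.
SELECT ?child ?label
WHERE {
  ?child my:value   ?value .
  ?child rdfs:label ?label .
  FILTER (?value > 12345)
}
```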

Regarding server-side processing and the "central principles for effective query 
processing", you hit the nail on the head with your statement. My understanding 
is that the Jena API operates directly on the underlying data, so my example is 
server-side: it runs in the same JVM as the TDB database, which is the only 
supported deployment of the API (see 
https://jena.apache.org/documentation/tdb/faqs.html#can-i-share-a-tdb-dataset-between-multiple-applications
). There is no client version of the Jena API; only SPARQL is supported 
client-side. Do I understand this correctly?

It therefore looks like we have a disconnect. I believe that the Jena API is a 
server-side function, while you state that it is a client-side function. It 
would be great to get clarity on this disconnect.

Here is a basic assumption of mine (I may be incorrect): Jena is a set of APIs, 
not a traditional database. The RDF API is the core that allows interaction 
with triples, Jena TDB is the persistent file storage, Jena ARQ is the Jena 
implementation of SPARQL, Jena Fuseki is an implementation of a database 
server, the Jena Ontology API provides a higher-level interface to OWL and 
other models, and the Jena Inference API provides reasoning over the data. 
Users of Jena can use Jena Fuseki or build their own database(s). Do I 
understand this correctly?

Andy answered my original question about joins: he stated that Jena ARQ uses 
the Jena API, Graph.find, which also backs listStatements (you included this in 
your response). Again, if I understand this correctly, Jena ARQ does not 
implement a join algorithm based on two sorted lists, so the join must be 
performed with a lookup for each element returned from the first list (as I 
showed in my example). While this is fine for small datasets, it becomes 
problematic for large ones. Do I understand this correctly?
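To illustrate the distinction I am drawing, here is a plain-Java sketch of my own (not Jena code) contrasting the two join strategies over simple string keys: the first does one lookup per element of the left input, like my example; the second is a merge join that makes a single linear pass over two sorted inputs:

```java
import java.util.*;

// Illustrative sketch only: two ways to intersect the keys of two inputs.
public class JoinSketch {

    // Lookup-per-element join: with 10,000 children on the left,
    // this performs 10,000 separate probes of the right side.
    static List<String> lookupJoin(List<String> left, Set<String> right) {
        List<String> out = new ArrayList<>();
        for (String key : left) {
            if (right.contains(key)) {
                out.add(key);
            }
        }
        return out;
    }

    // Merge join: one linear pass over two *sorted* inputs,
    // advancing whichever cursor holds the smaller key.
    static List<String> mergeJoin(List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i).compareTo(right.get(j));
            if (cmp == 0) {
                out.add(left.get(i));
                i++;
                j++;
            } else if (cmp < 0) {
                i++;
            } else {
                j++;
            }
        }
        return out;
    }
}
```

Both return the same result; the difference is the number of probes into the right-hand side, which is exactly what I am worried about at scale.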

Regarding my statement "TDB already has indexed lists of OSP, POS and SPO" and 
your response "were that the case, the api documentation would describe it. 
does it?": the documentation refers to the indexes at 
https://jena.apache.org/documentation/tdb/store-parameters.html and in passing 
in other places, but I am not aware of anywhere in the documentation that 
states how the API uses these indexes. Again, this goes back to the core of my 
question: effective joins are hard to do without indexed data.
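To make my assumption concrete, here is a tiny, purely illustrative sketch (my own guess at the general technique, not Jena's actual planner code) of how a store holding SPO, POS and OSP indexes could pick the index whose sort order begins with the bound terms of a triple pattern, so a match becomes a range scan instead of a full scan:

```java
// Illustrative only: choose an index whose leading column is bound.
// The booleans say whether subject, predicate, object are bound.
public class IndexChoice {
    static String chooseIndex(boolean s, boolean p, boolean o) {
        if (s) return "SPO"; // subject bound: range scan on SPO
        if (p) return "POS"; // predicate bound (e.g. rdfs:label): POS
        if (o) return "OSP"; // only object bound: OSP
        return "SPO";        // nothing bound: any index, full scan
    }
}
```

This is the behavior I assumed TDB had; my question is whether (and where) the documentation confirms that the API exploits it.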

My goal is to understand how to best use Jena. There may be places where Jena 
is not a good fit; that is OK, I just need to know where those places are so 
that we can work around or avoid them.

My gut feeling is that Jena is a great choice when the user needs to follow 
their nose through the data rather than return large result sets. If the 
queries are small and specific and return a small amount of data, then Jena 
will provide good performance. If extensive joins are needed or large result 
sets are returned, then the user has to think about which API to use (core or 
ARQ); there will be situations where Jena does not provide the optimal solution 
and may not be the right choice.

Finally, regarding my list of concerns and questions, let's start with a 
specific one: is there a SPARQL equivalent to SQL views, functions and stored 
procedures? I believe the answer is no; if so, what is the best practice for 
providing this functionality?

Again, thanks for your help.

Best regards,
Niels


-----Original Message-----
From: james anderson [mailto:[email protected]] 
Sent: Sunday, November 13, 2016 23:56
To: [email protected]
Subject: Re: How do I do a join between multiple model.listStatments calls?

good morning;

> On 2016-11-14, at 02:10, Niels Andersen <[email protected]> wrote:
> 
> Good evening to you as well Mr. Anderson,
> 
> We are building an application where we will end up with several hundreds of 
> millions of triples. So the scope of the application could be considered 
> large.
> 
> As for the initial question about model.listStatements joins, here is a code 
> snippet:
> 
> [nested loop natural join implementation…] If the query above returned 
> 10,000 children in iterator1, then iterator2 will be called 10,000 times. 
> This does not seem to be very efficient.

there is no reason to expect that it would be.
at an abstract level, it ignores two rather central principles for effective 
query processing:
- if you do not need data, do not touch it.
- if you do not need data, do not move it.

on one hand, there was this message, earlier in the thread,

>> On 2016-11-13, at 22:19, Andy Seaborne <[email protected]> wrote:
>> 
>> ARQ is either as fast at joins as listStatements (because it is using the 
>> underlying Graph.find that backs listStatement) or is faster because it 
>> avoids churning a lot of unnecessary bytes.
>> 
>> As many NoSQL applications have discovered, reinventing joins client-side 
>> results in a lot of data transfer from data storage to the client.

which alludes to general experience in this regard.
on the other, one could perform concrete timing experiments to determine the 
respective wild-subject match rate and the statement scan rate for your 
particular repository statistics, profile the time spent where in the stack, 
and predict quantitatively, that the approach would likely underperform one 
which left the join process to mechanisms which are closer to the store and 
move less data.

> 
> To the best of my knowledge, TDB already has indexed lists of OSP, POS and 
> SPO. I would have thought that there was a way to run the second query by 
> just passing an ordered list of the objects returned in the first query. This 
> provides for far better matching than having to run the same query many 
> times. 

were that the case, the api documentation would describe it.
does it?

> 
> The alternative approach that we are looking at is to run a second query 
> where we return all the labels of all objects, store the results of each 
> query in a HashSet indexed on ObjectResource and do a RetainAll to join the 
> two sets. The problem with this is that there are way too many labels in the 
> system to do this effectively. I can also create a code snippet for this if 
> it is necessary.
> 
> So my question is: What is the correct way to join the results from two 
> model.listStatements?

my question is, why is it necessary to do that on the client side?

> 
> As for the initial question about model.listStatements filtering, here is a 
> code snippet:
> 
>       StmtIterator iterator1 = model.listStatements(
>               new SimpleSelector(nodeResource, MY_VOCAB.value, (RDFNode)null)
>               {
>                       public boolean selects(Statement s)
>                       {
>                               // return the object literals > 12345
>                               return (s.getObject().asLiteral().getInt() > 12345);
>                       }
>               });
> In the query above; for every value result, the selector has to do a 
> comparison with the filter value. I would have thought that it was easier for 
> TDB to do the filtering, than to include it in a SimpleSelector.
> 
> My question is: What is the correct way to implement filtering?

while “correct” depends much on the concrete case, the method, above, relies on 
the same problematic approach as your join implementation, yet it makes no case 
for a mechanism which performs the work on the client side rather than leaving 
it to a query processor.

> 
> As for the long list in my email that I accidentally sent multiple times; I 
> hope the concerns and questions are clear enough to be answered. Let me know 
> if clarification is needed.

those are the points which permit commiseration only.
so long as they remain abstract complaints, it is difficult to bring experience 
to bear on them.
my experience differs from yours in significant ways, but without concrete 
information, it is not possible to explore, why.

your described case would appear to require a query with a single bgp, which 
contains two statement patterns and a filter.
given that case, your complaints leave the impression, that the sparql 
processor executed queries of that form less effectively than you expected 
and/or was not stable in the process.
at the level of detail which you have supplied, i would not expect that to have 
been the case.
you will need to say more.

best regards, from berlin,

---
james anderson | [email protected] | http://dydra.com




