good morning;

> On 2016-11-14, at 02:10, Niels Andersen <[email protected]> wrote:
> 
> Good evening to you as well Mr. Anderson,
> 
> We are building an application where we will end up with several hundreds of 
> millions of triples. So the scope of the application could be considered 
> large.
> 
> As for the initial question about model.listStatements joins, here is a code 
> snippet:
> 
> [nested loop natural join implementation…]
> If the query above returned 10,000 children in iterator1, then iterator2 will 
> be called 10,000 times. This does not seem to be very efficient. 

there is no reason to expect that it would be.
at an abstract level, it ignores two rather central principles for effective 
query processing:
- if you do not need data, do not touch it.
- if you do not need data, do not move it.

on one hand, there was this message, earlier in the thread,

>> On 2016-11-13, at 22:19, Andy Seaborne <[email protected]> wrote:
>> 
>> ARQ is either as fast at joins as listStatements (because it is using the 
>> underlying Graph.find that backs listStatement) or is faster because it 
>> avoids churning lot of unnecessary bytes.
>> 
>> As many NoSQL application have discovered, reinventing joins client side, 
>> results in a lot of data transfer from data storage to client.

which alludes to general experience in this regard.
on the other, one could perform concrete timing experiments to determine the 
respective wild-subject match rate and the statement scan rate for your 
particular repository statistics, profile where the time is spent in the stack, 
and predict quantitatively that the approach would likely underperform one 
which leaves the join process to mechanisms which are closer to the store and 
move less data.

> 
> To the best of my knowledge, TDB already has indexed lists of OSP, POS and 
> SPO. I would have thought that there was a way to run the second query by 
> just passing an ordered list of the objects returned in the first query. This 
> provides for far better matching than having to run the same query many 
> times. 

were that the case, the api documentation would describe it.
does it?

> 
> The alternative approach that we are looking at is to run a second query 
> where we return all the labels of all objects, store the results of each 
> query in a HashSet indexed on ObjectResource and do a RetainAll to join the 
> two sets. The problem with this is that there are way too many labels in the 
> system to do this effectively. I can also create a code snippet for this if 
> it is necessary.
> 
> So my question is: What is the correct way to join the results from two 
> model.listStatements?

my question is, why is it necessary to do that on the client side?
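for comparison, the two listStatements calls could instead be written as one 
query with both statement patterns in a single basic graph pattern, and the 
join left to ARQ and the store. a sketch only; my:child and rdfs:label stand 
in for whatever the actual predicates are:

```sparql
PREFIX my:   <http://example.org/vocab#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?child ?label
WHERE {
  ?parent my:child   ?child .   # what iterator1 enumerated
  ?child  rdfs:label ?label .   # what iterator2 re-queried per child
}
```

the engine can then choose join order and indexes itself, and only the final 
bindings cross to the client.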

> 
> As for the initial question about model.listStatements filtering, here is a 
> code snippet:
> 
>       StmtIterator iterator1 = model.listStatements(
>               new SimpleSelector(nodeResource, MY_VOCAB.value, (RDFNode)null)
>               {
>                       public boolean selects(Statement s)
>                       {
>                               // return the object literals > 12345
>                               return (s.getObject().asLiteral().getInt() > 
> 12345); 
>                       }
>               });
> In the query above; for every value result, the selector has to do a 
> comparison with the filter value. I would have thought that it was easier for 
> TDB to do the filtering, than to include it in a SimpleSelector.
> 
> My question is: What is the correct way to implement filtering?

while “correct” depends much on the concrete case, the method above relies on 
the same problematic approach as your join implementation, yet makes no case 
for a mechanism which performs the work on the client side rather than leaving 
it to a query processor.

> 
> As for the long list in my email that I accidentally sent multiple times; I 
> hope the concerns and questions are clear enough to be answered. Let me know 
> if clarification is needed.

those are points which permit commiseration only.
so long as they remain abstract complaints, it is difficult to bring experience 
to bear on them.
my experience differs from yours in significant ways, but without concrete 
information, it is not possible to explore why.

your described case would appear to require a query with a single bgp, which 
contains two statement patterns and a filter.
given that case, your complaints leave the impression that the sparql 
processor executed queries of that form less effectively than you expected 
and/or was not stable in the process.
at the level of detail which you have supplied, i would not expect that to have 
been the case.
you will need to say more.
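concretely, that single bgp with two statement patterns and a filter need be 
no more than the following. a sketch, with placeholder predicates, not a 
transcription of your vocabulary:

```sparql
PREFIX my:   <http://example.org/vocab#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?child ?label
WHERE {
  ?parent my:child   ?child .
  ?child  my:value   ?v ;
          rdfs:label ?label .
  FILTER (?v > 12345)
}
```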

best regards, from berlin,

---
james anderson | [email protected] | http://dydra.com




