Re: Achieving reasonably performing federated queries

Rob Vesse Thu, 25 Jul 2013 02:50:47 -0700

Can you provide examples of the query plan (the algebra) with the
optimizer on and off?


The issue is likely down to ARQ's index join linearization optimization.
This works well for local data and small federated queries but can perform
poorly for large federated queries.  If this is the case then the
optimized algebra will include sequence/conditional operators, while the
unoptimized form will include join/leftjoin instead.
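
To make the difference concrete, here is a toy sketch in plain Java. It is
not ARQ code: CountingService and all the method names are made up for
illustration, and it only models how many remote requests each evaluation
style issues when the two sides share no variables.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model only: none of these names are ARQ classes.
final class FederatedJoinSketch {

    // Hypothetical stand-in for a remote SERVICE endpoint that counts requests.
    static final class CountingService {
        private final List<String> rows;
        private int calls = 0;

        CountingService(int rowCount) {
            this.rows = new ArrayList<>(Collections.nCopies(rowCount, "row"));
        }

        List<String> fetch() {
            calls++; // one remote round trip per invocation
            return rows;
        }

        int calls() {
            return calls;
        }
    }

    // "sequence" style: the right side is re-evaluated for every left row.
    // Returns the number of candidate pairs produced.
    static int indexJoin(CountingService left, CountingService right) {
        int pairs = 0;
        for (String ignored : left.fetch()) {
            pairs += right.fetch().size();
        }
        return pairs;
    }

    // "join" style: each side is evaluated once and combined locally.
    // Returns the number of candidate pairs produced.
    static int materializedJoin(CountingService left, CountingService right) {
        List<String> leftRows = left.fetch();
        List<String> rightRows = right.fetch();
        return leftRows.size() * rightRows.size();
    }

    // Total remote requests when both sides return n rows (sequence style).
    static int callsWithIndexJoin(int n) {
        CountingService a = new CountingService(n);
        CountingService b = new CountingService(n);
        indexJoin(a, b);
        return a.calls() + b.calls();
    }

    // Total remote requests when both sides return n rows (join style).
    static int callsWithMaterializedJoin(int n) {
        CountingService a = new CountingService(n);
        CountingService b = new CountingService(n);
        materializedJoin(a, b);
        return a.calls() + b.calls();
    }

    public static void main(String[] args) {
        int n = 150;
        System.out.println("sequence-style remote calls: " + callsWithIndexJoin(n));
        System.out.println("join-style remote calls:     " + callsWithMaterializedJoin(n));
    }
}
```

With 150 rows on each side, both styles produce the same 22500 candidate
pairs, but the sequence style issues 151 requests while the join style
issues 2.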

In current releases of ARQ this optimization is always on.  In the
2.10.2-SNAPSHOTs you can now do the following to disable this specific
optimization:

ARQ.getContext().set(ARQ.optIndexJoinStrategy, false);

This will allow you to get the benefit of all the other optimizations that
do apply while skipping the one that is likely causing the problem.

Poorly performing federated queries are not limited to ARQ or SPARQL; they
are a general problem of service federation.  If federated services are
the bottleneck in your application then maybe you need to consider
re-architecting to bring some of the data local.

Rob


On 7/23/13 2:56 PM, "Olivier Rossel" <olivier.ros...@gmail.com> wrote:

>Same questions here.
>So I +1 this question immensely!
>
>
>On Tue, Jul 23, 2013 at 11:48 AM, Sarven Capadisli <i...@csarven.ca>
>wrote:
>
>> Hi all,
>>
>> This is partly a summary of my recent experiences with federated queries
>> and partly a request for your feedback on making /reasonably/ performing
>> federated queries.
>>
>> The query in question is here [1]. Essentially there are two endpoints
>> (which may or may not be the same), and they return the same pattern.
>> There are millions of triples to get through, so throwing out false
>> negatives (early on) is quite important. We assume that graph names are
>> not known and that everything is accessible from the default graph. The
>> endpoint which dispatches the two queries needs to filter out what's
>> remaining. There are no common variables. This means that both endpoints
>> need to do their own thing and then the patterns are joined.
>>
>> Needless to say, OPTIONALs that are in there are expensive, but they
>> help a great deal in making sure to use only what's necessary i.e.,
>> either a refArea doesn't have an exactMatch or if there is an exactMatch,
>> it contains the domain of the refArea that's at the other endpoint.
>> Without OPTIONALs, the outer endpoint will end up with more possibilities
>> to join. Using MINUS is more or less the same.
>>
>> By default, ARQ uses an optimizer to do a whole bunch of good stuff
>> that's mostly foreign to me. What I'm aware of, however, is how it
>> behaves when it comes to SERVICE calls. When the first SERVICE call comes
>> back with n triples, the second SERVICE is called n times. Undoubtedly,
>> this doesn't scale at all.
>>
>> To work around this, I've turned off the optimizer with
>> Optimize.noOptimizer() [2] with a simple class which is called from the
>> parent endpoint's TDB assembler file. As expected, that allows the
>> parent to make only two SERVICE calls.
>>
>> This is the current state of things. I'd like to take it further to get
>> more out of this, but at this point, I need a different set of eyes.
>>
>> [I will prepare a chart for this, but this rough explanation might do
>> for now] As there are different endpoints with different amounts of data,
>> what I've experienced is that some of the fastest queries take around 3
>> seconds. Those are typically queries with a low number of joins:
>> ~150x150=22500 possibilities before the last filter kicks in. It gets
>> heavy quite fast, as I've seen some queries take 30 seconds or more.
>>
>> The TDB optimizer stats file is up to date on all endpoints.
>>
>> I am completely open to how this query can be restructured, or would
>> simply like to hear about your own experiences with federated queries.
>>
>> [1] http://csarven.ca/linked-statistical-data-analysis#federated-sparql-query
>> [2] http://jena.apache.org/documentation/javadoc/arq/com/hp/hpl/jena/sparql/algebra/optimize/Optimize.html#noOptimizer()
>>
>> -Sarven
>> http://csarven.ca/#i
>>
>>
>>
