Hi Andy,
thank you for your input.
Let me drill down into this a little more so that I understand it better. Let's say I
have the following query:

> PREFIX foaf: <http://xmlns.com/foaf/0.1/>
> PREFIX ex: <http://example.net/>
> SELECT ?p1 (COUNT(?label) AS ?labelCounter)
> WHERE {
>   ?p1 foaf:knows+ ?p2 .
>   ?p2 ex:hasLabel ?label .
> }
> GROUP BY ?p1
>

Let's also say I have multiple cores on my machine and I want to speed up
a single query just by utilizing their parallelism. The bottleneck here is
the complex property path (*foaf:knows+*).
To speed things up I want to split the search space into chunks, each
processed by one core. Intuitively, the best solution to me would be to
split by *p1* equally between all the cores (e.g. if I have *N* persons
and *k* cores, then each core receives *N*/*k* persons to evaluate as
*p1*). The figure below shows the workload distribution I am trying to
achieve (green circles are persons, brown arcs are instances of the
*foaf:knows* relationship; instances of *ex:hasLabel* have been left out).
[image: cores.png]
The query engine would start the evaluation at a "person" node (*p1* in the
query) and then just compute the closure of the *foaf:knows* relationship
(*p1 foaf:knows+ p2*). This would require shared memory between all the
threads.
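
To make that concrete, here is a rough sketch of what I have in mind (not
working code, just an illustration of the idea; I am assuming a transactional
Dataset such as TDB2, the Txn helper from org.apache.jena.system, and that
each chunk of persons can be injected into the query with a VALUES clause;
the class, method and variable names are made up):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.jena.query.Dataset;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSet;
import org.apache.jena.system.Txn;

public class PartitionedPathQuery {

    // Build the per-chunk query: a VALUES clause restricts ?p1 to one chunk of persons.
    static String queryFor(List<String> personUris) {
        StringBuilder values = new StringBuilder("  VALUES ?p1 {");
        personUris.forEach(uri -> values.append(" <").append(uri).append(">"));
        values.append(" }\n");
        return "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
             + "PREFIX ex: <http://example.net/>\n"
             + "SELECT ?p1 (COUNT(?label) AS ?labelCounter)\n"
             + "WHERE {\n"
             + values
             + "  ?p1 foaf:knows+ ?p2 .\n"
             + "  ?p2 ex:hasLabel ?label .\n"
             + "}\n"
             + "GROUP BY ?p1";
    }

    // Split the N persons into k chunks and evaluate each chunk on its own thread,
    // each thread inside its own read transaction.
    static void runPartitioned(Dataset dataset, List<String> allPersons, int k) {
        int chunkSize = (allPersons.size() + k - 1) / k;    // ceil(N / k)
        ExecutorService pool = Executors.newFixedThreadPool(k);
        for (int i = 0; i < allPersons.size(); i += chunkSize) {
            List<String> chunk = new ArrayList<>(
                allPersons.subList(i, Math.min(i + chunkSize, allPersons.size())));
            pool.submit(() -> Txn.executeRead(dataset, () -> {
                try (QueryExecution qExec =
                         QueryExecutionFactory.create(queryFor(chunk), dataset)) {
                    ResultSet results = qExec.execSelect();
                    while (results.hasNext()) {
                        System.out.println(results.next());  // merge per-chunk results here
                    }
                }
            }));
        }
        pool.shutdown();
    }
}

Whether injecting the chunk with VALUES like this is a sensible way to hand
each core its share of *p1* is part of what I am asking in question 1 below.
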
I have three questions:

   1. How would the SPARQL query engine know that it needs to split the
   workload in the 'per root of the pattern' manner and not in a different
   way? Is there a mechanism in the SPARQL interpreter for that?
   2. Can a single transaction be shared between multiple threads (cores)?
   (See the sketch after this list for what I mean by 'shared'.)
   3. Do I need a transaction at all if the threads I am running are
   guaranteed not to modify anything? (The query is a SELECT, so it is
   'read-only'.)
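
To make question 2 concrete, by 'shared' I mean roughly the pattern below
(again just a sketch with made-up class and method names; I do not know
whether Jena allows this, hence the question):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;

public class SharedTransactionQuestion {
    // One read transaction opened on the main thread; the worker threads would
    // run their read-only query tasks against the same dataset before it ends.
    static void runInsideOneTransaction(Dataset dataset, List<Runnable> queryTasks)
            throws InterruptedException {
        dataset.begin(ReadWrite.READ);
        try {
            ExecutorService pool = Executors.newFixedThreadPool(queryTasks.size());
            for (Runnable task : queryTasks) {
                pool.submit(task);
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        } finally {
            dataset.end();
        }
    }
}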

Best regards,
Jakub

On Sun, 31 Oct 2021 at 11:29, Andy Seaborne <a...@apache.org> wrote:

> Hi Jakub,
>
> The preferred way to have parallel actions on a dataset is via
> transactions.
>
> The concurrency-howto page covers threading within a transaction. This is
> possible with further MRSW (multiple-reader or single-writer) locking.
>
> This is how Fuseki executes multiple requests.  Each HTTP request that
> is executing in true parallel runs on a separate thread and
> inside a transaction.
>
> So have each thread start a transaction, execute as many sequential
> queries as it needs and end the transaction.
>
> In fact, only TDB2 enforces this; TDB1 only enforces it if it has
> already been used transactionally. Other datasets are multiple-reader
> safe anyway.  But placing the work inside a transaction is the correct way.
>
>       Andy
>
> On 30/10/2021 15:44, Jakub Jałowiec wrote:
> > Dear community,
> > is there any high-level user interface to execute parallel SELECT
> > queries in Apache Fuseki or the CLI of Apache Jena?
> > I've found a short note on parallelism in Apache Jena here:
> > https://jena.apache.org/documentation/notes/concurrency-howto.html. But
> > that is not really what I am looking for as it is a general note on how to
> > implement low-level parallelism in Apache Jena.
> > I am interested in analytic benchmarking of Apache Jena. Ideally, I am
> > looking for something that works out-of-the-box just for SELECT queries (no
> > need to modify the model in a parallel fashion or synchronize state etc.).
> >
> > I'd appreciate any suggestions or pointers to any resources, as I am
> > new to Apache Jena. I couldn't find a lot in the archives of the list
> > using the keywords "parallelism" and "concurrent".
> >
> > Best regards,
> > Jakub
> >
>
