Re: Parallel SELECT queries in Fuseki or Jena's CLI

Jakub Jałowiec Fri, 05 Nov 2021 05:25:39 -0700

I've noticed that the figure from my last message was not sent
properly; here is a link to it: https://kubajal.github.io/covidepid/


Best regards,
Jakub

On Fri, 5 Nov 2021 at 13:00, Jakub Jałowiec <j.jalow...@student.uw.edu.pl>
wrote:

> Hi Andy,
> thanks for your input.
> Let me drill it down a little more so I understand it better. Let's say I
> have the following query:
>
>> prefix foaf: <http://xmlns.com/foaf/0.1/>
>> prefix ex: <http://example.net/>
>> SELECT ?p1 (count(?label) as ?labelCounter)
>> WHERE {
>>   ?p1 foaf:knows+ ?p2 .
>>   ?p2 ex:hasLabel ?label .
>> }
>> GROUP BY ?p1
>>
>
> Let's also say I have multiple cores on my machine and I want to speed up
> a single query just by utilizing their parallelism. The bottleneck here is
> the complex property path (*foaf:knows+*).
> To speed things up, I want to split the search space into chunks, each
> processed by one core. Intuitively, the best solution would be to split by
> *p1* equally between all the cores (e.g. if I have *N* persons and *k*
> cores then each core receives *N*/*k* persons to evaluate as *p1*). The
> figure below shows the workload distribution I am trying to achieve (green
> circles are persons, brown arcs are instances of the *foaf:knows*
> relationship; instances of *ex:hasLabel* have been left out).
> [image: cores.png]
> The query engine would start the evaluation at a "person" node (*p1* in
> the query) and then just do a closure of the *foaf:knows* relationship
> (*p1 foaf:knows+ p2*). This would require shared memory between all the
> threads.
> I have three questions:
>
>    1. How would the SPARQL query engine know that it needs to split the
>    workload in the 'per root of the pattern' manner and not in a different
>    way? Is there a mechanism in the SPARQL interpreter for that?
>    2. Can a single transaction be shared between multiple threads (cores)?
>    3. Do I need a transaction if the threads I am running are guaranteed
>    to not modify anything? (the query is a SELECT so it is 'read-only')
>
> Best regards,
> Jakub
>
> On Sun, 31 Oct 2021 at 11:29, Andy Seaborne <a...@apache.org> wrote:
>
>> Hi Jakub,
>>
>> The preferred way to have parallel actions on a dataset is via
>> transactions.
>>
>> concurrency-howto covers threading within a transaction. Possible with
>> further MRSW (multiple reader or single writer) locking.
>>
>> This is how Fuseki executes multiple requests.  Each HTTP request that
>> is executing in true parallel is executed on a separate thread and
>> inside a transaction.
>>
>> So have each thread start a transaction, execute as many sequential
>> queries as it needs and end the transaction.
>>
>> In fact, only TDB2 enforces this; TDB1 only enforces it if it has
>> already been used transactionally. Other datasets are multiple-reader
>> safe anyway.  But placing inside a transaction is the correct way.
>>
>>       Andy
>>
>> On 30/10/2021 15:44, Jakub Jałowiec wrote:
>> > Dear community,
>> > is there any high-level user interface to execute parallel SELECT
>> > queries in Apache Fuseki or the CLI of Apache Jena?
>> > I've found a short note on parallelism in Apache Jena here:
>> > https://jena.apache.org/documentation/notes/concurrency-howto.html. But
>> > that is not really what I am looking for as it is a general note on how to
>> > implement low-level parallelism in Apache Jena.
>> > I am interested in analytic benchmarking of Apache Jena. Ideally, I am
>> > looking for something that works out-of-the-box just for SELECT queries (no
>> > need to modify the model in a parallel fashion or synchronize state etc.).
>> >
>> > I'd appreciate any suggestions or pointers to any resources as I am
>> > new to Apache Jena. I couldn't find a lot in the archives of the list using
>> > the keywords "parallelism" and "concurrent".
>> >
>> > Best regards,
>> > Jakub
>> >
>>
>
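
For illustration, here is a minimal sketch of the two ideas discussed in this
thread, combined: the persons are split into *k* chunks on the client side,
each chunk becomes its own SELECT query restricted with a VALUES clause, and
the queries are sent to Fuseki in parallel. As Andy notes, Fuseki runs each
HTTP request on a separate thread inside a transaction, so the client does not
manage transactions itself. Everything named here is an assumption made for
the example: the endpoint URL, the dataset name "ds", the helper function
names, and the premise that the list of person IRIs is already known. Whether
this actually speeds up the property path evaluation is not something the
thread establishes.

```python
import json
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical endpoint: a local Fuseki server with a dataset named "ds".
ENDPOINT = "http://localhost:3030/ds/query"

# The query from the thread, restricted to one chunk of ?p1 values.
QUERY_TEMPLATE = """\
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex: <http://example.net/>
SELECT ?p1 (COUNT(?label) AS ?labelCounter)
WHERE {{
  VALUES ?p1 {{ {values} }}
  ?p1 foaf:knows+ ?p2 .
  ?p2 ex:hasLabel ?label .
}}
GROUP BY ?p1
"""


def split_into_chunks(items, k):
    """Split items into at most k contiguous chunks of near-equal size."""
    if not items:
        return []
    k = max(1, min(k, len(items)))
    size, extra = divmod(len(items), k)
    chunks, start = [], 0
    for i in range(k):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def build_query(person_iris):
    """Build the SELECT query for one chunk of person IRIs."""
    values = " ".join(f"<{iri}>" for iri in person_iris)
    return QUERY_TEMPLATE.format(values=values)


def run_select(query, endpoint=ENDPOINT):
    """POST one query (SPARQL 1.1 Protocol, URL-encoded form) and return its bindings."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


def run_partitioned(person_iris, workers, endpoint=ENDPOINT):
    """Run one query per chunk in parallel and concatenate the result rows."""
    queries = [build_query(chunk) for chunk in split_into_chunks(person_iris, workers)]
    with ThreadPoolExecutor(max_workers=max(1, len(queries))) as pool:
        results = pool.map(lambda q: run_select(q, endpoint), queries)
        return [row for rows in results for row in rows]
```

Because the query groups by ?p1 and the chunks are disjoint, concatenating the
per-chunk rows should give the same groups as the single unrestricted query. A
call such as run_partitioned(person_iris, workers=4) needs a running server at
the assumed endpoint.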
