Re: Parallel SELECT queries in Fuseki or Jena's CLI

Andy Seaborne Sat, 06 Nov 2021 03:34:29 -0700

Hi Jakub,

For this data, what does it look like?
  How deep are the foaf:knows chains?
  How many foaf:knows triples overall are there?
  Are there cycles?
  (presumably there are shared
   substrees p1->p2->p3 and p1->p4->p3
   as well)

If designing a system bottom-to-top for the :p+ and other path-likequeries, the data can be arranged in datastructures suitable topartitioning across CPUs. Neither the in-memory nor TDB data storage isdesigned for that. Mid-scale LPG systems are, typically (adjacencylayout). That design favours one set of use cases over others.

The actual work per step per node is small so if there is poorco-locality to a CPU, then there is a good chance that inter-threadcosts are going to dominate. Project Loom (java virtual threads) willhelp but not solve that because thread switching will be in the JVM, notthe OS kernel.

There is also the compute synchronization costs and related impact onCPU caches. Stalling to read memory (or competing for L2 cache) or tosynchronize between threads is expensive because it breaks up thesingle-thread execution pipeline on a core.


So the layout across cores benefits from clustering.

If it done during query execution, rather than ahead-of-time, that morecost.

It's a tradeoff - the other use of multiple cores is serving multiplerequests concurrently - the Fuseki case.


At the moment:

I'm sure that path evaluation could be improved.

A good first step is to check that there aren't any basic, silly errorsin the existing code (it wouldn't be the first time). If none, thenthere are specific algorithms like Floyd–Warshall (and I think there areparallel versions of that).


    Andy


On 05/11/2021 12:25, Jakub Jałowiec wrote:

I've noticed that the figure from my last message has not been sentproperly, here is a link to it: https://kubajal.github.io/covidepid/<https://kubajal.github.io/covidepid/>


Best regards,
Jakub

pt., 5 lis 2021 o 13:00 Jakub Jałowiec <[email protected]<mailto:[email protected]>> napisał(a):


    Hi Andy,
    thank for your input.
    Let me drill it down a little more so I understand it better. Let's
    say I have the following query:

        prefix foaf: <http://xmlns.com/foaf/0.1/
        <http://xmlns.com/foaf/0.1/>>
        prefix ex: <http://example.net/ <http://example.net/>>
        SELECT ?person1 (count(?label) as ?labelCounter)
        WHERE {
           ?p1 foaf:knows+ ?p2 .
           ?p2 ex:hasLabel ?label .
        }
        GROUP BY ?p1


    Let's also say I have multiple cores on my machine and I want to
    speed up a single query just by utilizing their parallelism. The
    bottleneck here is the complex property path (/foaf:knows+/).
    To speed things up I want to split the search space into chunks
    processed by each core. Intuitively to me the best solution would be
    to split by /p1/ equally between all the cores (e.g. if I have /N/
    persons and /k/ cores then each core receives /N///k/ persons to
    evaluate as /p1/). The figure below shows workload distribution I am
    trying to achieve (green circles are persons, brown arcs are
    instances of the /foaf:knows/ relationship, instances of
    /ex:hasLabel/ have been left out).
    cores.png
    The query engine would start the evaluation at a "person" node
    (/p1/ in the query) and then just do a closure of the /foaf:knows
    /relationship (/p1 foaf:knows+ p2/). This would require shared
    memory between all the threads.
    I have three questions:

     1. How would the SPARQL query engine know that it needs to split
        the workload in the 'per root of the pattern' manner and not in
        a different way? Is there a mechanism in the SPARQL interpreter
        for that?
     2. Can a single transaction be shared between multiple threads (cores)?
     3. Do I need a transaction if the threads I am running are
        guaranteed to not modify anything? (the query is a SELECT so it
        is 'read-only')

    Best regards,
    Jakub

    niedz., 31 paź 2021 o 11:29 Andy Seaborne <[email protected]
    <mailto:[email protected]>> napisał(a):

        Hi Jakub,

        The preferred way to have parallel actions on a dataset is via
        transactions.

        concurrency-howto covers threading within a transaction.
        Possible with
        further MRSW (multiple reader or single writer) locking.

        This is how Fuseki executes multiple requests.  Each HTTP
        request that
        is executing in true parallel is executed on a separate thread and
        inside a transaction.

        So have each thread start a transaction, execute as many sequential
        queries as it needs and end the transaction.

        In fact, only TDB2 enforces this; TDB1 only enforces it if it has
        already been used transactionally. Other datasets are
        multiple-reader
        safe anyway.  But placing inside a transaction is the correct way.

               Andy

        On 30/10/2021 15:44, Jakub Jałowiec wrote:
         > Dear community,
         > is there any high-level user interface to execute parallel
        SELECT queries in Apache Fuseki or the CLI of Apache Jena?
         > I've found a short note on parallelism in Apache Jena here:
        https://jena.apache.org/documentation/notes/concurrency-howto.html
        <https://jena.apache.org/documentation/notes/concurrency-howto.html>.
        But that is not really what I am looking for as it is a general
        note on how to implement low-level parallelism in Apache Jena.
         > I am interested in analytic benchmarking of Apache Jena.
        Ideally, I am looking for something that works out-of-the-box
        just for SELECT queries (no need to modify the model in a
        parallel fashion or synchronize state etc.).
         >
         > I'd appreciate any suggestions or pointing out to any
        resources as I am new to Apache Jena. I couldn't find a lot in
        the archives of the list using the keywords "parallelism" and
        "concurrent".
         >
         > Best regards,
         > Jakub
         >

Re: Parallel SELECT queries in Fuseki or Jena's CLI

Reply via email to