Java uses *many* threads. Simply run

ls /proc/<pid>/task

and you will be amazed at how many threads are used.
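For instance, here is a minimal sketch that lists the Java-level threads from inside the JVM (note that /proc/<pid>/task also counts native JVM threads such as GC and JIT workers, so the /proc count is usually even higher):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class CountThreads {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // even this trivial program reports several JVM threads
        // (Reference Handler, Finalizer, Signal Dispatcher, ...)
        System.out.println("java-level threads: " + threads.getThreadCount());
        for (long id : threads.getAllThreadIds()) {
            ThreadInfo info = threads.getThreadInfo(id);
            if (info != null) {
                System.out.println("  " + info.getThreadName());
            }
        }
    }
}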
Here is my guess,


from the point of view of a given MPI process:

in case 1, the main thread and all the other threads time-share a single core, so basically, when another thread is running, the main thread is blocked.

in case 2, some parallelism is possible if another MPI task is sleeping: the main thread keeps running while another thread runs on another core.

in case 3, the main thread can move from one core to another
=> cache flush
=> QPI traffic if the memory it uses is no longer local
so although there is more opportunity for parallelism, process migration can slow everything down


bottom line, even with one thread, cases 1 and 2 are quite different because Java uses so many threads per process, so I am not so surprised by the difference in performance.

if you have a chance, I suggest you write a similar program in C.
since only a few threads are used per process, I would guess cases 1 and 2 would become pretty close.

I also suggest that for cases 2 and 3, you bind processes to a socket (e.g. *--bind-to socket*) instead of not binding them at all.

Cheers,

Gilles

On 6/23/2016 2:41 PM, Saliya Ekanayake wrote:
Thank you, Gilles, for the quick response. The code comes from a clustering application, but let me try to explain simply what the pattern is. It's a bit longer than I expected.



The program follows a BSP pattern: /compute()/ followed by a collective /allreduce()/, iterated many times.
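Roughly, the skeleton looks like the sketch below (a minimal illustration assuming Open MPI's Java bindings; the problem size, iteration count, and compute body are placeholders, not our actual code):

import mpi.MPI;
import mpi.MPIException;

public class BspLoop {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        int n = 1024;                              // placeholder problem size
        double[] local  = new double[n];
        double[] global = new double[n];
        for (int iter = 0; iter < 100; iter++) {   // placeholder iteration count
            compute(local);                        // local BSP superstep
            // collective reduction across all ranks, every iteration
            MPI.COMM_WORLD.allReduce(local, global, n, MPI.DOUBLE, MPI.SUM);
        }
        MPI.Finalize();
    }

    static void compute(double[] buf) { /* application-specific work */ }
}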

Each process is a Java process with just the main thread. However, in Java, the process and the main thread have their own PIDs and act as two LWPs in Linux.

Now, let's take two binding scenarios. For simplicity, I'll assume a node with 2 sockets, each with 4 cores. The real one I ran on has 2 sockets with 12 cores each.

1. *--map-by ppr:8:node:PE=1 --bind-to core* results in something like below.

[Inline image 3]
where each process is bound to 1 core. The blue dots show the main thread in Java. It too is bound to the same core as its parent process by default.

2. *--map-by ppr:8:node --bind-to none* This is similar to 1, but now processes are not bound (or rather, bound to all cores). However, from within the program, we *explicitly bind the main thread to 1 core* (a sketch of the mechanism follows the image). It gives something like below.

[Inline image 4]
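For reference, here is a minimal sketch of one way to pin the calling thread from Java, via sched_setaffinity(2) through JNA (assumes Linux, JNA on the classpath, and at most 64 logical cores; an illustration of the mechanism, not our exact code):

import com.sun.jna.Library;
import com.sun.jna.Native;

public class BindMainThread {
    interface CLib extends Library {
        CLib INSTANCE = (CLib) Native.loadLibrary("c", CLib.class);
        // pid 0 means "the calling thread"; the mask is a cpu_set_t,
        // represented here as a single 64-bit word (<= 64 logical cores)
        int sched_setaffinity(int pid, int cpusetsize, long[] mask);
    }

    static void bindToCore(int core) {
        long[] mask = { 1L << core };
        if (CLib.INSTANCE.sched_setaffinity(0, 8, mask) != 0) {
            throw new IllegalStateException("sched_setaffinity failed");
        }
    }

    public static void main(String[] args) {
        bindToCore(Integer.parseInt(args[0]));  // core id from the command line
        // ... run the compute/allreduce loop pinned to that core ...
    }
}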
The results we got suggest approach 2 gives better communication performance than 1. The btl used is openib. Here's a graph showing the variation in timings; it also shows cases that use more than 1 thread for the computation. In all patterns, communication is done through the main thread only.

What is peculiar is the two points within the dotted circle. Intuitively they should overlap, as each Java process has only the main thread and that main thread is bound to 1 core. The difference is how the parent process is bound by MPI. The red line is for *Case 1* above and the blue is for *Case 2*.
The green line is when both the parent process and the threads are unbound.


[Inline image 6]





On Thu, Jun 23, 2016 at 12:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

    Can you please provide more details on your config, how the tests
    are performed, and the results?


    to be fair, you should only compare cases in which MPI tasks are
    bound to the same sockets.

    for example, if socket0 has core[0-7] and socket1 has core[8-15]

    it is fair to compare {task0,task1} bound on

    {0,8}, {[0-1],[8-9]}, {[0-7],[8-15]}

    but it is unfair to compare

    {0,1} and {0,8} or {[0-7],[8-15]}

    since {0,1} does not involve traffic on the QPI, but {0,8} does.
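
    for instance (assuming Open MPI 1.8-style options, and cores numbered
    consecutively within each socket), the two placements could be
    reproduced with:

    mpirun -np 2 --map-by ppr:1:socket:PE=1 --bind-to core ./prog  # {0},{8}
    mpirun -np 2 --map-by core --bind-to core ./prog               # {0},{1}

    the first spreads one task per socket, the second packs both tasks
    on socket0, so only compare runs that use the same placement.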

    depending on the btl you are using, it might or might not involve
    another "helper" thread.
    if your task is bound to one core, and assuming there is no SMT,
    then the task and the helper thread do time sharing.
    but if the task is bound to more than one core, then the task and
    the helper thread can run in parallel.


    Cheers,

    Gilles

    On 6/23/2016 1:21 PM, Saliya Ekanayake wrote:
    Hi,

    I am trying to understand this peculiar behavior where the
    communication time in Open MPI changes depending on the number of
    processing elements (cores) the process is bound to.

    Is this expected?

    Thank you,
    saliya

    --
    Saliya Ekanayake
    Ph.D. Candidate | Research Assistant
    School of Informatics and Computing | Digital Science Center
    Indiana University, Bloomington









--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington



