Java uses *many* threads. Simply run

ls /proc/<pid>/task

and you will be amazed at how many threads are used.
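For instance, here is a minimal sketch that lists the Java-level threads from inside the JVM (note that /proc/<pid>/task also counts native JVM threads such as GC and JIT workers, so the /proc count is usually even higher):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class CountThreads {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // even this trivial program reports several JVM threads
        // (Reference Handler, Finalizer, Signal Dispatcher, ...)
        System.out.println("java-level threads: " + threads.getThreadCount());
        for (long id : threads.getAllThreadIds()) {
            ThreadInfo info = threads.getThreadInfo(id);
            if (info != null) {
                System.out.println("  " + info.getThreadName());
            }
        }
    }
}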
Here is my guess,


from the point of view of a given MPI process:

in case 1, the main thread and all the other threads time-share a single core, so basically, when another thread is running, the main thread is blocked.

in case 2, some parallelism is possible if another MPI task is sleeping: the main thread keeps running while another thread runs on another core.

in case 3, the main thread can move from one core to another
=> cache flush
=> QPI traffic if the memory it uses is no longer local
so although there is more opportunity for parallelism, process migration can slow everything down


bottom line, even with one thread, cases 1 and 2 are quite different because Java uses so many threads per process, so I am not so surprised by the difference in performance.

if you have a chance, I suggest you write a similar program in C.
since only a few threads are used per process, I would guess cases 1 and 2 would become pretty close.

I also suggest that for cases 2 and 3, you bind processes to a socket (e.g. *--bind-to socket*) instead of not binding them at all.

Cheers,

Gilles

On 6/23/2016 2:41 PM, Saliya Ekanayake wrote:
Thank you, Gilles, for the quick response. The code comes from a clustering application, but let me try to explain simply what the pattern is. It's a bit longer than I expected.



The program follows a BSP pattern: /compute()/ followed by a collective /allreduce()/, iterated many times.
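Roughly, the skeleton looks like the sketch below (a minimal illustration assuming Open MPI's Java bindings; the problem size, iteration count, and compute body are placeholders, not our actual code):

import mpi.MPI;
import mpi.MPIException;

public class BspLoop {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        int n = 1024;                              // placeholder problem size
        double[] local  = new double[n];
        double[] global = new double[n];
        for (int iter = 0; iter < 100; iter++) {   // placeholder iteration count
            compute(local);                        // local BSP superstep
            // collective reduction across all ranks, every iteration
            MPI.COMM_WORLD.allReduce(local, global, n, MPI.DOUBLE, MPI.SUM);
        }
        MPI.Finalize();
    }

    static void compute(double[] buf) { /* application-specific work */ }
}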

Each process is a Java process with just the main thread. However, in Java, the process and the main thread have their own PIDs and act as two LWPs in Linux.

Now, let's take two binding scenarios. For simplicity, I'll assume a node with 2 sockets, each with 4 cores. The real one I ran on has 2 sockets with 12 cores each.

1. *--map-by ppr:8:node:PE=1 --bind-to core* results in something like below.

[Inline image 3]
where each process is bound to 1 core. The blue dots show the main thread in Java. It too is bound to the same core as its parent process by default.

2. *--map-by ppr:8:node --bind-to none* This is similar to 1, but now processes are not bound (or rather, bound to all cores). However, from within the program, we *explicitly bind the main thread to 1 core* (a sketch of the mechanism follows the image). It gives something like below.

[Inline image 4]
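For reference, here is a minimal sketch of one way to pin the calling thread from Java, via sched_setaffinity(2) through JNA (assumes Linux, JNA on the classpath, and at most 64 logical cores; an illustration of the mechanism, not our exact code):

import com.sun.jna.Library;
import com.sun.jna.Native;

public class BindMainThread {
    interface CLib extends Library {
        CLib INSTANCE = (CLib) Native.loadLibrary("c", CLib.class);
        // pid 0 means "the calling thread"; the mask is a cpu_set_t,
        // represented here as a single 64-bit word (<= 64 logical cores)
        int sched_setaffinity(int pid, int cpusetsize, long[] mask);
    }

    static void bindToCore(int core) {
        long[] mask = { 1L << core };
        if (CLib.INSTANCE.sched_setaffinity(0, 8, mask) != 0) {
            throw new IllegalStateException("sched_setaffinity failed");
        }
    }

    public static void main(String[] args) {
        bindToCore(Integer.parseInt(args[0]));  // core id from the command line
        // ... run the compute/allreduce loop pinned to that core ...
    }
}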
The results we got suggest approach 2 gives better communication performance than 1. The btl used is openib. Here's a graph showing the variation in timings; it also shows cases that use more than 1 thread for the computation. In all patterns, communication is done through the main thread only.

What is peculiar is the two points within the dotted circle. Intuitively they should overlap, as each Java process has only the main thread and that main thread is bound to 1 core. The difference is how the parent process is bound by MPI. The red line is for *Case 1* above and the blue is for *Case 2*.
The green line is when both the parent process and the threads are unbound.


[Inline image 6]





On Thu, Jun 23, 2016 at 12:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

    Can you please provide more details on your config, how the tests
    are performed, and the results?


    to be fair, you should only compare cases in which MPI tasks are
    bound to the same sockets.

    for example, if socket0 has core[0-7] and socket1 has core[8-15]

    it is fair to compare {task0,task1} bound on

    {0,8}, {[0-1],[8-9]}, {[0-7],[8-15]}

    but it is unfair to compare

    {0,1} and {0,8} or {[0-7],[8-15]}

    since {0,1} does not involve traffic on the QPI, but {0,8} does.
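
    for instance (assuming Open MPI 1.8-style options, and cores numbered
    consecutively within each socket), the two placements could be
    reproduced with:

    mpirun -np 2 --map-by ppr:1:socket:PE=1 --bind-to core ./prog  # {0},{8}
    mpirun -np 2 --map-by core --bind-to core ./prog               # {0},{1}

    the first spreads one task per socket, the second packs both tasks
    on socket0, so only compare runs that use the same placement.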

    depending on the btl you are using, it might or might not involve
    another "helper" thread.
    if your task is bound to one core, and assuming there is no SMT,
    then the task and the helper thread do time sharing.
    but if the task is bound to more than one core, then the task and
    the helper thread can run in parallel.


    Cheers,

    Gilles

    On 6/23/2016 1:21 PM, Saliya Ekanayake wrote:
    Hi,

    I am trying to understand this peculiar behavior where the
    communication time in Open MPI changes depending on the number of
    processing elements (cores) the process is bound to.

    Is this expected?

    Thank you,
    saliya

    --
    Saliya Ekanayake
    Ph.D. Candidate | Research Assistant
    School of Informatics and Computing | Digital Science Center
    Indiana University, Bloomington









--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington



