Jeff & George,


> Hum; interesting. I can't think of any reason why that would be a problem offhand. The
> mca_btl_sm_component_progress() function is the shared memory progression function.
> opal_progress() and mca_bml_r2_progress() are likely mainly dispatching off to this
> function.
>
> Does OSS interfere with shared memory between processes in any way? (I'm not enough
> of a kernel guy to know what the ramifications of ptrace and whatnot are)

Open|SS shouldn't interfere with shared memory. We use the pthread library to access some TLS, but no shared memory...
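
For reference, that per-thread access is along these lines (just a sketch; the key name and payload struct below are invented for illustration, not our actual data structures):

    /* Sketch of per-thread data access through the pthread TLS API.
     * Illustrative only: the key name and payload are hypothetical. */
    #include <pthread.h>
    #include <stdlib.h>

    typedef struct {
        unsigned long sample_count;   /* hypothetical per-thread payload */
    } oss_tls_t;

    static pthread_key_t  oss_key;
    static pthread_once_t oss_once = PTHREAD_ONCE_INIT;

    static void oss_make_key(void)
    {
        pthread_key_create(&oss_key, free);   /* free() reclaims per-thread data */
    }

    /* Return this thread's private data, creating it on first use. */
    static oss_tls_t *oss_get_tls(void)
    {
        oss_tls_t *tls;
        pthread_once(&oss_once, oss_make_key);
        tls = pthread_getspecific(oss_key);
        if (tls == NULL) {
            tls = calloc(1, sizeof(*tls));
            pthread_setspecific(oss_key, tls);
        }
        return tls;
    }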


> There might be one reason to slow down the application quite a bit. If your use of a timer
> interacts with libevent (the library we're using to internally manage any kind of events),
> then we might end up in the situation where we call poll for every iteration of the event
> library. And this is really expensive.

I did contemplate the notion that maybe we were getting into the "progress monitoring" part of OpenMPI every time the timer interrupted the process (1000s of times per second). Can either of you see any mechanism by which that might happen?
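
For context, the collector's interrupts come from a per-process interval timer, roughly like this (a sketch; the sampling rate and handler body are placeholders, not our actual collector code):

    /* Sketch of interval-timer sampling: SIGPROF fires at a fixed rate
     * while the program consumes CPU time. Illustrative only. */
    #include <signal.h>
    #include <string.h>
    #include <sys/time.h>

    static void sample_handler(int sig, siginfo_t *info, void *context)
    {
        /* Record the interrupted PC from 'context' here, doing only
         * async-signal-safe work. */
        (void)sig; (void)info; (void)context;
    }

    static void start_sampling(void)
    {
        struct sigaction sa;
        struct itimerval it;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = sample_handler;
        sa.sa_flags = SA_SIGINFO | SA_RESTART;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGPROF, &sa, NULL);

        /* e.g. 1000 samples per second of CPU time */
        it.it_interval.tv_sec  = 0;
        it.it_interval.tv_usec = 1000;
        it.it_value = it.it_interval;
        setitimer(ITIMER_PROF, &it, NULL);
    }

One detail that might matter: whether SA_RESTART is in effect determines whether a blocking call such as poll() is transparently restarted after each interrupt or returns early with EINTR.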


> A quick way to figure out if this is the case is to run Open MPI without support for shared
> memory (--mca btl ^sm). This way we will call poll on a regular basis anyway, and if there
> is no difference between a normal run and an OSS one, we know at least where to start
> looking ...

I ran SMG2000 on an 8-CPU Yellowrail node in the two configurations and recorded the wall/cpu clock times as reported by SMG2000 itself:

"mpirun -np 8 smg2000 -n 32 64 64"

        Struct Interface, wall clock time = 0.042348 seconds
        Struct Interface, cpu clock time = 0.040000 seconds
        SMG Setup, wall clock time = 0.732441 seconds
        SMG Setup, cpu clock time = 0.730000 seconds
        SMG Solve, wall clock time = 6.881814 seconds
        SMG Solve, cpu clock time = 6.880000 seconds

"mpirun --mca btl ^sm -np 8 smg2000 -n 64 64 64"

        Struct Interface, wall clock time = 0.059137 seconds
        Struct Interface, cpu clock time = 0.060000 seconds
        SMG Setup, wall clock time = 0.931437 seconds
        SMG Setup, cpu clock time = 0.930000 seconds
        SMG Solve, wall clock time = 9.107343 seconds
        SMG Solve, cpu clock time = 9.110000 seconds

But running the application with the "--mca btl ^sm" option inside Open|SS also results in an extreme slowdown. I.e., it makes no difference whether the shared memory transport is enabled or not. Open|SS reports the time spent as follows (in case this helps pinpoint what is going on inside OpenMPI):

        Exclusive CPU
        time in seconds.                Function (defining location)

        364.050000                      btl_openib_component_progress (libmpi.so.0)
        165.890000                      mthca_poll_cq (libmthca-rdmav2.so)
        122.090000                      pthread_spin_lock (libpthread.so.0)
        90.790000                       opal_progress (libopen-pal.so.0)
        48.230000                       mca_bml_r2_progress (libmpi.so.0)
        30.880000                       ompi_request_wait_all (libmpi.so.0)
        9.780000                        pthread_spin_unlock (libpthread.so.0)
        4.910000                        mthca_free_srq_wqe (libmthca-rdmav2.so)
        4.910000                        mthca_unlock_cqs (libmthca-rdmav2.so)
        4.730000                        mthca_lock_cqs (libmthca-rdmav2.so)
        0.890000                        __poll (libc.so.6)
        ...

Does this help at all?


-- Bill Hachfeld, The Open|SpeedShop Project





