Re: [OMPI devel] OpenMPI Performance Problem with Open|SpeedShop

William Hachfeld Tue, 13 Jan 2009 22:18:36 -0500


Jeff & George,

> Hum; interesting. I can't think of any reason why that would be a problem
> offhand. The mca_btl_sm_component_progress() function is the shared memory
> progression function. opal_progress() and mca_bml_r2_progress() are likely
> mainly dispatching off to this function.
>
> Does OSS interfere with shared memory between processes in any way? (I'm not
> enough of a kernel guy to know what the ramifications of ptrace and whatnot
> are)

Open|SS shouldn't interfere with shared memory. We use the pthread library to access some TLS, but no shared memory...

> There might be one reason to slowdown the application quite a bit. If the
> fact that you're using timer interact with the libevent (the library we're
> using to internally manage any kind of events), then we might end-up in the
> situation where we call the poll for every iteration in the event library.
> And this is really expensive.

I did contemplate the notion that maybe we were getting into the "progress monitoring" part of OpenMPI every time the timer interrupted the process (1000s of times per second). Can either of you see any mechanism by which that might happen?
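To make the scenario concrete, here is a minimal sketch (not Open|SS or Open MPI code) of a busy-polling loop being interrupted by a profiling interval timer. It assumes the sampling is driven by something like ITIMER_PROF/SIGPROF, which nothing in this thread confirms, and the names on_tick, busy_poll and run_sampled are hypothetical. It only counts how often the loop gets interrupted; it does not reproduce the slowdown.

```python
import signal
import time

ticks = 0  # number of timer interrupts seen so far


def on_tick(signum, frame):
    """Hypothetical sampling handler: just count the interrupt."""
    global ticks
    ticks += 1


def busy_poll(cpu_seconds):
    """Spin, like a polling progress loop, until cpu_seconds of CPU time pass."""
    iterations = 0
    deadline = time.process_time() + cpu_seconds
    while time.process_time() < deadline:
        iterations += 1
    return iterations


def run_sampled(cpu_seconds=0.2, interval=0.001):
    """Run busy_poll under an ITIMER_PROF timer firing every `interval` seconds.

    Assumption: a 1 ms interval stands in for "1000s of times per second".
    Unix only; must be called from the main thread.
    """
    global ticks
    ticks = 0
    # The handler is left installed afterwards; once the timer is disarmed
    # it simply never fires again.
    signal.signal(signal.SIGPROF, on_tick)
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    try:
        iterations = busy_poll(cpu_seconds)
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0, 0)  # disarm the timer
    return ticks, iterations


if __name__ == "__main__":
    t, n = run_sampled()
    print(f"{t} timer interrupts during {n} polling iterations")
```

If each interrupt somehow forced the interrupted code down a more expensive path, the cost would scale with the tick count reported here.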

> A quick way to figure out if this is that case is to run Open MPI without
> support for shared memory (--mca btl ^sm). This way we will call poll on a
> regular basis anyway, and if there is no difference between a normal run and
> a OSS one, we know at least where to start looking ...

I ran SMG2000 on an 8-CPU Yellowrail node in the two configurations and recorded the wall/cpu clock times as reported by SMG2000 itself:


"mpirun -np 8 smg2000 -n 32 64 64"

        Struct Interface, wall clock time = 0.042348 seconds
        Struct Interface, cpu clock time = 0.040000 seconds
        SMG Setup, wall clock time = 0.732441 seconds
        SMG Setup, cpu clock time = 0.730000 seconds
        SMG Solve, wall clock time = 6.881814 seconds
        SMG Solve, cpu clock time = 6.880000 seconds

"mpirun --mca btl ^sm -np 8 smg2000 -n 64 64 64"

        Struct Interface, wall clock time = 0.059137 seconds
        Struct Interface, cpu clock time = 0.060000 seconds
        SMG Setup, wall clock time = 0.931437 seconds
        SMG Setup, cpu clock time = 0.930000 seconds
        SMG Solve, wall clock time = 9.107343 seconds
        SMG Solve, cpu clock time = 9.110000 seconds

But running the application with the "--mca btl ^sm" option inside Open|SS also results in an extreme slowdown. I.e. it doesn't make any difference whether the shared memory transport is enabled or not. Open|SS reports time spent as follows (in case this helps pinpoint what is going on inside OpenMPI):


        Exclusive CPU
        time in seconds.    Function (defining location)

        364.050000          btl_openib_component_progress (libmpi.so.0)
        165.890000          mthca_poll_cq (libmthca-rdmav2.so)
        122.090000          pthread_spin_lock (libpthread.so.0)
        90.790000           opal_progress (libopen-pal.so.0)
        48.230000           mca_bml_r2_progress (libmpi.so.0)
        30.880000           ompi_request_wait_all (libmpi.so.0)
        9.780000            pthread_spin_unlock (libpthread.so.0)
        4.910000            mthca_free_srq_wqe (libmthca-rdmav2.so)
        4.910000            mthca_unlock_cqs (libmthca-rdmav2.so)
        4.730000            mthca_lock_cqs (libmthca-rdmav2.so)
        0.890000            __poll (libc.so.6)
        ...
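For a quick read of where the time goes, the rows above can be totalled per library. This is only a rough sketch: it covers just the rows listed (the report is truncated at "..."), and PROFILE, seconds_by_library and share_of_listed are hypothetical names, not part of Open|SS.

```python
from collections import defaultdict

# (exclusive CPU seconds, function, defining library), copied from the
# listed rows only; the truncated remainder of the report is not included.
PROFILE = [
    (364.05, "btl_openib_component_progress", "libmpi.so.0"),
    (165.89, "mthca_poll_cq", "libmthca-rdmav2.so"),
    (122.09, "pthread_spin_lock", "libpthread.so.0"),
    (90.79, "opal_progress", "libopen-pal.so.0"),
    (48.23, "mca_bml_r2_progress", "libmpi.so.0"),
    (30.88, "ompi_request_wait_all", "libmpi.so.0"),
    (9.78, "pthread_spin_unlock", "libpthread.so.0"),
    (4.91, "mthca_free_srq_wqe", "libmthca-rdmav2.so"),
    (4.91, "mthca_unlock_cqs", "libmthca-rdmav2.so"),
    (4.73, "mthca_lock_cqs", "libmthca-rdmav2.so"),
    (0.89, "__poll", "libc.so.6"),
]


def seconds_by_library(rows):
    """Sum exclusive CPU seconds per defining library."""
    totals = defaultdict(float)
    for seconds, _function, library in rows:
        totals[library] += seconds
    return dict(totals)


def share_of_listed(rows, function):
    """Fraction of the listed (not total) time spent in one function."""
    listed_total = sum(seconds for seconds, _f, _l in rows)
    spent = sum(seconds for seconds, f, _l in rows if f == function)
    return spent / listed_total


if __name__ == "__main__":
    by_library = seconds_by_library(PROFILE)
    for library, seconds in sorted(by_library.items(), key=lambda kv: -kv[1]):
        print(f"{seconds:10.2f}  {library}")
```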

Does this help at all?


-- Bill Hachfeld, The Open|SpeedShop Project
