On Mon, Aug 23, 2010 at 9:43 PM, Richard Treumann <treum...@us.ibm.com> wrote:
> Bugs are always a possibility, but unless there is something very unusual
> about the cluster and interconnect, or this is an unstable version of MPI,
> it seems very unlikely that this use of MPI_Bcast with so few tasks and
> only a 1/2 MB message would trip on one. 80 tasks is a very small number
> in modern parallel computing. Thousands of tasks involved in an MPI
> collective have become pretty standard.
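For reference, the pattern being discussed is nothing exotic. A minimal standalone broadcast along these lines (a sketch, not the actual IMB source; the 512 KB payload and root rank 0 are just illustrative, taken from the sizes discussed above) shows the shape of it:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank, nranks;
        const int len = 512 * 1024;     /* ~1/2 MB, the size under discussion */
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        buf = malloc(len);
        if (rank == 0)
            memset(buf, 0xab, len);     /* root fills the payload */

        /* every rank participates; root 0 broadcasts to the other nranks-1 */
        MPI_Bcast(buf, len, MPI_CHAR, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("Bcast of %d bytes across %d ranks done\n", len, nranks);

        free(buf);
        MPI_Finalize();
        return 0;
    }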
Here's something absolutely strange that I stumbled upon by accident: I ran the test again but forgot to kill the user jobs already running on the test servers (via Torque and our usual queues). I was about to kick myself, but I could hardly believe it: the test actually completed! The timings are horribly bad, but for the first time the test runs to completion. How could this be happening? It makes no sense to me that the test completes when the cards+servers+network are loaded but not otherwise! I repeated the experiment many times and got the same result every time.

# /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast

[snip]

# Bcast
   #bytes  #repetitions   t_min[usec]   t_max[usec]   t_avg[usec]
        0          1000          0.02          0.02          0.02
        1            34     546807.94     626743.09     565196.07
        2            34      37159.11      52942.09      44910.73
        4            34      19777.97      40382.53      29656.53
        8            34      36060.21      53265.27      43909.68
       16            34      11765.59      31912.50      19611.75
       32            34      23530.79      41176.94      32532.89
       64            34      11735.91      23529.02      16552.16
      128            34      47998.44      59323.76      55164.14
      256            34      18121.96      30500.15      25528.95
      512            34      20072.76      33787.32      26786.55
     1024            34      39737.29      55589.97      45704.99
     2048             9      77787.56     150555.66     118741.83
     4096             9      44444.67     118331.78      77201.40
     8192             9      80835.66     166666.56     133781.08
    16384             9      77032.88     149890.66     119558.73
    32768             9     111819.45     177778.99     149048.91
    65536             9     159304.67     222298.99     195071.34
   131072             9     172941.13     262216.57     218351.14
   262144             9     161371.65     266703.79     223514.31
   524288             2        497.46    4402568.94    2183980.20
  1048576             2       5401.49    3519284.01    1947754.45
  2097152             2      75251.10    4137861.49    2220910.50
  4194304             2      33270.48    4601072.91    2173905.32

# All processes entering MPI_Finalize

Another observation: if I replace the openib BTL with the tcp BTL, the tests run OK (command sketch below).
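The BTL swap is done with the usual MCA parameters on the mpirun command line. The invocations below are illustrative (the -np value and BTL list are placeholders, not the exact commands run):

# openib BTL (the failing case):
mpirun --mca btl openib,self,sm -np 80 /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast

# tcp BTL instead (runs to completion):
mpirun --mca btl tcp,self,sm -np 80 /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast

-- Rahul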