On Mon, Aug 23, 2010 at 9:43 PM, Richard Treumann <treum...@us.ibm.com> wrote:
> Bugs are always a possibility but unless there is something very unusual
> about the cluster and interconnect or this is an unstable version of MPI, it
> seems very unlikely this use of MPI_Bcast with so few tasks and only a 1/2
> MB message would trip on one.  80 tasks is a very small number in modern
> parallel computing.  Thousands of tasks involved in an MPI collective has
> become pretty standard.
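
(For context, the broadcast in question is nothing exotic; a
stripped-down version would look roughly like the sketch below. The
512 KB buffer and the repetition count are only my approximations of
the scenario, not the actual IMB code.)

/* Bare-bones MPI_Bcast sketch; 512 KB buffer and 100 repetitions are
 * approximations of the scenario discussed above, not the IMB code. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    const int count = 512 * 1024;      /* ~1/2 MB message of MPI_CHAR */
    char *buf = malloc(count);
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        memset(buf, 1, count);         /* root fills the buffer */

    for (i = 0; i < 100; i++)          /* repeat, as a benchmark would */
        MPI_Bcast(buf, count, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("all broadcasts completed\n");

    free(buf);
    MPI_Finalize();
    return 0;
}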

Here's something absolutely strange that I accidentally stumbled upon:

I ran the test again, but accidentally forgot to kill the user jobs
already running on the test servers (submitted via Torque and our
usual queues).
I was about to kick myself, but to my surprise the test actually
completed! The timings are horribly bad, but for the first time the
test runs to completion. How could this be happening?
It doesn't make sense to me that the test completes when the
cards+servers+network are loaded but not otherwise! But I repeated the
experiment many times and got the same result each time.

# /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast
[snip]
# Bcast
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.02         0.02
            1           34    546807.94    626743.09    565196.07
            2           34     37159.11     52942.09     44910.73
            4           34     19777.97     40382.53     29656.53
            8           34     36060.21     53265.27     43909.68
           16           34     11765.59     31912.50     19611.75
           32           34     23530.79     41176.94     32532.89
           64           34     11735.91     23529.02     16552.16
          128           34     47998.44     59323.76     55164.14
          256           34     18121.96     30500.15     25528.95
          512           34     20072.76     33787.32     26786.55
         1024           34     39737.29     55589.97     45704.99
         2048            9     77787.56    150555.66    118741.83
         4096            9     44444.67    118331.78     77201.40
         8192            9     80835.66    166666.56    133781.08
        16384            9     77032.88    149890.66    119558.73
        32768            9    111819.45    177778.99    149048.91
        65536            9    159304.67    222298.99    195071.34
       131072            9    172941.13    262216.57    218351.14
       262144            9    161371.65    266703.79    223514.31
       524288            2       497.46   4402568.94   2183980.20
      1048576            2      5401.49   3519284.01   1947754.45
      2097152            2     75251.10   4137861.49   2220910.50
      4194304            2     33270.48   4601072.91   2173905.32
# All processes entering MPI_Finalize

Another observation is that if I replace the openib BTL with the tcp
BTL, the tests run OK (example command lines below).
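
(For reference, this is roughly how I switch BTLs; the exact MCA
component list is from memory and the hostfile/np arguments are
elided:)

mpirun --mca btl openib,sm,self ... IMB-MPI1 -npmin 256 bcast   # does not complete
mpirun --mca btl tcp,sm,self ... IMB-MPI1 -npmin 256 bcast      # completes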


-- 
Rahul
