Hi,
I'm using Open MPI on a Linux cluster (Itanium 64, Intel compilers, 8
processors per node (4 dual)) on which Open MPI is not the default (I
mean supported) MPI-2 implementation. Open MPI installed easily on the
cluster, but I think there is a problem with the configuration.
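For reference, I start the 32-process runs in the standard way, with
something like the following (the hostfile contents, node names, and
binary name are placeholders, not my actual setup):

    # hostfile: 4 of the 8-way nodes
    node01 slots=8
    node02 slots=8
    node03 slots=8
    node04 slots=8

    mpirun -np 32 -hostfile ./hostfile ./solver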
I'm using two MPI codes. The first is a CFD code with a master/slave
structure; I have run some calculations on 128 processors, with 1
master process and 127 slaves. Open MPI is slightly more efficient
than the supported MPI-2 version.
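By master/slave I mean the usual pattern, roughly like the skeleton
below (an illustrative sketch only, not the actual CFD code):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* master: hand one work item to each slave, collect results */
            for (int i = 1; i < size; i++) {
                double work = (double)i;
                MPI_Send(&work, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD);
            }
            for (int i = 1; i < size; i++) {
                double result;
                MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        } else {
            /* slave: receive a work item, compute, send the result back */
            double work, result;
            MPI_Recv(&work, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            result = work * work;  /* stands in for the CFD computation */
            MPI_Send(&result, 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }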
Then I moved to a second solver (radiative heat transfer). In this
case all the processes do the same thing. I have found that after the
initial data-reading phase, some processes start to work hard while
the others are waiting for something, even though they too are
consuming 99% of a CPU. Concretely, 15 processes out of 32 do the
work first (with all 32 at 99% CPU); as soon as they finish the
calculation, the next group of 12 processes does the job, and when
those 12 finish, the remaining 4 do theirs.
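(The string "temps apres petits calculs" below is the solver's own
output label, "time after small calculations".) The numbers are
per-rank wall-clock times for this phase, gathered essentially like
this (an illustrative sketch; the real solver writes one output.NNN
file per rank):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        /* ... the radiative solver phase: every rank does the same work ... */
        double t1 = MPI_Wtime();

        /* the real code writes this line to its own output.NNN file */
        printf("rank %d: temps apres petits calculs = %f\n", rank, t1 - t0);

        MPI_Finalize();
        return 0;
    }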
Looking at the computational times, with the official MPI-2 version
on the cluster I obtain:
output.000: temps apres petits calculs = 170.445202827454
output.001: temps apres petits calculs = 170.657078027725
output.002: temps apres petits calculs = 168.880963802338
output.003: temps apres petits calculs = 172.611718893051
output.004: temps apres petits calculs = 169.420207977295
output.005: temps apres petits calculs = 168.880684852600
output.006: temps apres petits calculs = 170.222792863846
output.007: temps apres petits calculs = 172.987339973450
output.008: temps apres petits calculs = 170.321479082108
output.009: temps apres petits calculs = 167.417831182480
output.010: temps apres petits calculs = 170.633100032806
output.011: temps apres petits calculs = 168.988963842392
output.012: temps apres petits calculs = 166.893934011459
output.013: temps apres petits calculs = 169.844722032547
output.014: temps apres petits calculs = 169.541869163513
output.015: temps apres petits calculs = 166.023182868958
output.016: temps apres petits calculs = 166.047858953476
output.017: temps apres petits calculs = 166.298271894455
output.018: temps apres petits calculs = 166.990653991699
output.019: temps apres petits calculs = 170.565690040588
output.020: temps apres petits calculs = 170.455694913864
output.021: temps apres petits calculs = 170.545780897141
output.022: temps apres petits calculs = 165.962821960449
output.023: temps apres petits calculs = 169.934472084045
output.024: temps apres petits calculs = 170.169304847717
output.025: temps apres petits calculs = 172.316897153854
output.026: temps apres petits calculs = 166.030095100403
output.027: temps apres petits calculs = 168.219340801239
output.028: temps apres petits calculs = 165.486129045486
output.029: temps apres petits calculs = 165.923212051392
output.030: temps apres petits calculs = 165.996737957001
output.031: temps apres petits calculs = 167.544650793076
All the processes consume more or less the same CPU time. With
Open MPI I obtained:
output.000: temps apres petits calculs = 158.906322956085
output.001: temps apres petits calculs = 160.753660202026
output.002: temps apres petits calculs = 161.286659002304
output.003: temps apres petits calculs = 169.431221961975
output.004: temps apres petits calculs = 163.511161088943
output.005: temps apres petits calculs = 160.547757863998
output.006: temps apres petits calculs = 161.222673892975
output.007: temps apres petits calculs = 325.977787017822
output.008: temps apres petits calculs = 321.527663946152
output.009: temps apres petits calculs = 326.429191827774
output.010: temps apres petits calculs = 321.229686975479
output.011: temps apres petits calculs = 160.507288932800
output.012: temps apres petits calculs = 158.480596065521
output.013: temps apres petits calculs = 169.135869979858
output.014: temps apres petits calculs = 158.526450872421
output.015: temps apres petits calculs = 486.637645006180
output.016: temps apres petits calculs = 483.884088993073
output.017: temps apres petits calculs = 480.200496196747
output.018: temps apres petits calculs = 483.166898012161
output.019: temps apres petits calculs = 323.687628030777
output.020: temps apres petits calculs = 319.833092927933
output.021: temps apres petits calculs = 329.558218955994
output.022: temps apres petits calculs = 329.199027061462
output.023: temps apres petits calculs = 322.116630077362
output.024: temps apres petits calculs = 322.238983869553
output.025: temps apres petits calculs = 322.890433073044
output.026: temps apres petits calculs = 322.439801216125
output.027: temps apres petits calculs = 157.899522066116
output.028: temps apres petits calculs = 159.247365951538
output.029: temps apres petits calculs = 158.351451158524
output.030: temps apres petits calculs = 158.714610815048
output.031: temps apres petits calculs = 480.177379846573
15 processes have times similar to those obtained with the official
MPI (~160 s), then 12 take roughly twice as long (~320 s), then 4
roughly three times as long (~480 s), as explained previously.
I suppose that the Open MPI configuration needs to be tuned. Do you
know how to do this?
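For example, I don't know whether process placement or the
yield-on-idle behaviour matters here; would options along these lines
be relevant (assuming they apply to the installed version)?

    # spread ranks round-robin across nodes instead of filling each node
    mpirun -np 32 -hostfile ./hostfile --bynode ./solver

    # let waiting processes yield the CPU instead of spinning at 99%
    mpirun -np 32 -hostfile ./hostfile --mca mpi_yield_when_idle 1 ./solver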
Thanks in advance
JC