Hi,

I'm using Open MPI on a Linux cluster (Itanium 64, Intel compilers, 8 processors per node (4 duals)) on which Open MPI is not the default (i.e. supported) MPI-2 implementation. Open MPI installed easily on the cluster, but I think there is a problem with the configuration.
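
For reference, I launch the runs with something like the following (the hostfile and executable names are just placeholders, and the exact options may differ from what matters here):

    # simplified launch command; names are placeholders
    mpirun -np 32 --hostfile my_hosts ./solver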

I'm using two MPI codes. The first is a CFD code with a master/slave structure; I have run some calculations on 128 processors (1 master process and 127 slaves), and Open MPI is slightly more efficient than the supported MPI-2 version.

I then moved to a second solver (radiant heat transfer). In this case all the processes do the same work. I have found that, after the initial data-reading phase, some processes start to work hard while the others are waiting for something (even though they too consume 99% of CPU). In fact, 15 processes out of 32 are working (all the processes show 99% CPU usage); as soon as they finish their calculation, another group of processes (12 of them) starts the job, and when those 12 start to finish, the remaining 4 do the work.

Looking at the computational times, I obtain the following with the official MPI-2 version on the cluster (the label "temps apres petits calculs" means "time after the small calculations"):

output.000: temps apres petits calculs =    170.445202827454
output.001: temps apres petits calculs =    170.657078027725
output.002: temps apres petits calculs =    168.880963802338
output.003: temps apres petits calculs =    172.611718893051
output.004: temps apres petits calculs =    169.420207977295
output.005: temps apres petits calculs =    168.880684852600
output.006: temps apres petits calculs =    170.222792863846
output.007: temps apres petits calculs =    172.987339973450
output.008: temps apres petits calculs =    170.321479082108
output.009: temps apres petits calculs =    167.417831182480
output.010: temps apres petits calculs =    170.633100032806
output.011: temps apres petits calculs =    168.988963842392
output.012: temps apres petits calculs =    166.893934011459
output.013: temps apres petits calculs =    169.844722032547
output.014: temps apres petits calculs =    169.541869163513
output.015: temps apres petits calculs =    166.023182868958
output.016: temps apres petits calculs =    166.047858953476
output.017: temps apres petits calculs =    166.298271894455
output.018: temps apres petits calculs =    166.990653991699
output.019: temps apres petits calculs =    170.565690040588
output.020: temps apres petits calculs =    170.455694913864
output.021: temps apres petits calculs =    170.545780897141
output.022: temps apres petits calculs =    165.962821960449
output.023: temps apres petits calculs =    169.934472084045
output.024: temps apres petits calculs =    170.169304847717
output.025: temps apres petits calculs =    172.316897153854
output.026: temps apres petits calculs =    166.030095100403
output.027: temps apres petits calculs =    168.219340801239
output.028: temps apres petits calculs =    165.486129045486
output.029: temps apres petits calculs =    165.923212051392
output.030: temps apres petits calculs =    165.996737957001
output.031: temps apres petits calculs =    167.544650793076

All the processes consume roughly the same CPU time.

With Open MPI I obtained:

output.000: temps apres petits calculs =    158.906322956085
output.001: temps apres petits calculs =    160.753660202026
output.002: temps apres petits calculs =    161.286659002304
output.003: temps apres petits calculs =    169.431221961975
output.004: temps apres petits calculs =    163.511161088943
output.005: temps apres petits calculs =    160.547757863998
output.006: temps apres petits calculs =    161.222673892975
output.007: temps apres petits calculs =    325.977787017822
output.008: temps apres petits calculs =    321.527663946152
output.009: temps apres petits calculs =    326.429191827774
output.010: temps apres petits calculs =    321.229686975479
output.011: temps apres petits calculs =    160.507288932800
output.012: temps apres petits calculs =    158.480596065521
output.013: temps apres petits calculs =    169.135869979858
output.014: temps apres petits calculs =    158.526450872421
output.015: temps apres petits calculs =    486.637645006180
output.016: temps apres petits calculs =    483.884088993073
output.017: temps apres petits calculs =    480.200496196747
output.018: temps apres petits calculs =    483.166898012161
output.019: temps apres petits calculs =    323.687628030777
output.020: temps apres petits calculs =    319.833092927933
output.021: temps apres petits calculs =    329.558218955994
output.022: temps apres petits calculs =    329.199027061462
output.023: temps apres petits calculs =    322.116630077362
output.024: temps apres petits calculs =    322.238983869553
output.025: temps apres petits calculs =    322.890433073044
output.026: temps apres petits calculs =    322.439801216125
output.027: temps apres petits calculs =    157.899522066116
output.028: temps apres petits calculs =    159.247365951538
output.029: temps apres petits calculs =    158.351451158524
output.030: temps apres petits calculs =    158.714610815048
output.031: temps apres petits calculs =    480.177379846573

15 processes have similar times (close to those obtained with the official MPI), then 12, then 4, as explained previously.

I suppose we need to tune the Open MPI configuration. Do you know how to do that?
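
I assume any tuning would go through MCA parameters on the mpirun command line, along the lines of the sketch below. The parameter shown (mpi_paffinity_alone, which I understand enables processor affinity) is only my guess at the kind of knob that might matter here, so please correct me if the relevant settings are different:

    # hypothetical tuning example; the MCA parameter is only a guess
    mpirun --mca mpi_paffinity_alone 1 -np 32 --hostfile my_hosts ./solver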

Thanks in advance

JC
