Hello,

We have recently enhanced our network with InfiniBand modules on a six-node cluster.
We have installed all the OFED drivers for our hardware and set up the network IPs as follows:

- eth : 192.168.1.0 / 255.255.255.0
- ib  : 192.168.70.0 / 255.255.255.0

After the first tests everything seemed good: the IB interfaces ping each other, and ssh and other kinds of exchanges over IB work well. We then started to run our jobs through Open MPI (built with the --with-openib option), and our first results were very bad. After investigation, our system shows the following behaviour:

- the job starts over the ib network (a few packets are sent)
- the job then switches to the eth network (all subsequent packets are sent to those interfaces)

We never specified the IP addresses of our eth interfaces. We tried to launch our jobs with the following options:

- mpirun -hostfile hostfile.list -mca blt openib,self /common_gfs2/script-test.sh
- mpirun -hostfile hostfile.list -mca blt openib,sm,self /common_gfs2/script-test.sh
- mpirun -hostfile hostfile.list -mca blt openib,self -mca btl_tcp_if_exclude lo,eth0,eth1,eth2 /common_gfs2/script-test.sh

The final behaviour remains the same: the job is initiated over ib and runs over eth. We ran the performance test programs (osu_bw and osu_latency) and got reasonable results (see attached files). We have tried plenty of different things, but we are stuck: we don't get any error message...

Thanks in advance for your help.

Thierry.
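One detail worth double-checking in the commands above: the MCA parameter is spelled "btl", not "blt", and Open MPI silently ignores MCA parameter names it does not recognize. A misspelled "-mca blt ..." would leave the default BTL selection (which includes TCP) in effect, which would match the observed fallback to the eth interfaces. A minimal sketch of a corrected launch, reusing the hostfile and script path from the post and adding verbosity (an assumed debugging step, not from the original) to confirm which BTLs each process actually opens:

```shell
# Corrected launch: the parameter is "btl" (not "blt").
# Restricting Open MPI to the InfiniBand (openib), shared-memory (sm),
# and self BTLs removes TCP from the candidate list entirely.
# btl_base_verbose makes each process report which BTL components it opens.
mpirun -hostfile hostfile.list \
       -mca btl openib,sm,self \
       -mca btl_base_verbose 30 \
       /common_gfs2/script-test.sh
```

With "btl openib,sm,self" and no tcp entry, a rank that cannot bring up the openib BTL should fail at startup instead of silently switching to Ethernet, which makes any remaining misconfiguration visible.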
# OSU MPI Latency Test (Version 2.0)
# Size        Latency (us)
0             9.39
1             8.98
2             6.92
4             6.94
8             6.94
16            6.99
32            7.09
64            7.30
128           7.56
256           7.70
512           8.27
1024          9.38
2048          12.14
4096          14.51
8192          19.79
16384         43.00
32768         64.82
65536         104.82
131072        164.28
262144        293.86
524288        536.71
1048576       1049.46
2097152       2213.57
4194304       3686.72
# OSU MPI Bandwidth Test (Version 2.0)
# Size        Bandwidth (MB/s)
1             0.180975
2             0.365537
4             0.730864
8             1.461231
16            2.920952
32            5.793988
64            11.254934
128           27.403607
256           55.811413
512           109.614427
1024          210.083847
2048          329.558204
4096          506.783138
8192          749.913297
16384         570.730147
32768         794.796561
65536         968.103658
131072        990.723946
262144        1009.216695
524288        1032.053241
1048576       1063.046034
2097152       1209.998818
4194304       1346.575306
HSN Codes
----------------------------------------
Summary of Results
TCP=GigE   GM/MX=Myrinet   IBV/VAPI/UDAPL/PSM=Infiniband
------------------------------------------------------------------------------

Maximum Performance
-------------------
GigE    : 57 usec      HSN-PSM : 2 usec
GigE    : 102 MB/s     HSN-PSM : 1134 MB/s

Average Performance
-------------------
GigE    : 57 usec      HSN-PSM : 2 usec
GigE    : 101 MB/s     HSN-PSM : 1124 MB/s

Minimum Performance
-------------------
GigE    : 57 usec      HSN-PSM : 2 usec
GigE    : 100 MB/s     HSN-PSM : 1115 MB/s