Dear experts,

I am running the Mageia 2 Linux distribution, which comes with kernel 3.3.6.

I downloaded the OFED 1.5.4.1 drivers, then compiled and installed (** with a lot of pain and spec file modifications **) some of the RPMs:

    infiniband-diags-1.5.13-1.x86_64.rpm
    infiniband-diags-debug-1.5.13-1.x86_64.rpm
    libibmad-1.3.8-1.x86_64.rpm
    libibmad-debug-1.3.8-1.x86_64.rpm
    libibmad-devel-1.3.8-1.x86_64.rpm
    libibmad-static-1.3.8-1.x86_64.rpm
    libibumad-1.3.7-1.x86_64.rpm
    libibumad-debug-1.3.7-1.x86_64.rpm
    libibumad-devel-1.3.7-1.x86_64.rpm
    libibumad-static-1.3.7-1.x86_64.rpm
    libibverbs-1.1.4-1.24.gb89d4d7.x86_64.rpm
    libibverbs-debug-1.1.4-1.24.gb89d4d7.x86_64.rpm
    libibverbs-devel-1.1.4-1.24.gb89d4d7.x86_64.rpm
    libibverbs-devel-static-1.1.4-1.24.gb89d4d7.x86_64.rpm
    libibverbs-utils-1.1.4-1.24.gb89d4d7.x86_64.rpm
    libmlx4-1.0.1-1.20.g6771d22.x86_64.rpm
    libmlx4-debug-1.0.1-1.20.g6771d22.x86_64.rpm
    libmlx4-devel-1.0.1-1.20.g6771d22.x86_64.rpm
    mstflint-1.4-1.18.g1adcfbf.x86_64.rpm
    mstflint-debug-1.4-1.18.g1adcfbf.x86_64.rpm
    opensm-3.3.13-1.x86_64.rpm
    opensm-debug-3.3.13-1.x86_64.rpm
    opensm-devel-3.3.13-1.x86_64.rpm
    opensm-libs-3.3.13-1.x86_64.rpm
    opensm-static-3.3.13-1.x86_64.rpm

But I was **not** able to compile the ofa kernel itself. I then tried to use, instead, all the corresponding modules that come with my stock distribution kernel (3.3.6).

After initializing (correctly, I guess) all the necessary Mellanox components (openibd, opensm, etc.), I can see my Mellanox cards with the command ibv_devinfo.

I get the following output on all the computers that have a Mellanox card:

1) ibv_devinfo

    kerkira:% ibv_devinfo
    hca_id: mlx4_0
        transport:          InfiniBand (0)
        fw_ver:             2.7.000
        node_guid:          0002:c903:0009:d1b2
        sys_image_guid:     0002:c903:0009:d1b5
        vendor_id:          0x02c9
        vendor_part_id:     26428
        hw_ver:             0xA0
        board_id:           MT_0C40110009
        phys_port_cnt:      1
            port:   1
                state:          PORT_ACTIVE (4)
                max_mtu:        2048 (4)
                active_mtu:     2048 (4)
                sm_lid:         8
                port_lid:       8
                port_lmc:       0x00
                link_layer:     IB

2) ibstatus

    kerkira:% /usr/sbin/ibstatus
    Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:0009:d1b3
        base lid:        0x8
        sm lid:          0x8
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      InfiniBand

QUESTION:
==> According to these outputs, can we say that my computers are correctly using the mlx4 drivers that come with my kernel 3.3.6?

Probably not, because I cannot communicate between two machines using MPI. Here are the details:

I compiled and installed MVAPICH2, but I could not run the "osu_bw" program between two machines. I get:

    kerkira% mpirun_rsh -np 2 kerkira amos ./osu_bw
    [cli_0]: aborting job: Fatal error in MPI_Init: Other MPI error
    [kerkira:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
    [kerkira:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
    [kerkira:mpispawn_0][child_handler] MPI process (rank: 0, pid: 5396) exited with status 1
    [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error
    [amos:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
    [amos:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
    [amos:mpispawn_1][child_handler] MPI process (rank: 1, pid: 6733) exited with status 1
    [amos:mpispawn_1][report_error] connect() failed: Connection refused (111)

Now, if I run both processes on the **same** machine, I get the expected results:

    kerkira% mpirun_rsh -np 2 kerkira kerkira ./osu_bw
    # OSU MPI Bandwidth Test v3.6
    # Size        Bandwidth (MB/s)
    1                   5.47
    2                  11.34
    4                  22.84
    8                  45.89
    16                 91.52
    32                180.27
    64                350.68
    128               661.78
    256              1274.94
    512              2283.42
    1024             3936.39
    2048             6362.91
    4096             9159.54
    8192            10737.42
    16384            9246.39
    32768            8869.26
    65536            8707.28
    131072           8942.07
    262144           9009.39
    524288           9060.31
    1048576          9080.17
    2097152          5702.06

(Note: ssh between the machines kerkira and amos works correctly without a password.)

QUESTION:
==> Why do MPI programs not work between two machines?
==> Is it because I use the mlx4/umad/etc. modules from my distribution kernel and not the OFED kernel-ib?

Thanks in advance for your help.

Jean-Charles Lambert.
_______________________________________________
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg