On 11 June 2012 at 18:57, Aurélien Bouteiller wrote:

> Hi,
>
> If some mx devices are found, the logic is not only to use the mx BTL but
> also to use the mx MTL. You can try to disable this with --mca mtl ob1.

Sorry, I meant --mca pml ob1
> Aurelien
>
>
> On 11 June 2012 at 18:24, Yong Qin wrote:
>
>> Hi,
>>
>> We are migrating to Open MPI 1.6, but since 1.6 dropped support for the
>> Myricom GM driver, we have to switch to the MX driver. We have the
>> Myricom MX2G 1.2.16 driver installed. However, upon testing the new
>> build of Open MPI on a node without the actual Myrinet device, we are
>> getting the following segmentation fault.
>>
>> <---->
>> [yqin@n0007.scs00 ~]$ mpirun -np 2 -np 2 osu_bw
>> [n0007.scs00:03075] Error in mx_open_endpoint (error No MX device
>> entry in /dev.)
>> [n0007.scs00:03074] Error in mx_open_endpoint (error No MX device
>> entry in /dev.)
>> --------------------------------------------------------------------------
>> [[32626,1],0]: A high-performance Open MPI point-to-point messaging module
>> was unable to find any relevant network interfaces:
>>
>> Module: Myrinet/MX
>> Host: n0007.scs00
>>
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> [n0007:03074] *** Process received signal ***
>> [n0007:03074] Signal: Segmentation fault (11)
>> [n0007:03074] Signal code: Invalid permissions (2)
>> [n0007:03074] Failing at address: 0x2b9112128130
>> [n0007:03075] *** Process received signal ***
>> [n0007:03075] Signal: Segmentation fault (11)
>> [n0007:03075] Signal code: Invalid permissions (2)
>> [n0007:03075] Failing at address: 0x2b041c9f1130
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 1 with PID 3075 on node n0007.scs00
>> exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>> [n0007.scs00:03073] 1 more process has sent help message
>> help-mpi-btl-base.txt / btl:no-nics
>> [n0007.scs00:03073] Set MCA parameter "orte_base_help_aggregate" to 0
>> to see all help / error messages
>> <---->
>>
>> Excluding the MX BTL does not get us any further.
>>
>> <---->
>> [yqin@n0007.scs00 ~]$ mpirun -np 2 -mca btl ^mx -np 2 osu_bw
>> [n0007.scs00:03453] Error in mx_open_endpoint (error No MX device
>> entry in /dev.)
>> [n0007.scs00:03454] Error in mx_open_endpoint (error No MX device
>> entry in /dev.)
>> [n0007:03453] *** Process received signal ***
>> [n0007:03453] Signal: Segmentation fault (11)
>> [n0007:03453] Signal code: Address not mapped (1)
>> [n0007:03453] Failing at address: 0x2b3c1fe73130
>> [n0007:03454] *** Process received signal ***
>> [n0007:03454] Signal: Segmentation fault (11)
>> [n0007:03454] Signal code: Address not mapped (1)
>> [n0007:03454] Failing at address: 0x2b2431bf0130
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 1 with PID 3454 on node n0007.scs00
>> exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>> <---->
>>
>> If we use only designated BTLs such as SM and SELF, the binary runs but
>> still hits a segmentation fault towards the end.
>>
>> <---->
>> [yqin@n0007.scs00 ~]$ mpirun -np 2 -mca btl sm,self -np 2 osu_bw
>> [n0007.scs00:03460] Error in mx_open_endpoint (error No MX device
>> entry in /dev.)
>> [n0007.scs00:03461] Error in mx_open_endpoint (error No MX device
>> entry in /dev.)
>> # OSU MPI Bandwidth Test v3.3
>> # Size      Bandwidth (MB/s)
>> 1                    2.54
>> 2                    5.22
>> 4                   10.92
>> 8                   21.61
>> 16                  43.89
>> 32                  62.19
>> 64                 121.95
>> 128                212.28
>> 256                337.52
>> 512                516.67
>> 1024               701.29
>> 2048               845.69
>> 4096               836.45
>> 8192               934.31
>> 16384             1035.53
>> 32768             1186.90
>> 65536             1390.41
>> 131072            1519.14
>> 262144            1562.96
>> 524288            1596.78
>> 1048576           1611.48
>> 2097152           1616.09
>> 4194304           1620.47
>> [n0007:03461] *** Process received signal ***
>> [n0007:03460] *** Process received signal ***
>> [n0007:03460] Signal: Segmentation fault (11)
>> [n0007:03460] Signal code: Address not mapped (1)
>> [n0007:03460] Failing at address: 0x2acac044d130
>> [n0007:03461] Signal: Segmentation fault (11)
>> [n0007:03461] Signal code: Address not mapped (1)
>> [n0007:03461] Failing at address: 0x2b8bc4121130
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 3460 on node n0007.scs00
>> exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>> <---->
>>
>>
>> Can anybody shed some light here? It looks like ompi is trying to open
>> the MX device no matter what. This is on a fresh build of Open MPI 1.6
>> with "--with-mx --with-openib" options. We didn't have such an issue
>> with the old GM BTL.
>>
>> Thanks,
>>
>> Yong Qin
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> * Dr. Aurélien Bouteiller
> * Researcher at Innovative Computing Laboratory
> * University of Tennessee
> * 1122 Volunteer Boulevard, suite 309b
> * Knoxville, TN 37996
> * 865 974 9375
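
For reference, the corrected suggestion in this reply (--mca pml ob1, not --mca mtl ob1) could be tried roughly as below. This is only a sketch: it assumes a single-node test with the osu_bw binary from the original report on PATH, and it has not been checked against the reporter's 1.6 build, so it may or may not avoid the segmentation fault there. The variable names MCA_ARGS and CMD are made up for the example.

```shell
#!/bin/sh
# Sketch only: force the ob1 PML so that no MTL (including the MX MTL) is
# selected, and limit the BTLs to shared memory and self for a one-node test.
# osu_bw is the benchmark named in the original report; adjust as needed.
MCA_ARGS="--mca pml ob1 --mca btl sm,self"
CMD="mpirun -np 2 $MCA_ARGS osu_bw"

# Launch only when both mpirun and the benchmark are on PATH; otherwise
# just print the command line that would be used.
if command -v mpirun >/dev/null 2>&1 && command -v osu_bw >/dev/null 2>&1; then
    $CMD
else
    echo "would run: $CMD"
fi
```

The same selection can also be made through the environment (for example, export OMPI_MCA_pml=ob1) instead of on the mpirun command line.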