On 11 June 2012 at 18:57, Aurélien Bouteiller wrote:

> Hi,
> 
> If some mx devices are found, the logic is not only to use the mx BTL but 
> also to use the mx MTL. You can try to disable this with --mca mtl ob1. 
> 
Sorry, I meant --mca pml ob1
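
A minimal sketch of how this suggestion could be applied only on nodes that lack an MX device. The helper name `mx_fallback_flags` and the `mx*` device-entry pattern are assumptions for illustration, not taken from this thread, and whether forcing the ob1 PML actually avoids the segmentation fault reported below has not been confirmed here.

```shell
#!/bin/sh
# Print extra mpirun flags for a node, depending on whether any MX device
# entries exist. The "mx*" name pattern is an assumption; adjust it to
# match the local MX installation.
mx_fallback_flags() {
    devdir=${1:-/dev}
    for entry in "$devdir"/mx*; do
        if [ -e "$entry" ]; then
            # An MX device entry exists: leave Open MPI's defaults alone.
            return 0
        fi
    done
    # No MX device entry: force the ob1 PML so the MX MTL is not selected,
    # and exclude the MX BTL as in the original report.
    printf '%s\n' '--mca pml ob1 --mca btl ^mx'
}

# Example: show the command that would be run, without running it.
echo "mpirun -np 2 $(mx_fallback_flags) osu_bw"
```

The same PML selection can also be supplied through the environment as OMPI_MCA_pml=ob1 instead of on the mpirun command line.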

> Aurelien
> 
> 
> 
> 
> On 11 June 2012 at 18:24, Yong Qin wrote:
> 
>> Hi,
>> 
>> We are migrating to Open MPI 1.6, but since 1.6 dropped support for the
>> Myricom GM driver, we have to switch to the MX driver. We have the
>> Myricom MX2G 1.2.16 driver installed. However, when testing the new
>> build of Open MPI on a node without an actual Myrinet device, we
>> get the following segmentation fault.
>> 
>> <---->
>> [yqin@n0007.scs00 ~]$ mpirun -np 2  -np 2 osu_bw
>> [n0007.scs00:03075] Error in mx_open_endpoint (error No MX device
>> entry in /dev.)
>> [n0007.scs00:03074] Error in mx_open_endpoint (error No MX device
>> entry in /dev.)
>> --------------------------------------------------------------------------
>> [[32626,1],0]: A high-performance Open MPI point-to-point messaging module
>> was unable to find any relevant network interfaces:
>> 
>> Module: Myrinet/MX
>> Host: n0007.scs00
>> 
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> [n0007:03074] *** Process received signal ***
>> [n0007:03074] Signal: Segmentation fault (11)
>> [n0007:03074] Signal code: Invalid permissions (2)
>> [n0007:03074] Failing at address: 0x2b9112128130
>> [n0007:03075] *** Process received signal ***
>> [n0007:03075] Signal: Segmentation fault (11)
>> [n0007:03075] Signal code: Invalid permissions (2)
>> [n0007:03075] Failing at address: 0x2b041c9f1130
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 1 with PID 3075 on node n0007.scs00
>> exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>> [n0007.scs00:03073] 1 more process has sent help message
>> help-mpi-btl-base.txt / btl:no-nics
>> [n0007.scs00:03073] Set MCA parameter "orte_base_help_aggregate" to 0
>> to see all help / error messages
>> <---->
>> 
>> Excluding the MX BTL does not get us any further.
>> 
>> <---->
>> [yqin@n0007.scs00 ~]$ mpirun -np 2 -mca btl ^mx -np 2 osu_bw
>> [n0007.scs00:03453] Error in mx_open_endpoint (error No MX device
>> entry in /dev.)
>> [n0007.scs00:03454] Error in mx_open_endpoint (error No MX device
>> entry in /dev.)
>> [n0007:03453] *** Process received signal ***
>> [n0007:03453] Signal: Segmentation fault (11)
>> [n0007:03453] Signal code: Address not mapped (1)
>> [n0007:03453] Failing at address: 0x2b3c1fe73130
>> [n0007:03454] *** Process received signal ***
>> [n0007:03454] Signal: Segmentation fault (11)
>> [n0007:03454] Signal code: Address not mapped (1)
>> [n0007:03454] Failing at address: 0x2b2431bf0130
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 1 with PID 3454 on node n0007.scs00
>> exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>> <---->
>> 
>> If we select only specific BTLs such as SM and SELF, the binary runs but
>> still hits a segmentation fault towards the end.
>> 
>> <---->
>> [yqin@n0007.scs00 ~]$ mpirun -np 2 -mca btl sm,self -np 2 osu_bw
>> [n0007.scs00:03460] Error in mx_open_endpoint (error No MX device
>> entry in /dev.)
>> [n0007.scs00:03461] Error in mx_open_endpoint (error No MX device
>> entry in /dev.)
>> # OSU MPI Bandwidth Test v3.3
>> # Size        Bandwidth (MB/s)
>> 1                         2.54
>> 2                         5.22
>> 4                        10.92
>> 8                        21.61
>> 16                       43.89
>> 32                       62.19
>> 64                      121.95
>> 128                     212.28
>> 256                     337.52
>> 512                     516.67
>> 1024                    701.29
>> 2048                    845.69
>> 4096                    836.45
>> 8192                    934.31
>> 16384                  1035.53
>> 32768                  1186.90
>> 65536                  1390.41
>> 131072                 1519.14
>> 262144                 1562.96
>> 524288                 1596.78
>> 1048576                1611.48
>> 2097152                1616.09
>> 4194304                1620.47
>> [n0007:03461] *** Process received signal ***
>> [n0007:03460] *** Process received signal ***
>> [n0007:03460] Signal: Segmentation fault (11)
>> [n0007:03460] Signal code: Address not mapped (1)
>> [n0007:03460] Failing at address: 0x2acac044d130
>> [n0007:03461] Signal: Segmentation fault (11)
>> [n0007:03461] Signal code: Address not mapped (1)
>> [n0007:03461] Failing at address: 0x2b8bc4121130
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 3460 on node n0007.scs00
>> exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>> <---->
>> 
>> 
>> Can anybody shed some light here? It looks like ompi is trying to open
>> the MX device no matter what. This is on a fresh build of Open MPI 1.6
>> with "--with-mx --with-openib" options. We didn't have such an issue
>> with the old GM BTL.
>> 
>> Thanks,
>> 
>> Yong Qin
> 
> --
> * Dr. Aurélien Bouteiller
> * Researcher at Innovative Computing Laboratory
> * University of Tennessee
> * 1122 Volunteer Boulevard, suite 309b
> * Knoxville, TN 37996
> * 865 974 9375







