I fixed this one, I believe - will have to look more at the loop_spawn issue later.

On Sat, Jun 9, 2012 at 3:35 PM, Eugene Loh <eugene....@oracle.com> wrote:

> On 6/9/2012 12:06 PM, Eugene Loh wrote:
>
>> With r26565:
>>     Enable orte progress threads and libevent thread support by default
>> Oracle MTT testing started showing new spawn_multiple failures.
>>
> Sorry. I meant loop_spawn.
>
> (And then, starting I think in 26582, the problem is masked behind another
> issue, "oob:ud:qp_init could not create queue pair", which is creating
> widespread problems for Cisco, IU, and Oracle MTT testing. I suppose
> that's the subject of a different e-mail thread.)
>
>> I've only seen this in 64-bit. Here are two segfaults, both from
>> Linux/x86 systems running over TCP:
>>
>> This one with GNU compilers:
>> [...]
>> parent: MPI_Comm_spawn #300 return : 0
>> [burl-ct-v20z-26:28518] *** Process received signal ***
>> [burl-ct-v20z-26:28518] Signal: Segmentation fault (11)
>> [burl-ct-v20z-26:28518] Signal code: Address not mapped (1)
>> [burl-ct-v20z-26:28518] Failing at address: (nil)
>> [burl-ct-v20z-26:28518] [ 0] /lib64/libpthread.so.0 [0x3a21c0e7c0]
>> [burl-ct-v20z-26:28518] [ 1] /lib64/libc.so.6(memcpy+0x35) [0x3a2107bde5]
>> [burl-ct-v20z-26:28518] [ 2] /workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_copy+0x58)
>> [burl-ct-v20z-26:28518] [ 3] /workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_oob_tcp.so
>> [burl-ct-v20z-26:28518] [ 4] /workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv_nb+0x314)
>> [burl-ct-v20z-26:28518] [ 5] /workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_rml_oob.so(orte_rml_oob_recv_buffer_nb+0xff)
>> [burl-ct-v20z-26:28518] [ 6] /workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_dpm_orte.so
>> [burl-ct-v20z-26:28518] [ 7] /workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/libmpi.so.0(PMPI_Comm_spawn+0x2ee)
>> [burl-ct-v20z-26:28518] [ 8] dynamic/loop_spawn [0x40120b]
>> [burl-ct-v20z-26:28518] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3a2101d994]
>> [burl-ct-v20z-26:28518] [10] dynamic/loop_spawn [0x400dd9]
>> [burl-ct-v20z-26:28518] *** End of error message ***
>>
>> This one with Oracle Studio compilers:
>> parent: MPI_Comm_spawn #0 return : 0
>> parent: MPI_Comm_spawn #20 return : 0
>> [burl-ct-x2200-12:02348] *** Process received signal ***
>> [burl-ct-x2200-12:02348] Signal: Segmentation fault (11)
>> [burl-ct-x2200-12:02348] Signal code: Address not mapped (1)
>> [burl-ct-x2200-12:02348] Failing at address: 0x10
>> [burl-ct-x2200-12:02348] [ 0] /lib64/libpthread.so.0 [0x318ac0de80]
>> [burl-ct-x2200-12:02348] [ 1] /workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_handler+0xe3)
>> [burl-ct-x2200-12:02348] [ 2] /workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/openmpi/mca_oob_tcp.so
>> [burl-ct-x2200-12:02348] [ 3] /workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/libmpi.so.0
>> [burl-ct-x2200-12:02348] [ 4] /workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/libmpi.so.0(opal_libevent2019_event_base_loop+0x7c7)
>> [burl-ct-x2200-12:02348] [ 5] /workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/libmpi.so.0
>> [burl-ct-x2200-12:02348] [ 6] /lib64/libpthread.so.0 [0x318ac06307]
>> [burl-ct-x2200-12:02348] [ 7] /lib64/libc.so.6(clone+0x6d) [0x318a0d1ded]
>> [burl-ct-x2200-12:02348] *** End of error message ***
>>
>> Sometimes, I see a hang rather than a segfault.
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel