I am unable to replicate the segfault. However, I was able to get the job to
hang. I fixed that behavior with r18044.

Perhaps you can test this again and let me know what you see. A gdb stack
trace would be more helpful.

Thanks
Ralph



On 3/31/08 5:13 AM, "Lenny Verkhovsky" <len...@voltaire.com> wrote:

> 
> 
> 
> I accidentally ran a job with a hostfile in which one of the hosts was not
> properly mounted. As a result, I got an error and a segfault.
> 
> 
> /home/USERS/lenny/OMPI_ORTE_TRUNK/bin/mpirun -np 29 -hostfile hostfile ./mpi_p01 -t lt
> bash: /home/USERS/lenny/OMPI_ORTE_TRUNK/bin/orted: No such file or directory
> --------------------------------------------------------------------------
> A daemon (pid 9753) died unexpectedly with status 127 while attempting
> to launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun was unable to start the specified application as it encountered an error.
> More information may be available above.
> --------------------------------------------------------------------------
> [witch1:09745] *** Process received signal ***
> [witch1:09745] Signal: Segmentation fault (11)
> [witch1:09745] Signal code: Address not mapped (1)
> [witch1:09745] Failing at address: 0x3c
> [witch1:09745] [ 0] /lib64/libpthread.so.0 [0x2aff223ebc10]
> [witch1:09745] [ 1] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/libopen-rte.so.0 [0x2aff21cdfe21]
> [witch1:09745] [ 2] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/openmpi/mca_rml_oob.so [0x2aff22c398f1]
> [witch1:09745] [ 3] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/openmpi/mca_oob_tcp.so [0x2aff22d426ee]
> [witch1:09745] [ 4] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/openmpi/mca_oob_tcp.so [0x2aff22d433fb]
> [witch1:09745] [ 5] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/openmpi/mca_oob_tcp.so [0x2aff22d4485b]
> [witch1:09745] [ 6] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/libopen-pal.so.0 [0x2aff21e1242b]
> [witch1:09745] [ 7] /home/USERS/lenny/OMPI_ORTE_TRUNK/bin/mpirun [0x403203]
> [witch1:09745] [ 8] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/libopen-pal.so.0 [0x2aff21e1242b]
> [witch1:09745] [ 9] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/libopen-pal.so.0(opal_progress+0x8b) [0x2aff21e060cb]
> [witch1:09745] [10] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/libopen-rte.so.0(orte_trigger_event+0x20) [0x2aff21cc6940]
> [witch1:09745] [11] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/libopen-rte.so.0(orte_wakeup+0x2d) [0x2aff21cc776d]
> [witch1:09745] [12] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/openmpi/mca_plm_rsh.so [0x2aff22b34756]
> [witch1:09745] [13] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/libopen-rte.so.0 [0x2aff21cc6ea7]
> [witch1:09745] [14] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/libopen-pal.so.0 [0x2aff21e1242b]
> [witch1:09745] [15] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/libopen-pal.so.0(opal_progress+0x8b) [0x2aff21e060cb]
> [witch1:09745] [16] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/libopen-rte.so.0(orte_plm_base_daemon_callback+0xad) [0x2aff21ce068d]
> [witch1:09745] [17] /home/USERS/lenny/OMPI_ORTE_TRUNK//lib/openmpi/mca_plm_rsh.so [0x2aff22b34e5e]
> [witch1:09745] [18] /home/USERS/lenny/OMPI_ORTE_TRUNK/bin/mpirun [0x402e13]
> [witch1:09745] [19] /home/USERS/lenny/OMPI_ORTE_TRUNK/bin/mpirun [0x402873]
> [witch1:09745] [20] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2aff22512154]
> [witch1:09745] [21] /home/USERS/lenny/OMPI_ORTE_TRUNK/bin/mpirun [0x4027c9]
> [witch1:09745] *** End of error message ***
> Segmentation fault (core dumped)
> 
> 
> Best Regards,
> Lenny.
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

