[OMPI devel] How to debug segv
Hi,

I'm getting a segv error off my build of the trunk. I know that my BTL module is responsible ("-mca btl self,tcp" works, "-mca btl self,mosix" fails). Smaller/simpler test applications pass, but NPB doesn't. Can anyone suggest how to proceed with debugging this? My attempts so far include some debug printouts and GDB, whose output appears below. What can I do next?

I'd appreciate any input,
Alex

alex@singularity:~/huji/benchmarks/mpi/npb$ mpirun --debug-daemons -d -n 4 xterm -l -e gdb ft.S.4
[singularity:07557] procdir: /tmp/openmpi-sessions-alex@singularity_0/44228/0/0
[singularity:07557] jobdir: /tmp/openmpi-sessions-alex@singularity_0/44228/0
[singularity:07557] top: openmpi-sessions-alex@singularity_0
[singularity:07557] tmp: /tmp
[singularity:07557] [[44228,0],0] hostfile: checking hostfile /home/alex/huji/ompi/etc/openmpi-default-hostfile for nodes
[singularity:07557] [[44228,0],0] hostfile: filtering nodes through hostfile /home/alex/huji/ompi/etc/openmpi-default-hostfile
[singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
[singularity:07557] [[44228,0],0] orted_cmd: received add_local_procs
  MPIR_being_debugged = 0
  MPIR_debug_state = 1
  MPIR_partial_attach_ok = 1
  MPIR_i_am_starter = 0
  MPIR_forward_output = 0
  MPIR_proctable_size = 4
  MPIR_proctable:
    (i, host, exe, pid) = (0, singularity, /usr/bin/xterm, 7558)
    (i, host, exe, pid) = (1, singularity, /usr/bin/xterm, 7559)
    (i, host, exe, pid) = (2, singularity, /usr/bin/xterm, 7560)
    (i, host, exe, pid) = (3, singularity, /usr/bin/xterm, 7561)
  MPIR_executable_path: NULL
  MPIR_server_arguments: NULL
[singularity:07592] procdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1/3
[singularity:07592] jobdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1
[singularity:07592] top: openmpi-sessions-alex@singularity_0
[singularity:07592] tmp: /tmp
[singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
[singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from local proc [[44228,1],3]
[singularity:07592] [[44228,1],3] decode:nidmap decoding nodemap
[singularity:07592] [[44228,1],3] decode:nidmap decoding 1 nodes
[singularity:07592] [[44228,1],3] node[0].name singularity daemon 0
[singularity:07594] procdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1/1
[singularity:07594] jobdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1
[singularity:07594] top: openmpi-sessions-alex@singularity_0
[singularity:07594] tmp: /tmp
[singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
[singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from local proc [[44228,1],1]
[singularity:07594] [[44228,1],1] decode:nidmap decoding nodemap
[singularity:07594] [[44228,1],1] decode:nidmap decoding 1 nodes
[singularity:07594] [[44228,1],1] node[0].name singularity daemon 0
[singularity:07596] procdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1/0
[singularity:07596] jobdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1
[singularity:07596] top: openmpi-sessions-alex@singularity_0
[singularity:07596] tmp: /tmp
[singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
[singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from local proc [[44228,1],0]
[singularity:07596] [[44228,1],0] decode:nidmap decoding nodemap
[singularity:07596] [[44228,1],0] decode:nidmap decoding 1 nodes
[singularity:07596] [[44228,1],0] node[0].name singularity daemon 0
[singularity:07598] procdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1/2
[singularity:07598] jobdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1
[singularity:07598] top: openmpi-sessions-alex@singularity_0
[singularity:07598] tmp: /tmp
[singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
[singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from local proc [[44228,1],2]
[singularity:07598] [[44228,1],2] decode:nidmap decoding nodemap
[singularity:07598] [[44228,1],2] decode:nidmap decoding 1 nodes
[singularity:07598] [[44228,1],2] node[0].name singularity daemon 0
[singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_MESSAGE_LOCAL_PROCS
[singularity:07557] [[44228,0],0] orted_cmd: received message_local_procs
[singularity:07557] [[44228,0],0] orted:comm:message_local_procs delivering message to job [44228,1] tag 30
[singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_MESSAGE_LOCAL_PROCS
[singularity:07557] [[44228,0],0] orted_cmd: received message_local_procs
[singularity:07557] [[44228,0],0] orted:comm:message_local_procs delivering message to job [44228,1] tag 30
[singularity:07557] [[44228,0],0]:errmgr_default_hnp.c(418) updating exit status to
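One generic trick when a crash prints no symbols at all (independent of Open MPI or mosix) is to install a SIGSEGV handler that dumps a raw backtrace via glibc's backtrace facilities; the addresses can then be fed to addr2line. A minimal sketch, not anything from the OMPI tree:

/* Minimal sketch: dump a raw backtrace on SIGSEGV using glibc's
 * <execinfo.h>. backtrace_symbols_fd() writes straight to a file
 * descriptor and avoids malloc, unlike backtrace_symbols(). Build with
 * -g -rdynamic so frames carry names; otherwise use addr2line. */
#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

static void segv_handler(int sig)
{
    void *frames[64];
    int n;

    (void) sig;
    n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    _exit(1);  /* do not return into the faulting instruction */
}

/* Call this once, early - e.g. right after MPI_Init() */
static void install_segv_handler(void)
{
    struct sigaction sa;

    sa.sa_handler = segv_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGSEGV, &sa, NULL);
}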
Re: [OMPI devel] How to debug segv
Strange that your code didn't generate any symbols - is that a mosix thing? Have you tried just adding opal_output() statements in your code (so the output goes to a special diagnostic channel) to see where the segfault is occurring?

It looks like you are getting thru orte_init. You could add -mca grpcomm_base_verbose 5 to see if you are getting in/thru the modex - if so, then you are probably failing in add_procs.

On Apr 25, 2012, at 5:05 AM, Alex Margolin wrote:

> Hi,
>
> I'm getting a segv error off my build of the trunk. I know that my BTL module
> is responsible ("-mca btl self,tcp" works, "-mca btl self,mosix" fails).
> Smaller/simpler test applications pass, but NPB doesn't.
>
> [remainder of the quoted message and debug output trimmed]
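The instrumentation Ralph describes is just opal_output() calls bracketing the suspect path. A minimal sketch, assuming the crash is somewhere in the (hypothetical) mosix BTL's add_procs; opal_output() itself is the real API from opal/util/output.h, and stream 0 is the default diagnostic stream:

/* Sketch: bisect a crash by bracketing suspect steps with opal_output().
 * The function below is a stand-in for wherever the mosix BTL's
 * add_procs logic actually lives. */
#include "opal/util/output.h"

static int mosix_add_procs_traced(size_t nprocs)
{
    opal_output(0, "mosix btl: add_procs enter, nprocs=%lu",
                (unsigned long) nprocs);

    /* ... step 1: endpoint allocation ... */
    opal_output(0, "mosix btl: endpoints allocated");

    /* ... step 2: modex lookups ... */
    opal_output(0, "mosix btl: modex data fetched");

    opal_output(0, "mosix btl: add_procs done");
    return 0; /* OMPI_SUCCESS */
}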
Re: [OMPI devel] How to debug segv
On 04/25/2012 02:57 PM, Ralph Castain wrote:
> Strange that your code didn't generate any symbols - is that a mosix thing?
> Have you tried just adding opal_output() statements in your code (so the
> output goes to a special diagnostic channel) to see where the segfault is
> occurring?
>
> It looks like you are getting thru orte_init. You could add -mca
> grpcomm_base_verbose 5 to see if you are getting in/thru the modex - if so,
> then you are probably failing in add_procs.

I guess the symbols are a mosix thing, but it should still show some sort of segmentation-fault trace, no? Maybe only the assembly opcode... It seems that the SEGV is detected, rather than caught. This may also be related to mosix - I'll check it with the mosix developer.

I added the parameter you suggested and appended the output. The modex seems to be working, because I use it to exchange the IP and PID, and as you can see at the bottom these are received OK. I'll try debug printouts specifically in add_procs. Thanks for the advice!

alex@singularity:~/huji/benchmarks/mpi/npb$ mpirun -mca grpcomm_base_verbose 5 -mca btl self,mosix -mca btl_base_verbose 100 -n 4 ft.S.4
[singularity:08915] mca:base:select:(grpcomm) Querying component [bad]
[singularity:08915] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
[singularity:08915] mca:base:select:(grpcomm) Selected component [bad]
[singularity:08915] [[37778,0],0] grpcomm:base:receive start comm
[singularity:08915] [[37778,0],0] grpcomm:bad:xcast sent to job [37778,0] tag 1
[singularity:08915] [[37778,0],0] grpcomm:xcast:recv:send_relay
[singularity:08915] [[37778,0],0] grpcomm:base:xcast updating nidmap
[singularity:08915] [[37778,0],0] orte:daemon:send_relay - recipient list is empty!
[singularity:08916] mca:base:select:(grpcomm) Querying component [bad]
[singularity:08916] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
[singularity:08916] mca:base:select:(grpcomm) Selected component [bad]
[singularity:08916] [[37778,1],0] grpcomm:base:receive start comm
[singularity:08919] mca:base:select:(grpcomm) Querying component [bad]
[singularity:08919] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
[singularity:08919] mca:base:select:(grpcomm) Selected component [bad]
[singularity:08919] [[37778,1],2] grpcomm:base:receive start comm
[singularity:08917] mca:base:select:(grpcomm) Querying component [bad]
[singularity:08917] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
[singularity:08917] mca:base:select:(grpcomm) Selected component [bad]
[singularity:08917] [[37778,1],1] grpcomm:base:receive start comm
[singularity:08921] mca:base:select:(grpcomm) Querying component [bad]
[singularity:08921] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
[singularity:08921] mca:base:select:(grpcomm) Selected component [bad]
[singularity:08921] [[37778,1],3] grpcomm:base:receive start comm
[singularity:08916] [[37778,1],0] grpcomm:set_proc_attr: setting attribute MPI_THREAD_LEVEL data size 1
[singularity:08916] [[37778,1],0] grpcomm:set_proc_attr: setting attribute OMPI_ARCH data size 11
[singularity:08919] [[37778,1],2] grpcomm:set_proc_attr: setting attribute MPI_THREAD_LEVEL data size 1
[singularity:08919] [[37778,1],2] grpcomm:set_proc_attr: setting attribute OMPI_ARCH data size 11
[singularity:08917] [[37778,1],1] grpcomm:set_proc_attr: setting attribute MPI_THREAD_LEVEL data size 1
[singularity:08917] [[37778,1],1] grpcomm:set_proc_attr: setting attribute OMPI_ARCH data size 11
[singularity:08921] [[37778,1],3] grpcomm:set_proc_attr: setting attribute MPI_THREAD_LEVEL data size 1
[singularity:08921] [[37778,1],3] grpcomm:set_proc_attr: setting attribute OMPI_ARCH data size 11
[singularity:08916] mca: base: components_open: Looking for btl components
[singularity:08916] mca: base: components_open: opening btl components
[singularity:08916] mca: base: components_open: found loaded component mosix
[singularity:08916] mca: base: components_open: component mosix register function successful
[singularity:08916] mca: base: components_open: component mosix open function successful
[singularity:08916] mca: base: components_open: found loaded component self
[singularity:08916] mca: base: components_open: component self has no register function
[singularity:08916] mca: base: components_open: component self open function successful
[singularity:08919] mca: base: components_open: Looking for btl components
[singularity:08917] mca: base: components_open: Looking for btl components
[singularity:08919] mca: base: components_open: opening btl components
[singularity:08919] mca: base: components_open: found loaded component mosix
[singularity:08919] mca: base: components_open: component mosix register function successful
[singularity:08919] mca: base: components_open: component mosix open function successful
[singularity:08919] mca: base: components_open: found loaded component self
[sin
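For context, the modex exchange Alex is relying on (publish the IP/PID during init, read it back in add_procs) follows the trunk-era ompi_modex_send()/ompi_modex_recv() pattern. A sketch under that assumption; the mosix_modex_t layout and the component symbol are illustrative names, not Alex's actual code:

/* Sketch of the trunk-era modex pattern
 * (ompi/runtime/ompi_module_exchange.h). mosix_modex_t and
 * mca_btl_mosix_component are illustrative stand-ins. */
#include <stdint.h>
#include "ompi/constants.h"
#include "ompi/proc/proc.h"
#include "ompi/runtime/ompi_module_exchange.h"
#include "ompi/mca/btl/btl.h"

extern mca_btl_base_component_2_0_0_t mca_btl_mosix_component;

typedef struct {
    uint32_t ip_addr;  /* address peers should connect to */
    uint32_t pid;      /* pid used by the transport */
} mosix_modex_t;

/* During component init: publish our contact info for all peers */
static int mosix_publish_modex(uint32_t ip, uint32_t pid)
{
    mosix_modex_t me;
    me.ip_addr = ip;
    me.pid = pid;
    return ompi_modex_send(&mca_btl_mosix_component.btl_version,
                           &me, sizeof(me));
}

/* During add_procs: fetch what one peer published */
static int mosix_lookup_modex(ompi_proc_t *proc, mosix_modex_t **peer)
{
    size_t size;
    int rc = ompi_modex_recv(&mca_btl_mosix_component.btl_version,
                             proc, (void **) peer, &size);
    if (OMPI_SUCCESS == rc && sizeof(mosix_modex_t) != size) {
        rc = OMPI_ERROR;  /* peer published an unexpected blob */
    }
    return rc;
}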
Re: [OMPI devel] How to debug segv
Another thing to try is to load up the core file in gdb and see if that gives you a valid stack trace of where exactly the segv occurred.

On Apr 25, 2012, at 9:30 AM, Alex Margolin wrote:

> I guess the symbols are a mosix thing, but it should still show some sort of
> segmentation-fault trace, no? Maybe only the assembly opcode... It seems that
> the SEGV is detected, rather than caught.
>
> [remainder of the quoted message and debug output trimmed]
Re: [OMPI devel] How to debug segv
Alex,

You got the banner of the FT benchmark, so I guess at least rank 0 successfully completed the MPI_Init call. This is a hint that you should investigate more into the point-to-point logic of your mosix BTL.

  george.

On Apr 25, 2012, at 09:30, Alex Margolin wrote:

>  NAS Parallel Benchmarks 3.3 -- FT Benchmark
>
>  No input file inputft.data. Using compiled defaults
>  Size                :  64x 64x 64
>  Iterations          :   6
>  Number of processes :   4
>  Processor array     :  1x 4
>  Layout type         :  1D
[OMPI devel] Fwd: GNU autoconf 2.69 released [stable]
There are a number of new Autoconf macros that would be useful for OMPI's Fortran configury. Meaning: we have klugearounds in our existing configury, but the new AC 2.69 macros are Better.

How would people feel about upgrading the autoconf requirement on the trunk to AC 2.69?

(Terry: please add this to the agenda for next Tuesday; thanks)

Begin forwarded message:

> From: Eric Blake
> Subject: GNU autoconf 2.69 released [stable]
> Date: April 24, 2012 11:32:32 PM EDT
> To: info-...@gnu.org, Autoconf
> Cc: autotools-annou...@gnu.org, "bug-autoc...@gnu.org"
> Reply-To: Autoconf
>
> The GNU Autoconf team is pleased to announce the stable release of
> Autoconf 2.69. Autoconf is an extensible package of M4 macros that
> produce shell scripts to automatically configure software source code
> packages. These scripts can adapt the packages to many kinds of
> UNIX-like systems without manual user intervention. Autoconf creates a
> configuration script for a package from a template file that lists the
> operating system features that the package can use, in the form of M4
> macro calls.
>
> Among other improvements, this release fixes a couple of regressions
> introduced in previous releases, greatly enhances Fortran support, adds
> Go support, and updates the documentation license. It also requires
> that the developer have perl 5.6 or newer when running autoconf (although
> generated configure scripts remain independent of perl, as always). See
> a more complete list below.
>
> Here are the compressed sources:
> http://ftpmirror.gnu.org/autoconf/autoconf-2.69.tar.gz (1.9MB)
> http://ftpmirror.gnu.org/autoconf/autoconf-2.69.tar.xz (1.2MB)
>
> Here are the GPG detached signatures[*]:
> http://ftpmirror.gnu.org/autoconf/autoconf-2.69.tar.gz.sig
> http://ftpmirror.gnu.org/autoconf/autoconf-2.69.tar.xz.sig
>
> Use a mirror for higher download bandwidth:
> http://www.gnu.org/order/ftp.html
>
> [*] Use a .sig file to verify that the corresponding file (without the
> .sig suffix) is intact. First, be sure to download both the .sig file
> and the corresponding tarball. Then, run a command like this:
>
>   gpg --verify autoconf-2.69.tar.gz.sig
>
> If that command fails because you don't have the required public key,
> then run this command to import it:
>
>   gpg --keyserver keys.gnupg.net --recv-keys A7A16B4A2527436A
>
> and rerun the 'gpg --verify' command.
>
> This release was bootstrapped with the following tools:
>   Automake 1.11.1
>
> NEWS
>
> * Noteworthy changes in release 2.69 (2012-04-24) [stable]
>
> ** Autoconf now requires perl 5.6 or better (but generated configure
>    scripts continue to run without perl).
>
> * Noteworthy changes in release 2.68b (2012-03-01) [beta]
>   Released by Eric Blake, based on git versions 2.68.*.
>
> ** Autoconf-generated configure scripts now unconditionally re-execute
>    themselves with $CONFIG_SHELL, if that's set in the environment.
>
> ** The texinfo documentation no longer specifies "front-cover" or
>    "back-cover" texts, so that it may now be included in Debian's
>    "main" section.
>
> ** Support for the Go programming language has been added. The new
>    macro AC_LANG_GO sets variables GOC and GOFLAGS.
>
> ** AS_LITERAL_IF again treats '=' as a literal. Regression introduced
>    in 2.66.
>
> ** The macro AS_EXECUTABLE_P, present since 2.50, is now documented.
>
> ** Macros
>
> - AC_PROG_LN_S and AS_LN_S now fall back on 'cp -pR' (not 'cp -p') if
>   'ln -s' does not work. This works better for symlinks to directories.
>
> - New macro AC_HEADER_CHECK_STDBOOL.
>
> - New and updated macros for Fortran support:
>
>   AC_FC_CHECK_BOUNDS to enable array bounds checking
>   AC_F77_IMPLICIT_NONE and AC_FC_IMPLICIT_NONE to disable implicit integer
>   AC_FC_MODULE_EXTENSION to compute the Fortran 90 module name extension
>   AC_FC_MODULE_FLAG for the Fortran 90 module search path flag
>   AC_FC_MODULE_OUTPUT_FLAG for the Fortran 90 module output directory flag
>   AC_FC_PP_SRCEXT for preprocessed Fortran source file extensions
>   AC_FC_PP_DEFINE for the Fortran preprocessor define flag
>
> --
> Eric Blake, on behalf of
> The GNU Autoconf team

--
Jeff Squyres
jsquy...@cisco.com
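For a feel of what the new Fortran macros buy, here is a hedged configure.ac sketch; the macro names come from the announcement above, and the FC_MODEXT/FC_MODINC/FC_MODOUT output variables follow the AC 2.69 manual, so treat those as assumptions:

# Hypothetical configure.ac exercising the new AC 2.69 Fortran macros.
AC_PREREQ([2.69])
AC_INIT([example], [1.0])
AC_PROG_FC

# Fortran 90 module file extension (e.g. "mod") -> $FC_MODEXT
AC_FC_MODULE_EXTENSION

# Flag to add a directory to the module search path -> $FC_MODINC
AC_FC_MODULE_FLAG

# Flag to set the module output directory -> $FC_MODOUT
AC_FC_MODULE_OUTPUT_FLAG

# Enable runtime array bounds checking where the compiler supports it
AC_FC_CHECK_BOUNDS

# Make .F90 sources run through the preprocessor -> $FCFLAGS_F90
AC_FC_PP_SRCEXT([F90])

AC_OUTPUT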
[OMPI devel] 1.6rc1 has been released
Note that Open MPI 1.6 is the evolution of the 1.5 series -- it is not a new branch from the SVN trunk. Hence, 1.6 is essentially a bunch of bug fixes on top of 1.5.5.

Please test: http://www.open-mpi.org/software/ompi/v1.6/

(Note that the 1.6 page is not linked to from anywhere on the OMPI site yet.)

--
Jeff Squyres
jsquy...@cisco.com
[OMPI devel] libevent socket code
Anyone object if I "#if 0" out all the socket code in libevent? We see lots of static compilation warnings because of that code, and nothing in Open MPI uses it.

-Nathan
Re: [OMPI devel] libevent socket code
Can't it be done by configuring --without-libevent-sockets or some such thing? I really hate munging the code directly, as it creates lots of support issues and makes it harder to upgrade.

On Apr 25, 2012, at 10:45 AM, Nathan Hjelm wrote:

> Anyone object if I "#if 0" out all the socket code in libevent? We see lots
> of static compilation warnings because of that code, and nothing in Open MPI
> uses it.
>
> -Nathan
Re: [OMPI devel] libevent socket code
On Apr 25, 2012, at 12:50 PM, Ralph Castain wrote:

> Can't it be done by configuring --without-libevent-sockets or some such
> thing? I really hate munging the code directly, as it creates lots of
> support issues and makes it harder to upgrade.

If there's a libevent configure option we should be using, we can probably set that to be enabled by default. Let me know.

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] libevent socket code
Let me take a look. The code in question is in evutil.c and bufferevent_sock.c. If there is no configure option, we might be able to get away with just removing those files from the Makefile.am.

-Nathan

On Wed, 25 Apr 2012, Jeff Squyres wrote:

> On Apr 25, 2012, at 12:50 PM, Ralph Castain wrote:
>
>> Can't it be done by configuring --without-libevent-sockets or some such
>> thing? I really hate munging the code directly, as it creates lots of
>> support issues and makes it harder to upgrade.
>
> If there's a libevent configure option we should be using, we can probably
> set that to be enabled by default. Let me know.
Re: [OMPI devel] How to debug segv
I guess you are right.

I started looking into the communication passing between processes, and I may have found a problem with the way I handle the "reserved" data requested at prepare_src(). I've tried to write pretty much the same as TCP (the relevant code is around "if (opal_convertor_need_buffers(convertor))"), and when I copy the buffered data to (frag+1) the program works. When I try to optimize the code by letting the segment point to the original location, I get MPI_ERR_TRUNCATE. I've printed out the data sent and received, and what I got ("[]" for sent, "<>" for received, running osu_latency) is appended below.

Question is: where is the code which is responsible for writing the reserved data?

Thanks,
Alex

Always assume opal_convertor_need_buffers - works (97 is the application data, preceded by 14 reserved bytes):

...
[65,0,0,0,1,0,0,0,1,0,0,0,89,-112,97,97,97,97,]
<65,0,0,0,1,0,0,0,1,0,0,0,89,-112,97,97,97,97,>
[65,0,0,0,0,0,0,0,1,0,0,0,90,-112,97,97,97,97,]
<65,0,0,0,0,0,0,0,1,0,0,0,90,-112,97,97,97,97,>
[65,0,0,0,1,0,0,0,1,0,0,0,90,-112,97,97,97,97,]
<65,0,0,0,1,0,0,0,1,0,0,0,90,-112,97,97,97,97,>
[65,0,0,0,0,0,0,0,1,0,0,0,91,-112,97,97,97,97,]
<65,0,0,0,0,0,0,0,1,0,0,0,91,-112,97,97,97,97,>
[65,0,0,0,1,0,0,0,1,0,0,0,91,-112,97,97,97,97,]
...

Detect when not opal_convertor_need_buffers - fails:

...
[65,0,0,0,0,0,0,0,1,0,0,0,-15,85,]
<65,0,0,0,0,0,0,0,1,0,0,0,-15,85,97,>
[65,0,0,0,1,0,0,0,1,0,0,0,-15,85,]
<65,0,0,0,1,0,0,0,1,0,0,0,-15,85,97,>
[65,0,0,0,0,0,0,0,1,0,0,0,-14,85,]
<65,0,0,0,0,0,0,0,1,0,0,0,-14,85,97,>
[65,0,0,0,1,0,0,0,1,0,0,0,-14,85,]
<65,0,0,0,1,0,0,0,1,0,0,0,-14,85,97,>
[65,0,0,0,1,0,0,0,-16,-1,-1,-1,-13,85,]
1 453.26
[65,0,0,0,0,0,0,0,-16,-1,-1,-1,-13,85,]
<65,0,0,0,0,0,0,0,-16,-1,-1,-1,-13,85,97,>
<65,0,0,0,1,0,0,0,-16,-1,-1,-1,-13,85,97,>
[singularity:13509] *** An error occurred in MPI_Barrier
[singularity:13509] *** reported by process [2239889409,140733193388033]
[singularity:13509] *** on communicator MPI_COMM_WORLD
[singularity:13509] *** MPI_ERR_TRUNCATE: message truncated
[singularity:13509] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[singularity:13509] ***    and potentially your MPI job)
[singularity:13507] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[singularity:13507] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
alex@singularity:~/huji/benchmarks/mpi/osu-micro-benchmarks-3.5.2$

On 04/25/2012 04:35 PM, George Bosilca wrote:
> You got the banner of the FT benchmark, so I guess at least rank 0
> successfully completed the MPI_Init call. This is a hint that you should
> investigate more into the point-to-point logic of your mosix BTL.
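For comparison, the trunk-era TCP BTL handles the two cases in prepare_src() roughly as below. This is a sketch against the old BTL descriptor interface; my_frag_t and allocate_frag() are invented names, while the convertor calls and segment fields are the real API. The crucial detail in the optimized branch is that the fragment ends up with two segments, and the send path then has to transmit both:

/* Condensed sketch of the prepare_src() pattern from the trunk-era
 * TCP BTL. my_frag_t and allocate_frag() are illustrative stand-ins. */
#include <sys/uio.h>
#include "opal/datatype/opal_convertor.h"
#include "ompi/mca/btl/btl.h"

typedef struct {
    mca_btl_base_descriptor_t base;
    mca_btl_base_segment_t    segments[2];
    /* payload (header + optionally packed data) follows in memory */
} my_frag_t;

extern my_frag_t *allocate_frag(size_t payload);  /* hypothetical */

static mca_btl_base_descriptor_t *sketch_prepare_src(
    struct opal_convertor_t *convertor,
    size_t reserve, size_t *size, uint32_t flags)
{
    size_t max_data = *size;
    struct iovec iov;
    uint32_t iov_count = 1;
    my_frag_t *frag = allocate_frag(reserve + max_data);

    /* Segment 0 always covers the 'reserve' bytes: the PML writes its
     * header at exactly the address handed back here. */
    frag->segments[0].seg_addr.pval = (void *)(frag + 1);
    frag->segments[0].seg_len = reserve;

    if (opal_convertor_need_buffers(convertor)) {
        /* Non-contiguous user data: pack it directly behind the header
         * so one contiguous region carries header + payload. */
        iov.iov_base = (char *)(frag + 1) + reserve;
        iov.iov_len = max_data;
        if (opal_convertor_pack(convertor, &iov, &iov_count, &max_data) < 0)
            return NULL;
        frag->segments[0].seg_len += max_data;
        frag->base.des_src_cnt = 1;
    } else {
        /* Contiguous user data: ask the convertor for a pointer into
         * the user buffer (iov_base = NULL) and expose it as a second
         * segment. The send path must then transmit BOTH segments, or
         * the receiver sees a truncated message. */
        iov.iov_base = NULL;
        iov.iov_len = max_data;
        if (opal_convertor_pack(convertor, &iov, &iov_count, &max_data) < 0)
            return NULL;
        frag->segments[1].seg_addr.pval = iov.iov_base;
        frag->segments[1].seg_len = max_data;
        frag->base.des_src_cnt = 2;
    }

    *size = max_data;
    frag->base.des_src = frag->segments;
    frag->base.des_flags = flags;
    return &frag->base;
}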
Re: [OMPI devel] How to debug segv
Alex,

+1 vote for the core file. It is a good starting point.

* If you can't (for some reason) generate the core file, you may drop a while(1) loop somewhere in the init code and attach gdb later.
* If you are looking for a more user-friendly experience, you may try Allinea DDT (they have a 30-day trial version).

Regards,
Pasha.

> Another thing to try is to load up the core file in gdb and see if that gives
> you a valid stack trace of where exactly the segv occurred.
>
> [remainder of the quoted thread trimmed]
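Pasha's while(1) trick, spelled out as a sketch (the flag name and placement are illustrative):

/* Sketch of the "park in a loop, attach gdb later" trick. Drop this
 * near the start of the suspect init code. */
#include <stdio.h>
#include <unistd.h>

static volatile int gdb_holdit = 1;  /* flipped from the debugger */

static void wait_for_debugger(void)
{
    printf("pid %d waiting for debugger attach...\n", (int) getpid());
    fflush(stdout);
    while (gdb_holdit) {
        sleep(1);
    }
    /* From another terminal:
     *   gdb -p <pid>
     *   (gdb) set var gdb_holdit = 0
     *   (gdb) continue
     */
}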
Re: [OMPI devel] How to debug segv
On Apr 25, 2012, at 13:59, Alex Margolin wrote:

> I started looking into the communication passing between processes, and I may
> have found a problem with the way I handle the "reserved" data requested at
> prepare_src()... When I try to optimize the code by letting the segment point
> to the original location, I get MPI_ERR_TRUNCATE.
>
> Question is: where is the code which is responsible for writing the reserved
> data?

It is the PML headers. Based on the error you reported, OMPI is complaining about truncated data on an MPI_Barrier... that's quite bad, as the barrier is one of the few operations that do not manipulate any data. I guess the PML headers are not located at the expected displacement in the fragment, so the PML is using wrong values.

  george.

> [remainder of the quoted message and data dumps trimmed]
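George's point can be made concrete: the 14 "reserved" bytes in Alex's dumps are the size of an ob1 PML match header, and the leading 65 (0x41) in every dump is its hdr_type byte (the -16,-1,-1,-1 runs decode to the negative tags that collectives use internally). A rough sketch of the layout follows; see ompi/mca/pml/ob1/pml_ob1_hdr.h for the real definition, and treat the exact types here as approximate:

/* Approximate shape of the trunk-era ob1 match header. The PML writes
 * this into the 'reserve' area that segment 0 of the fragment points
 * at; if the BTL moves or shifts that area, the receiving PML decodes
 * garbage offsets and reports errors like MPI_ERR_TRUNCATE. */
#include <stdint.h>

typedef struct {
    uint8_t hdr_type;    /* e.g. 0x41 = match: the leading 65 in the dumps */
    uint8_t hdr_flags;
} sketch_common_hdr_t;

typedef struct {
    sketch_common_hdr_t hdr_common;
    uint16_t hdr_ctx;    /* communicator index */
    int32_t  hdr_src;    /* source rank */
    int32_t  hdr_tag;    /* MPI tag (negative for internal collectives) */
    uint16_t hdr_seq;    /* message sequence number */
} sketch_match_hdr_t;    /* 14 bytes when packed, matching the dumps */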