[OMPI devel] How to debug segv

2012-04-25 Thread Alex Margolin

Hi,

I'm getting a segv from my build of the trunk. I know that my BTL 
module is responsible ("-mca btl self,tcp" works, "-mca btl self,mosix" 
fails). Smaller/simpler test applications pass, NPB doesn't. Can anyone 
suggest how to proceed with debugging this? My attempts so far include 
some debug printouts and the GDB session that appears below... What can 
I do next?


I'll appreciate any input,
Alex

alex@singularity:~/huji/benchmarks/mpi/npb$ mpirun --debug-daemons -d -n 
4 xterm -l -e gdb ft.S.4
[singularity:07557] procdir: 
/tmp/openmpi-sessions-alex@singularity_0/44228/0/0

[singularity:07557] jobdir: /tmp/openmpi-sessions-alex@singularity_0/44228/0
[singularity:07557] top: openmpi-sessions-alex@singularity_0
[singularity:07557] tmp: /tmp
[singularity:07557] [[44228,0],0] hostfile: checking hostfile 
/home/alex/huji/ompi/etc/openmpi-default-hostfile for nodes
[singularity:07557] [[44228,0],0] hostfile: filtering nodes through 
hostfile /home/alex/huji/ompi/etc/openmpi-default-hostfile
[singularity:07557] [[44228,0],0] orted:comm:process_commands() 
Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS

[singularity:07557] [[44228,0],0] orted_cmd: received add_local_procs
  MPIR_being_debugged = 0
  MPIR_debug_state = 1
  MPIR_partial_attach_ok = 1
  MPIR_i_am_starter = 0
  MPIR_forward_output = 0
  MPIR_proctable_size = 4
  MPIR_proctable:
(i, host, exe, pid) = (0, singularity, /usr/bin/xterm, 7558)
(i, host, exe, pid) = (1, singularity, /usr/bin/xterm, 7559)
(i, host, exe, pid) = (2, singularity, /usr/bin/xterm, 7560)
(i, host, exe, pid) = (3, singularity, /usr/bin/xterm, 7561)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
[singularity:07592] procdir: 
/tmp/openmpi-sessions-alex@singularity_0/44228/1/3

[singularity:07592] jobdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1
[singularity:07592] top: openmpi-sessions-alex@singularity_0
[singularity:07592] tmp: /tmp
[singularity:07557] [[44228,0],0] orted:comm:process_commands() 
Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
[singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from 
local proc [[44228,1],3]

[singularity:07592] [[44228,1],3] decode:nidmap decoding nodemap
[singularity:07592] [[44228,1],3] decode:nidmap decoding 1 nodes
[singularity:07592] [[44228,1],3] node[0].name singularity daemon 0
[singularity:07594] procdir: 
/tmp/openmpi-sessions-alex@singularity_0/44228/1/1

[singularity:07594] jobdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1
[singularity:07594] top: openmpi-sessions-alex@singularity_0
[singularity:07594] tmp: /tmp
[singularity:07557] [[44228,0],0] orted:comm:process_commands() 
Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
[singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from 
local proc [[44228,1],1]

[singularity:07594] [[44228,1],1] decode:nidmap decoding nodemap
[singularity:07594] [[44228,1],1] decode:nidmap decoding 1 nodes
[singularity:07594] [[44228,1],1] node[0].name singularity daemon 0
[singularity:07596] procdir: 
/tmp/openmpi-sessions-alex@singularity_0/44228/1/0

[singularity:07596] jobdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1
[singularity:07596] top: openmpi-sessions-alex@singularity_0
[singularity:07596] tmp: /tmp
[singularity:07557] [[44228,0],0] orted:comm:process_commands() 
Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
[singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from 
local proc [[44228,1],0]

[singularity:07596] [[44228,1],0] decode:nidmap decoding nodemap
[singularity:07596] [[44228,1],0] decode:nidmap decoding 1 nodes
[singularity:07596] [[44228,1],0] node[0].name singularity daemon 0
[singularity:07598] procdir: 
/tmp/openmpi-sessions-alex@singularity_0/44228/1/2

[singularity:07598] jobdir: /tmp/openmpi-sessions-alex@singularity_0/44228/1
[singularity:07598] top: openmpi-sessions-alex@singularity_0
[singularity:07598] tmp: /tmp
[singularity:07557] [[44228,0],0] orted:comm:process_commands() 
Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
[singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from 
local proc [[44228,1],2]

[singularity:07598] [[44228,1],2] decode:nidmap decoding nodemap
[singularity:07598] [[44228,1],2] decode:nidmap decoding 1 nodes
[singularity:07598] [[44228,1],2] node[0].name singularity daemon 0
[singularity:07557] [[44228,0],0] orted:comm:process_commands() 
Processing Command: ORTE_DAEMON_MESSAGE_LOCAL_PROCS

[singularity:07557] [[44228,0],0] orted_cmd: received message_local_procs
[singularity:07557] [[44228,0],0] orted:comm:message_local_procs 
delivering message to job [44228,1] tag 30
[singularity:07557] [[44228,0],0] orted:comm:process_commands() 
Processing Command: ORTE_DAEMON_MESSAGE_LOCAL_PROCS

[singularity:07557] [[44228,0],0] orted_cmd: received message_local_procs
[singularity:07557] [[44228,0],0] orted:comm:message_local_procs 
delivering message to job [44228,1] tag 30
[singularity:07557] [[44228,0],0]:errmgr_default_hnp.c(418) updating 
exit status to 

Re: [OMPI devel] How to debug segv

2012-04-25 Thread Ralph Castain
Strange that your code didn't generate any symbols - is that a mosix thing? 
Have you tried just adding opal_output (so it goes to a special diagnostic 
output channel) statements in your code to see where the segfault is occurring?

It looks like you are getting thru orte_init. You could add -mca 
grpcomm_base_verbose 5 to see if you are getting in/thru the modex - if so, 
then you are probably failing in add_procs.
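
For reference, a minimal sketch of that kind of instrumentation, assuming a 
hypothetical helper inside the mosix BTL (the function name and messages are 
only illustrative):

#include "opal/util/output.h"

/* Bracket a suspect code path with opal_output() calls.  Stream id 0 is the
 * default diagnostic stream (stderr), so the messages still appear even if
 * the process segfaults shortly afterwards. */
static void mosix_btl_trace_example(int rank, size_t nprocs)
{
    opal_output(0, "mosix btl: add_procs entered (rank %d, %lu procs)",
                rank, (unsigned long) nprocs);

    /* ... the code under suspicion goes here ... */

    opal_output(0, "mosix btl: add_procs about to return");
}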



Re: [OMPI devel] How to debug segv

2012-04-25 Thread Alex Margolin

On 04/25/2012 02:57 PM, Ralph Castain wrote:

Strange that your code didn't generate any symbols - is that a mosix thing? 
Have you tried just adding opal_output (so it goes to a special diagnostic 
output channel) statements in your code to see where the segfault is occurring?

It looks like you are getting thru orte_init. You could add -mca 
grpcomm_base_verbose 5 to see if you are getting in/thru the modex - if so, 
then you are probably failing in add_procs.

I guess the missing symbols are a mosix thing, but it should still show some 
sort of segmentation fault trace, no? Maybe only the assembly opcode... 
It seems that the SEGV is detected rather than caught. This may also be 
related to mosix - I'll check it with the mosix developer.


I added the parameter you suggested and appended the output. Modex seems 
to be working because I use it to exchange the IP and PID, and as you 
can see at the bottom these are received OK. I'll try debug printouts 
specifically in add_procs. Thanks for the advice!


alex@singularity:~/huji/benchmarks/mpi/npb$ mpirun -mca 
grpcomm_base_verbose 5 -mca btl self,mosix -mca btl_base_verbose 100 -n 
4 ft.S.4

[singularity:08915] mca:base:select:(grpcomm) Querying component [bad]
[singularity:08915] mca:base:select:(grpcomm) Query of component [bad] 
set priority to 10

[singularity:08915] mca:base:select:(grpcomm) Selected component [bad]
[singularity:08915] [[37778,0],0] grpcomm:base:receive start comm
[singularity:08915] [[37778,0],0] grpcomm:bad:xcast sent to job 
[37778,0] tag 1

[singularity:08915] [[37778,0],0] grpcomm:xcast:recv:send_relay
[singularity:08915] [[37778,0],0] grpcomm:base:xcast updating nidmap
[singularity:08915] [[37778,0],0] orte:daemon:send_relay - recipient 
list is empty!

[singularity:08916] mca:base:select:(grpcomm) Querying component [bad]
[singularity:08916] mca:base:select:(grpcomm) Query of component [bad] 
set priority to 10

[singularity:08916] mca:base:select:(grpcomm) Selected component [bad]
[singularity:08916] [[37778,1],0] grpcomm:base:receive start comm
[singularity:08919] mca:base:select:(grpcomm) Querying component [bad]
[singularity:08919] mca:base:select:(grpcomm) Query of component [bad] 
set priority to 10

[singularity:08919] mca:base:select:(grpcomm) Selected component [bad]
[singularity:08919] [[37778,1],2] grpcomm:base:receive start comm
[singularity:08917] mca:base:select:(grpcomm) Querying component [bad]
[singularity:08917] mca:base:select:(grpcomm) Query of component [bad] 
set priority to 10

[singularity:08917] mca:base:select:(grpcomm) Selected component [bad]
[singularity:08917] [[37778,1],1] grpcomm:base:receive start comm
[singularity:08921] mca:base:select:(grpcomm) Querying component [bad]
[singularity:08921] mca:base:select:(grpcomm) Query of component [bad] 
set priority to 10

[singularity:08921] mca:base:select:(grpcomm) Selected component [bad]
[singularity:08921] [[37778,1],3] grpcomm:base:receive start comm
[singularity:08916] [[37778,1],0] grpcomm:set_proc_attr: setting 
attribute MPI_THREAD_LEVEL data size 1
[singularity:08916] [[37778,1],0] grpcomm:set_proc_attr: setting 
attribute OMPI_ARCH data size 11
[singularity:08919] [[37778,1],2] grpcomm:set_proc_attr: setting 
attribute MPI_THREAD_LEVEL data size 1
[singularity:08919] [[37778,1],2] grpcomm:set_proc_attr: setting 
attribute OMPI_ARCH data size 11
[singularity:08917] [[37778,1],1] grpcomm:set_proc_attr: setting 
attribute MPI_THREAD_LEVEL data size 1
[singularity:08917] [[37778,1],1] grpcomm:set_proc_attr: setting 
attribute OMPI_ARCH data size 11
[singularity:08921] [[37778,1],3] grpcomm:set_proc_attr: setting 
attribute MPI_THREAD_LEVEL data size 1
[singularity:08921] [[37778,1],3] grpcomm:set_proc_attr: setting 
attribute OMPI_ARCH data size 11

[singularity:08916] mca: base: components_open: Looking for btl components
[singularity:08916] mca: base: components_open: opening btl components
[singularity:08916] mca: base: components_open: found loaded component mosix
[singularity:08916] mca: base: components_open: component mosix register 
function successful
[singularity:08916] mca: base: components_open: component mosix open 
function successful

[singularity:08916] mca: base: components_open: found loaded component self
[singularity:08916] mca: base: components_open: component self has no 
register function
[singularity:08916] mca: base: components_open: component self open 
function successful

[singularity:08919] mca: base: components_open: Looking for btl components
[singularity:08917] mca: base: components_open: Looking for btl components
[singularity:08919] mca: base: components_open: opening btl components
[singularity:08919] mca: base: components_open: found loaded component mosix
[singularity:08919] mca: base: components_open: component mosix register 
function successful
[singularity:08919] mca: base: components_open: component mosix open 
function successful

[singularity:08919] mca: base: components_open: found loaded component self
[sin

Re: [OMPI devel] How to debug segv

2012-04-25 Thread Jeffrey Squyres
Another thing to try is to load up the core file in gdb and see if that gives 
you a valid stack trace of where exactly the segv occurred.
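
The usual recipe is "ulimit -c unlimited" before mpirun, then "gdb ft.S.4 core" 
and "bt". If MOSIX makes it hard to collect a core at all, an alternative 
debugging aid (a sketch only, not a substitute for the core file) is a 
debug-only SIGSEGV handler that prints a glibc backtrace before aborting:

#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

/* Debug-only: dump a raw backtrace to stderr on SIGSEGV, then re-raise so the
 * process still aborts (and can still drop a core if limits allow it). */
static void segv_backtrace_handler(int sig)
{
    void *frames[64];
    int depth = backtrace(frames, 64);

    backtrace_symbols_fd(frames, depth, STDERR_FILENO);
    signal(sig, SIG_DFL);
    raise(sig);
}

static void install_segv_handler(void)
{
    struct sigaction sa;

    sa.sa_handler = segv_backtrace_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGSEGV, &sa, NULL);
}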



Re: [OMPI devel] How to debug segv

2012-04-25 Thread George Bosilca
Alex,

You got the banner of the FT benchmark, so I guess at least the rank 0 
successfully completed the MPI_Init call. This is a hint that you should 
investigate more into the point-to-point logic of your mosix BTL.

  george.

On Apr 25, 2012, at 09:30 , Alex Margolin wrote:

> NAS Parallel Benchmarks 3.3 -- FT Benchmark
> 
> No input file inputft.data. Using compiled defaults
> Size:   64x  64x  64
> Iterations  :  6
> Number of processes :  4
> Processor array : 1x   4
> Layout type : 1D




[OMPI devel] Fwd: GNU autoconf 2.69 released [stable]

2012-04-25 Thread Jeffrey Squyres
There are a number of new Autoconf macros that would be useful for OMPI's 
Fortran configury.  Meaning: we have klugearounds in our existing configury, 
but the new AC 2.69 macros are Better.

How would people feel about upgrading the autoconf requirement on the trunk to 
AC 2.69?

(Terry: please add this to the agenda for next Tuesday; thanks)


Begin forwarded message:

> From: Eric Blake 
> Subject: GNU autoconf 2.69 released [stable]
> Date: April 24, 2012 11:32:32 PM EDT
> To: info-...@gnu.org, Autoconf 
> Cc: autotools-annou...@gnu.org, "bug-autoc...@gnu.org" 
> Reply-To: Autoconf 
> 
> The GNU Autoconf team is pleased to announce the stable release of
> Autoconf 2.69.  Autoconf is an extensible package of M4 macros that
> produce shell scripts to automatically configure software source code
> packages.  These scripts can adapt the packages to many kinds of
> UNIX-like systems without manual user intervention.  Autoconf creates a
> configuration script for a package from a template file that lists the
> operating system features that the package can use, in the form of M4
> macro calls.
> 
> Among other improvements, this release fixes a couple of regressions
> introduced in previous releases, greatly enhances Fortran support, adds
> Go support, and updates the documentation license.  It also requires
> that developer have perl 5.6 or newer when running autoconf (although
> generated configure scripts remain independent of perl, as always).  See
> a more complete list below.
> 
> Here are the compressed sources:
>  http://ftpmirror.gnu.org/autoconf/autoconf-2.69.tar.gz   (1.9MB)
>  http://ftpmirror.gnu.org/autoconf/autoconf-2.69.tar.xz   (1.2MB)
> 
> Here are the GPG detached signatures[*]:
>  http://ftpmirror.gnu.org/autoconf/autoconf-2.69.tar.gz.sig
>  http://ftpmirror.gnu.org/autoconf/autoconf-2.69.tar.xz.sig
> 
> Use a mirror for higher download bandwidth:
>  http://www.gnu.org/order/ftp.html
> 
> [*] Use a .sig file to verify that the corresponding file (without the
> .sig suffix) is intact.  First, be sure to download both the .sig file
> and the corresponding tarball.  Then, run a command like this:
> 
>  gpg --verify autoconf-2.69.tar.gz.sig
> 
> If that command fails because you don't have the required public key,
> then run this command to import it:
> 
>  gpg --keyserver keys.gnupg.net --recv-keys A7A16B4A2527436A
> 
> and rerun the 'gpg --verify' command.
> 
> This release was bootstrapped with the following tools:
>  Automake 1.11.1
> 
> NEWS
> 
> * Noteworthy changes in release 2.69 (2012-04-24) [stable]
> 
> ** Autoconf now requires perl 5.6 or better (but generated configure
>   scripts continue to run without perl).
> 
> * Noteworthy changes in release 2.68b (2012-03-01) [beta]
>  Released by Eric Blake, based on git versions 2.68.*.
> 
> ** Autoconf-generated configure scripts now unconditionally re-execute
>   themselves with $CONFIG_SHELL, if that's set in the environment.
> 
> ** The texinfo documentation no longer specifies "front-cover" or
>   "back-cover" texts, so that it may now be included in Debian's
>   "main" section.
> 
> ** Support for the Go programming language has been added.  The new
>   macro AC_LANG_GO sets variables GOC and GOFLAGS.
> 
> ** AS_LITERAL_IF again treats '=' as a literal.  Regression introduced
>   in 2.66.
> 
> ** The macro AS_EXECUTABLE_P, present since 2.50, is now documented.
> 
> ** Macros
> 
> - AC_PROG_LN_S and AS_LN_S now fall back on 'cp -pR' (not 'cp -p') if
>  'ln -s' does not work.  This works better for symlinks to directories.
> 
> - New macro AC_HEADER_CHECK_STDBOOL.
> 
> - New and updated macros for Fortran support:
> 
>AC_FC_CHECK_BOUNDS to enable array bounds checking
>AC_F77_IMPLICIT_NONE and AC_FC_IMPLICIT_NONE to disable implicit integer
>AC_FC_MODULE_EXTENSION to compute the Fortran 90 module name extension
>AC_FC_MODULE_FLAG for the Fortran 90 module search path flag
>AC_FC_MODULE_OUTPUT_FLAG for the Fortran 90 module output directory flag
>AC_FC_PP_SRCEXT for preprocessed Fortran source files extensions
>AC_FC_PP_DEFINE for the Fortran preprocessor define flag
> 
> -- 
> Eric Blake, on behalf of
> The GNU Autoconf team
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] 1.6rc1 has been released

2012-04-25 Thread Jeff Squyres
Note that Open MPI 1.6 is the evolution of the 1.5 series -- it is not a new 
branch from the SVN trunk.  Hence, 1.6 is essentially a bunch of bug fixes on 
top of 1.5.5.

Please test:

http://www.open-mpi.org/software/ompi/v1.6/

(note that the 1.6 page is not linked to from anywhere on the OMPI site yet)

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] libevent socket code

2012-04-25 Thread Nathan Hjelm

Anyone object if I #if 0 out all the socket code in libevent? We see lots of 
static compilation warnings because of that code, and nothing in Open MPI uses it.

-Nathan


Re: [OMPI devel] libevent socket code

2012-04-25 Thread Ralph Castain
Can't it be done with configuring --without-libevent-sockets or some such 
thing? I really hate munging the code directly as it creates lots of support 
issues and makes it harder to upgrade.






Re: [OMPI devel] libevent socket code

2012-04-25 Thread Jeff Squyres
On Apr 25, 2012, at 12:50 PM, Ralph Castain wrote:

> Can't it be done with configuring --without-libevent-sockets or some such 
> thing? I really hate munging the code directly as it creates lots of support 
> issues and makes it harder to upgrade.

If there's a libevent configure option we should be using, we can probably set 
that to be enabled by default.  Let me know.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] libevent socket code

2012-04-25 Thread Nathan Hjelm

Let me take a look. The code in question is in evutil.c and bufferevent_sock.c. 
If there is no such option, we might be able to get away with just removing 
these files from the Makefile.am.

-Nathan




Re: [OMPI devel] How to debug segv

2012-04-25 Thread Alex Margolin

I guess you are right.

I started looking into the communication passing between processes and I 
may have found a problem with the way I handle "reserved" data requested 
at prepare_src()... I've tried to write pretty much the same as TCP (the 
relevant code is around "if(opal_convertor_need_buffers(convertor))") 
and when I copy the buffered data to (frag+1) the program works. When I 
try to optimize the code by allowing the segment to point to the 
original location, I get MPI_ERR_TRUNCATE. I've printed out the data 
sent and received, and what I got ("[]" for sent, "<>" for received, 
running osu_latency) is appended below.


Question is: Where is the code which is responsible for writing the 
reserved data?


Thanks,
Alex


Always assume opal_convertor_need_buffers - works (97 is the application 
data, preceded by 14 reserved bytes):


...
[65,0,0,0,1,0,0,0,1,0,0,0,89,-112,97,97,97,97,]
<65,0,0,0,1,0,0,0,1,0,0,0,89,-112,97,97,97,97,>
[65,0,0,0,0,0,0,0,1,0,0,0,90,-112,97,97,97,97,]
<65,0,0,0,0,0,0,0,1,0,0,0,90,-112,97,97,97,97,>
[65,0,0,0,1,0,0,0,1,0,0,0,90,-112,97,97,97,97,]
<65,0,0,0,1,0,0,0,1,0,0,0,90,-112,97,97,97,97,>
[65,0,0,0,0,0,0,0,1,0,0,0,91,-112,97,97,97,97,]
<65,0,0,0,0,0,0,0,1,0,0,0,91,-112,97,97,97,97,>
[65,0,0,0,1,0,0,0,1,0,0,0,91,-112,97,97,97,97,]
...

Detect when not opal_convertor_need_buffers - fails:

...
[65,0,0,0,0,0,0,0,1,0,0,0,-15,85,]
<65,0,0,0,0,0,0,0,1,0,0,0,-15,85,97,>
[65,0,0,0,1,0,0,0,1,0,0,0,-15,85,]
<65,0,0,0,1,0,0,0,1,0,0,0,-15,85,97,>
[65,0,0,0,0,0,0,0,1,0,0,0,-14,85,]
<65,0,0,0,0,0,0,0,1,0,0,0,-14,85,97,>
[65,0,0,0,1,0,0,0,1,0,0,0,-14,85,]
<65,0,0,0,1,0,0,0,1,0,0,0,-14,85,97,>
[65,0,0,0,1,0,0,0,-16,-1,-1,-1,-13,85,]
1   453.26
[65,0,0,0,0,0,0,0,-16,-1,-1,-1,-13,85,]
<65,0,0,0,0,0,0,0,-16,-1,-1,-1,-13,85,97,>
<65,0,0,0,1,0,0,0,-16,-1,-1,-1,-13,85,97,>
[singularity:13509] *** An error occurred in MPI_Barrier
[singularity:13509] *** reported by process [2239889409,140733193388033]
[singularity:13509] *** on communicator MPI_COMM_WORLD
[singularity:13509] *** MPI_ERR_TRUNCATE: message truncated
[singularity:13509] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,

[singularity:13509] ***and potentially your MPI job)
[singularity:13507] 1 more process has sent help message 
help-mpi-errors.txt / mpi_errors_are_fatal
[singularity:13507] Set MCA parameter "orte_base_help_aggregate" to 0 to 
see all help / error messages

alex@singularity:~/huji/benchmarks/mpi/osu-micro-benchmarks-3.5.2$





Re: [OMPI devel] How to debug segv

2012-04-25 Thread Shamis, Pavel
Alex,
+1 vote for core. It is a good starting point.

* If you can't generate a core file for some reason, you may drop a while(1) 
loop somewhere in the init code and attach gdb later (a minimal sketch of this 
follows below).
* If you are looking for a more user-friendly experience, you may try Allinea 
DDT (they have a 30-day trial version).
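
A minimal sketch of that attach trick, assuming the loop is dropped into the 
init path under suspicion (the helper name and message are only illustrative):

#include <stdio.h>
#include <unistd.h>

/* Park the process so a debugger can be attached with "gdb -p <pid>".
 * From gdb, "set var wait_for_debugger = 0" lets the process continue. */
static void wait_for_gdb_attach(void)
{
    volatile int wait_for_debugger = 1;

    fprintf(stderr, "PID %d waiting for gdb attach\n", (int) getpid());
    while (wait_for_debugger) {
        sleep(1);
    }
}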

Regards,
Pasha.

> Another thing to try is to load up the core file in gdb and see if that gives 
> you a valid stack trace of where exactly the segv occurred.
Re: [OMPI devel] How to debug segv

2012-04-25 Thread George Bosilca

On Apr 25, 2012, at 13:59 , Alex Margolin wrote:

> I guess you are right.
> 
> I started looking into the communication passing between processes and I may 
> have found a problem with the way I handle "reserved" data requested at 
> prepare_src()... I've tried to write pretty much the same as TCP (the 
> relevant code is around "if(opal_convertor_need_buffers(convertor))") and 
> when I copy the buffered data to (frag+1) the program works. When I try to 
> optimize the code by allowing the segment to point to the original location, 
> I get MPI_ERR_TRUNCATE. I've printed out the data sent and received, and what 
> I got ("[]" for sent, "<>" for received, running osu_latency) is appended 
> below.
> 
> Question is: Where is the code which is responsible for writing the reserved 
> data?

It is the PML headers. Based on the error you reported OMPI is complaining 
about truncated data on an MPI_Barrier … that's quite bad as the barrier is one 
of the few operations that do not manipulate any data. I guess the PML headers 
are not located at the expected displacement in the fragment, so the PML is 
using wrong values.
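
Roughly, the layout the TCP BTL produces for its source descriptor (and which 
the PML header handling expects) looks like the sketch below; the fragment 
type is a hypothetical stand-in for the mosix one, the allocation of 
reserve + *size payload bytes at (frag + 1) is assumed, and building it 
requires the OMPI source tree:

#include <sys/uio.h>
#include "opal/datatype/opal_convertor.h"
#include "ompi/mca/btl/btl.h"

typedef struct {                         /* hypothetical mosix fragment */
    mca_btl_base_descriptor_t base;
    mca_btl_base_segment_t    segments[2];
} mosix_frag_sketch_t;

static mca_btl_base_descriptor_t *prepare_src_sketch(
    mosix_frag_sketch_t *frag, struct opal_convertor_t *convertor,
    size_t reserve, size_t *size)
{
    struct iovec iov;
    uint32_t iov_count = 1;
    size_t max_data = *size;

    if (opal_convertor_need_buffers(convertor)) {
        /* Buffered path: pack the user data directly behind the reserve
         * bytes, so the PML header and payload form one contiguous segment. */
        iov.iov_base = (char *)(frag + 1) + reserve;
        iov.iov_len  = max_data;
        opal_convertor_pack(convertor, &iov, &iov_count, &max_data);
        frag->segments[0].seg_addr.pval = frag + 1;
        frag->segments[0].seg_len       = reserve + max_data;
        frag->base.des_src_cnt          = 1;
    } else {
        /* In-place path: the reserve still lives in the fragment (the PML
         * writes its header into des_src[0]); a second segment points at the
         * user buffer that opal_convertor_pack() hands back. */
        iov.iov_base = NULL;
        iov.iov_len  = max_data;
        opal_convertor_pack(convertor, &iov, &iov_count, &max_data);
        frag->segments[0].seg_addr.pval = frag + 1;
        frag->segments[0].seg_len       = reserve;
        frag->segments[1].seg_addr.pval = iov.iov_base;
        frag->segments[1].seg_len       = max_data;
        frag->base.des_src_cnt          = 2;
    }
    frag->base.des_src = frag->segments;
    *size = max_data;
    return &frag->base;
}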

  george.

