Jeff,
I finally worked out why I couldn't reproduce the problem. You're not
going to like it though.
As before, this is running on FC4 and I'm using 1.0.1r8453 (the 1.0.1
release version).
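For reference, x.c is nothing more than a rank-printing hello world; roughly this (a sketch from memory, not necessarily the exact file):

/* x.c (sketch): init MPI, print this process's rank, finalize */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("my tid is %d\n", rank);
    MPI_Finalize();
    return 0;
}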
First test:
$ ./configure --with-devel-headers --prefix=/usr/local/ompi
$ make
$ make install
$ mpicc -o x x.c
$ mpirun -d -np 2 ./x
[localhost.localdomain:10085] [0,0,0] setting up session dir with
[localhost.localdomain:10085] universe default-universe
[localhost.localdomain:10085] user greg
[localhost.localdomain:10085] host localhost.localdomain
[localhost.localdomain:10085] jobid 0
[localhost.localdomain:10085] procid 0
[localhost.localdomain:10085] procdir: /tmp/openmpi-sessions-
greg@localhost.localdomain_0/default-universe/0/0
[localhost.localdomain:10085] jobdir: /tmp/openmpi-sessions-
greg@localhost.localdomain_0/default-universe/0
[localhost.localdomain:10085] unidir: /tmp/openmpi-sessions-
greg@localhost.localdomain_0/default-universe
[localhost.localdomain:10085] top: openmpi-sessions-
greg@localhost.localdomain_0
[localhost.localdomain:10085] tmp: /tmp
[localhost.localdomain:10085] [0,0,0] contact_file /tmp/openmpi-
sessions-greg@localhost.localdomain_0/default-universe/universe-
setup.txt
[localhost.localdomain:10085] [0,0,0] wrote setup file
[localhost.localdomain:10085] spawn: in job_state_callback(jobid = 1,
state = 0x1)
[localhost.localdomain:10085] pls:rsh: local csh: 0, local bash: 1
[localhost.localdomain:10085] pls:rsh: assuming same remote shell as
local shell
[localhost.localdomain:10085] pls:rsh: remote csh: 0, remote bash: 1
[localhost.localdomain:10085] pls:rsh: final template argv:
[localhost.localdomain:10085] pls:rsh: ssh <template> orted --
debug --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --
nodename <template> --universe greg@localhost.localdomain:default-
universe --nsreplica "0.0.0;tcp://10.0.1.103:32818" --gprreplica
"0.0.0;tcp://10.0.1.103:32818" --mpi-call-yield 0
[localhost.localdomain:10085] pls:rsh: launching on node localhost
[localhost.localdomain:10085] pls:rsh: oversubscribed -- setting
mpi_yield_when_idle to 1 (1 2)
[localhost.localdomain:10085] sess_dir_finalize: proc session dir not
empty - leaving
[localhost.localdomain:10085] spawn: in job_state_callback(jobid = 1,
state = 0xa)
mpirun noticed that job rank 1 with PID 0 on node "localhost" exited
on signal 11.
[localhost.localdomain:10085] sess_dir_finalize: proc session dir not
empty - leaving
[localhost.localdomain:10085] spawn: in job_state_callback(jobid = 1,
state = 0x9)
[localhost.localdomain:10085] ERROR: A daemon on node localhost
failed to start as expected.
[localhost.localdomain:10085] ERROR: There may be more information
available from
[localhost.localdomain:10085] ERROR: the remote shell (see above).
[localhost.localdomain:10085] The daemon received a signal 11 (with
core).
1 additional process aborted (not shown)
[localhost.localdomain:10085] sess_dir_finalize: found proc session
dir empty - deleting
[localhost.localdomain:10085] sess_dir_finalize: found job session
dir empty - deleting
[localhost.localdomain:10085] sess_dir_finalize: found univ session
dir empty - deleting
[localhost.localdomain:10085] sess_dir_finalize: found top session
dir empty - deleting
Here's the stack trace from the core file:
#0 0x00e93fe8 in orte_pls_rsh_launch ()
from /usr/local/ompi/lib/openmpi/mca_pls_rsh.so
#1 0x0023c642 in orte_rmgr_urm_spawn ()
from /usr/local/ompi/lib/openmpi/mca_rmgr_urm.so
#2 0x0804a0d4 in orterun (argc=5, argv=0xbfab2e84) at orterun.c:373
#3 0x08049b16 in main (argc=5, argv=0xbfab2e84) at main.c:13
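(For reference, the backtrace was taken the usual way, by pointing gdb at the binary that dropped the core; roughly:
$ gdb /usr/local/ompi/bin/mpirun core.<pid>
(gdb) bt
where the core file name is whatever the system actually wrote out.)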
Now reconfigure with debugging enabled:
$ CFLAGS=-g ./configure --with-devel-headers --prefix=/usr/local/ompi
$ make
$ make install
$ mpicc -o x x.c
$ mpirun -d -np 2 ./x
[localhost.localdomain:10166] [0,0,0] setting up session dir with
[localhost.localdomain:10166] universe default-universe
[localhost.localdomain:10166] user greg
[localhost.localdomain:10166] host localhost.localdomain
[localhost.localdomain:10166] jobid 0
[localhost.localdomain:10166] procid 0
[localhost.localdomain:10166] procdir: /tmp/openmpi-sessions-
greg@localhost.localdomain_0/default-universe/0/0
[localhost.localdomain:10166] jobdir: /tmp/openmpi-sessions-
greg@localhost.localdomain_0/default-universe/0
[localhost.localdomain:10166] unidir: /tmp/openmpi-sessions-
greg@localhost.localdomain_0/default-universe
[localhost.localdomain:10166] top: openmpi-sessions-
greg@localhost.localdomain_0
[localhost.localdomain:10166] tmp: /tmp
[localhost.localdomain:10166] [0,0,0] contact_file /tmp/openmpi-
sessions-greg@localhost.localdomain_0/default-universe/universe-
setup.txt
[localhost.localdomain:10166] [0,0,0] wrote setup file
[localhost.localdomain:10166] spawn: in job_state_callback(jobid = 1,
state = 0x1)
[localhost.localdomain:10166] pls:rsh: local csh: 0, local bash: 1
[localhost.localdomain:10166] pls:rsh: assuming same remote shell as
local shell
[localhost.localdomain:10166] pls:rsh: remote csh: 0, remote bash: 1
[localhost.localdomain:10166] pls:rsh: final template argv:
[localhost.localdomain:10166] pls:rsh: ssh <template> orted --
debug --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --
nodename <template> --universe greg@localhost.localdomain:default-
universe --nsreplica "0.0.0;tcp://10.0.1.103:32820" --gprreplica
"0.0.0;tcp://10.0.1.103:32820" --mpi-call-yield 0
[localhost.localdomain:10166] pls:rsh: launching on node localhost
[localhost.localdomain:10166] pls:rsh: oversubscribed -- setting
mpi_yield_when_idle to 1 (1 2)
[localhost.localdomain:10166] pls:rsh: localhost is a LOCAL node
[localhost.localdomain:10166] pls:rsh: executing: orted --debug --
bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename
localhost --universe greg@localhost.localdomain:default-universe --
nsreplica "0.0.0;tcp://10.0.1.103:32820" --gprreplica "0.0.0;tcp://
10.0.1.103:32820" --mpi-call-yield 1
[localhost.localdomain:10167] [0,0,1] setting up session dir with
[localhost.localdomain:10167] universe default-universe
[localhost.localdomain:10167] user greg
[localhost.localdomain:10167] host localhost
[localhost.localdomain:10167] jobid 0
[localhost.localdomain:10167] procid 1
[localhost.localdomain:10167] procdir: /tmp/openmpi-sessions-
greg@localhost_0/default-universe/0/1
[localhost.localdomain:10167] jobdir: /tmp/openmpi-sessions-
greg@localhost_0/default-universe/0
[localhost.localdomain:10167] unidir: /tmp/openmpi-sessions-
greg@localhost_0/default-universe
[localhost.localdomain:10167] top: openmpi-sessions-greg@localhost_0
[localhost.localdomain:10167] tmp: /tmp
[localhost.localdomain:10169] [0,1,1] setting up session dir with
[localhost.localdomain:10169] universe default-universe
[localhost.localdomain:10169] user greg
[localhost.localdomain:10169] host localhost
[localhost.localdomain:10169] jobid 1
[localhost.localdomain:10169] procid 1
[localhost.localdomain:10169] procdir: /tmp/openmpi-sessions-
greg@localhost_0/default-universe/1/1
[localhost.localdomain:10169] jobdir: /tmp/openmpi-sessions-
greg@localhost_0/default-universe/1
[localhost.localdomain:10169] unidir: /tmp/openmpi-sessions-
greg@localhost_0/default-universe
[localhost.localdomain:10169] top: openmpi-sessions-greg@localhost_0
[localhost.localdomain:10169] tmp: /tmp
[localhost.localdomain:10170] [0,1,0] setting up session dir with
[localhost.localdomain:10170] universe default-universe
[localhost.localdomain:10170] user greg
[localhost.localdomain:10170] host localhost
[localhost.localdomain:10170] jobid 1
[localhost.localdomain:10170] procid 0
[localhost.localdomain:10170] procdir: /tmp/openmpi-sessions-
greg@localhost_0/default-universe/1/0
[localhost.localdomain:10170] jobdir: /tmp/openmpi-sessions-
greg@localhost_0/default-universe/1
[localhost.localdomain:10170] unidir: /tmp/openmpi-sessions-
greg@localhost_0/default-universe
[localhost.localdomain:10170] top: openmpi-sessions-greg@localhost_0
[localhost.localdomain:10170] tmp: /tmp
[localhost.localdomain:10166] spawn: in job_state_callback(jobid = 1,
state = 0x3)
[localhost.localdomain:10166] Info: Setting up debugger process table
for applications
MPIR_being_debugged = 0
MPIR_debug_gate = 0
MPIR_debug_state = 1
MPIR_acquired_pre_main = 0
MPIR_i_am_starter = 0
MPIR_proctable_size = 2
MPIR_proctable:
(i, host, exe, pid) = (0, localhost, ./x, 10169)
(i, host, exe, pid) = (1, localhost, ./x, 10170)
[localhost.localdomain:10166] spawn: in job_state_callback(jobid = 1,
state = 0x4)
[localhost.localdomain:10170] [0,1,0] ompi_mpi_init completed
[localhost.localdomain:10169] [0,1,1] ompi_mpi_init completed
my tid is 0
my tid is 1
[localhost.localdomain:10166] spawn: in job_state_callback(jobid = 1,
state = 0x7)
[localhost.localdomain:10166] spawn: in job_state_callback(jobid = 1,
state = 0x8)
[localhost.localdomain:10167] sess_dir_finalize: proc session dir not
empty - leaving
[localhost.localdomain:10170] sess_dir_finalize: found proc session
dir empty - deleting
[localhost.localdomain:10169] sess_dir_finalize: found proc session
dir empty - deleting
[localhost.localdomain:10170] sess_dir_finalize: job session dir not
empty - leaving
[localhost.localdomain:10167] sess_dir_finalize: proc session dir not
empty - leaving
[localhost.localdomain:10167] orted: job_state_callback(jobid = 1,
state = ORTE_PROC_STATE_TERMINATED)
[localhost.localdomain:10167] sess_dir_finalize: found proc session
dir empty - deleting
[localhost.localdomain:10167] sess_dir_finalize: found job session
dir empty - deleting
[localhost.localdomain:10167] sess_dir_finalize: found univ session
dir empty - deleting
[localhost.localdomain:10167] sess_dir_finalize: found top session
dir empty - deleting
So it looks like you're doing something that breaks under the default
optimization flags (I presume -O3) but works once CFLAGS=-g overrides them.
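If it would help to confirm that, I can rebuild with debug symbols but the
optimizer forced back on; something along these lines (guessing that -O3 is
what the default build uses):
$ CFLAGS="-g -O3" ./configure --with-devel-headers --prefix=/usr/local/ompi
$ make
$ make install
$ mpicc -o x x.c
$ mpirun -d -np 2 ./x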
FC4 is using gcc 4.0.0 20050519.
Suggestions on how to proceed would be appreciated.
Greg
On Dec 1, 2005, at 9:19 AM, Jeff Squyres wrote:
On Dec 1, 2005, at 10:58 AM, Greg Watson wrote:
@#$%^& it! I can't get the problem to manifest for either branch now.
Well, that's good for me. :-)
FWIW, the problem existed on systems that could/would return different
addresses in different processes from mmap() for memory that was
common
to all of them. E.g., if processes A and B share common memory Z, A
would get virtual address M for Z, and B would get virtual address N
(as opposed to both A and B getting virtual address M).
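To make that concrete, here is a minimal standalone sketch (not Open MPI's
sm code; the shm object name is just illustrative) of two processes mapping
the same shared object, where nothing guarantees that the base addresses
they get back will match:

/* build with: gcc mmap_demo.c -o mmap_demo -lrt */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const char *name = "/mmap_addr_demo";
    const size_t len = 4096;

    /* create (or open) a shared memory object and size it */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, (off_t)len) != 0) {
        perror("shm setup");
        return 1;
    }

    if (fork() == 0) {
        /* child: its own, independent mapping of the same object */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        printf("child:  mapped at %p\n", p);
        _exit(0);
    }

    /* parent: may or may not get the same virtual address as the child,
     * which is why pointers stored inside the region have to be offsets */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    printf("parent: mapped at %p\n", p);

    wait(NULL);
    shm_unlink(name);
    return 0;
}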
Here's the history of what happened...
We had code paths for that situation in the sm btl (i.e., when A and B
get different addresses for the same shared memory), but
unbeknownst to
us, mmap() on most systems seems to return the same value in A and B
(this could be a side-effect of typical MPI testing memory usage
patterns... I don't know).
But FC3 and FC4 consistently did not seem to follow this pattern --
they would return different values from mmap() in different processes.
Unfortunately, we did not do any testing on FC3 or FC4 systems until a
few weeks before SC, and discovered that some of our
previously-unknowingly-untested sm bootstrap code paths had some bugs.
I fixed all of those and brought [almost all of] them over to the 1.0
release branch. I missed one patch in v1.0, but it will be
included in
v1.0.1, to be released shortly.
So I'd be surprised if you were still seeing this bug in either branch
-- as far as I know, I fixed all the issues. More specifically, if
you
see this behavior, it will probably be in *both* branches.
Let me know if you run across it again. Thanks!
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/