Jeff,

I finally worked out why I couldn't reproduce the problem. You're not going to like it though.

As before, this is running on FC4 and I'm using 1.0.1r8453 (the 1.0.1 release version).
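
For reference, x.c is just a trivial "print my rank" test -- roughly the following (reconstructed from memory, so it may not be character-for-character what I ran):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;

    /* minimal test: each process reports its rank */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("my tid is %d\n", rank);
    MPI_Finalize();
    return 0;
}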

First test:

$ ./configure --with-devel-headers --prefix=/usr/local/ompi
$ make
$ make install
$ mpicc -o x x.c
$ mpirun -d -np 2 ./x

[localhost.localdomain:10085] [0,0,0] setting up session dir with
[localhost.localdomain:10085]   universe default-universe
[localhost.localdomain:10085]   user greg
[localhost.localdomain:10085]   host localhost.localdomain
[localhost.localdomain:10085]   jobid 0
[localhost.localdomain:10085]   procid 0
[localhost.localdomain:10085] procdir: /tmp/openmpi-sessions-greg@localhost.localdomain_0/default-universe/0/0
[localhost.localdomain:10085] jobdir: /tmp/openmpi-sessions-greg@localhost.localdomain_0/default-universe/0
[localhost.localdomain:10085] unidir: /tmp/openmpi-sessions-greg@localhost.localdomain_0/default-universe
[localhost.localdomain:10085] top: openmpi-sessions-greg@localhost.localdomain_0
[localhost.localdomain:10085] tmp: /tmp
[localhost.localdomain:10085] [0,0,0] contact_file /tmp/openmpi-sessions-greg@localhost.localdomain_0/default-universe/universe-setup.txt
[localhost.localdomain:10085] [0,0,0] wrote setup file
[localhost.localdomain:10085] spawn: in job_state_callback(jobid = 1, state = 0x1)
[localhost.localdomain:10085] pls:rsh: local csh: 0, local bash: 1
[localhost.localdomain:10085] pls:rsh: assuming same remote shell as local shell
[localhost.localdomain:10085] pls:rsh: remote csh: 0, remote bash: 1
[localhost.localdomain:10085] pls:rsh: final template argv:
[localhost.localdomain:10085] pls:rsh: ssh <template> orted --debug --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename <template> --universe greg@localhost.localdomain:default-universe --nsreplica "0.0.0;tcp://10.0.1.103:32818" --gprreplica "0.0.0;tcp://10.0.1.103:32818" --mpi-call-yield 0
[localhost.localdomain:10085] pls:rsh: launching on node localhost
[localhost.localdomain:10085] pls:rsh: oversubscribed -- setting mpi_yield_when_idle to 1 (1 2)
[localhost.localdomain:10085] sess_dir_finalize: proc session dir not empty - leaving
[localhost.localdomain:10085] spawn: in job_state_callback(jobid = 1, state = 0xa)
mpirun noticed that job rank 1 with PID 0 on node "localhost" exited on signal 11.
[localhost.localdomain:10085] sess_dir_finalize: proc session dir not empty - leaving
[localhost.localdomain:10085] spawn: in job_state_callback(jobid = 1, state = 0x9)
[localhost.localdomain:10085] ERROR: A daemon on node localhost failed to start as expected.
[localhost.localdomain:10085] ERROR: There may be more information available from
[localhost.localdomain:10085] ERROR: the remote shell (see above).
[localhost.localdomain:10085] The daemon received a signal 11 (with core).
1 additional process aborted (not shown)
[localhost.localdomain:10085] sess_dir_finalize: found proc session dir empty - deleting
[localhost.localdomain:10085] sess_dir_finalize: found job session dir empty - deleting
[localhost.localdomain:10085] sess_dir_finalize: found univ session dir empty - deleting
[localhost.localdomain:10085] sess_dir_finalize: found top session dir empty - deleting

Here's the stack trace from the core file:

#0  0x00e93fe8 in orte_pls_rsh_launch ()
   from /usr/local/ompi/lib/openmpi/mca_pls_rsh.so
#1  0x0023c642 in orte_rmgr_urm_spawn ()
   from /usr/local/ompi/lib/openmpi/mca_rmgr_urm.so
#2  0x0804a0d4 in orterun (argc=5, argv=0xbfab2e84) at orterun.c:373
#3  0x08049b16 in main (argc=5, argv=0xbfab2e84) at main.c:13
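
(That backtrace is from gdb against the core that orterun dropped -- roughly:

$ gdb /usr/local/ompi/bin/orterun core.<pid>
(gdb) bt

where core.<pid> stands in for whatever the core file is actually named on this system.)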

Now reconfigure with debugging enabled:

$ CFLAGS=-g ./configure --with-devel-headers --prefix=/usr/local/ompi
$ make
$ make install
$ mpicc -o x x.c
$ mpirun -d -np 2 ./x

[localhost.localdomain:10166] [0,0,0] setting up session dir with
[localhost.localdomain:10166]   universe default-universe
[localhost.localdomain:10166]   user greg
[localhost.localdomain:10166]   host localhost.localdomain
[localhost.localdomain:10166]   jobid 0
[localhost.localdomain:10166]   procid 0
[localhost.localdomain:10166] procdir: /tmp/openmpi-sessions-greg@localhost.localdomain_0/default-universe/0/0
[localhost.localdomain:10166] jobdir: /tmp/openmpi-sessions-greg@localhost.localdomain_0/default-universe/0
[localhost.localdomain:10166] unidir: /tmp/openmpi-sessions-greg@localhost.localdomain_0/default-universe
[localhost.localdomain:10166] top: openmpi-sessions-greg@localhost.localdomain_0
[localhost.localdomain:10166] tmp: /tmp
[localhost.localdomain:10166] [0,0,0] contact_file /tmp/openmpi-sessions-greg@localhost.localdomain_0/default-universe/universe-setup.txt
[localhost.localdomain:10166] [0,0,0] wrote setup file
[localhost.localdomain:10166] spawn: in job_state_callback(jobid = 1, state = 0x1)
[localhost.localdomain:10166] pls:rsh: local csh: 0, local bash: 1
[localhost.localdomain:10166] pls:rsh: assuming same remote shell as local shell
[localhost.localdomain:10166] pls:rsh: remote csh: 0, remote bash: 1
[localhost.localdomain:10166] pls:rsh: final template argv:
[localhost.localdomain:10166] pls:rsh: ssh <template> orted --debug --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename <template> --universe greg@localhost.localdomain:default-universe --nsreplica "0.0.0;tcp://10.0.1.103:32820" --gprreplica "0.0.0;tcp://10.0.1.103:32820" --mpi-call-yield 0
[localhost.localdomain:10166] pls:rsh: launching on node localhost
[localhost.localdomain:10166] pls:rsh: oversubscribed -- setting mpi_yield_when_idle to 1 (1 2)
[localhost.localdomain:10166] pls:rsh: localhost is a LOCAL node
[localhost.localdomain:10166] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost --universe greg@localhost.localdomain:default-universe --nsreplica "0.0.0;tcp://10.0.1.103:32820" --gprreplica "0.0.0;tcp://10.0.1.103:32820" --mpi-call-yield 1
[localhost.localdomain:10167] [0,0,1] setting up session dir with
[localhost.localdomain:10167]   universe default-universe
[localhost.localdomain:10167]   user greg
[localhost.localdomain:10167]   host localhost
[localhost.localdomain:10167]   jobid 0
[localhost.localdomain:10167]   procid 1
[localhost.localdomain:10167] procdir: /tmp/openmpi-sessions-greg@localhost_0/default-universe/0/1
[localhost.localdomain:10167] jobdir: /tmp/openmpi-sessions-greg@localhost_0/default-universe/0
[localhost.localdomain:10167] unidir: /tmp/openmpi-sessions-greg@localhost_0/default-universe
[localhost.localdomain:10167] top: openmpi-sessions-greg@localhost_0
[localhost.localdomain:10167] tmp: /tmp
[localhost.localdomain:10169] [0,1,1] setting up session dir with
[localhost.localdomain:10169]   universe default-universe
[localhost.localdomain:10169]   user greg
[localhost.localdomain:10169]   host localhost
[localhost.localdomain:10169]   jobid 1
[localhost.localdomain:10169]   procid 1
[localhost.localdomain:10169] procdir: /tmp/openmpi-sessions-greg@localhost_0/default-universe/1/1
[localhost.localdomain:10169] jobdir: /tmp/openmpi-sessions-greg@localhost_0/default-universe/1
[localhost.localdomain:10169] unidir: /tmp/openmpi-sessions-greg@localhost_0/default-universe
[localhost.localdomain:10169] top: openmpi-sessions-greg@localhost_0
[localhost.localdomain:10169] tmp: /tmp
[localhost.localdomain:10170] [0,1,0] setting up session dir with
[localhost.localdomain:10170]   universe default-universe
[localhost.localdomain:10170]   user greg
[localhost.localdomain:10170]   host localhost
[localhost.localdomain:10170]   jobid 1
[localhost.localdomain:10170]   procid 0
[localhost.localdomain:10170] procdir: /tmp/openmpi-sessions-greg@localhost_0/default-universe/1/0
[localhost.localdomain:10170] jobdir: /tmp/openmpi-sessions-greg@localhost_0/default-universe/1
[localhost.localdomain:10170] unidir: /tmp/openmpi-sessions-greg@localhost_0/default-universe
[localhost.localdomain:10170] top: openmpi-sessions-greg@localhost_0
[localhost.localdomain:10170] tmp: /tmp
[localhost.localdomain:10166] spawn: in job_state_callback(jobid = 1, state = 0x3)
[localhost.localdomain:10166] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 2
  MPIR_proctable:
    (i, host, exe, pid) = (0, localhost, ./x, 10169)
    (i, host, exe, pid) = (1, localhost, ./x, 10170)
[localhost.localdomain:10166] spawn: in job_state_callback(jobid = 1, state = 0x4)
[localhost.localdomain:10170] [0,1,0] ompi_mpi_init completed
[localhost.localdomain:10169] [0,1,1] ompi_mpi_init completed
my tid is 0
my tid is 1
[localhost.localdomain:10166] spawn: in job_state_callback(jobid = 1, state = 0x7)
[localhost.localdomain:10166] spawn: in job_state_callback(jobid = 1, state = 0x8)
[localhost.localdomain:10167] sess_dir_finalize: proc session dir not empty - leaving
[localhost.localdomain:10170] sess_dir_finalize: found proc session dir empty - deleting
[localhost.localdomain:10169] sess_dir_finalize: found proc session dir empty - deleting
[localhost.localdomain:10170] sess_dir_finalize: job session dir not empty - leaving
[localhost.localdomain:10167] sess_dir_finalize: proc session dir not empty - leaving
[localhost.localdomain:10167] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_TERMINATED)
[localhost.localdomain:10167] sess_dir_finalize: found proc session dir empty - deleting
[localhost.localdomain:10167] sess_dir_finalize: found job session dir empty - deleting
[localhost.localdomain:10167] sess_dir_finalize: found univ session dir empty - deleting
[localhost.localdomain:10167] sess_dir_finalize: found top session dir empty - deleting

So it looks like you're doing something that breaks under the default optimization (I presume -O3) but works when the build is done with just CFLAGS=-g.

FC4 is using gcc 4.0.0 20050519.
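
If it would help narrow things down, I can rebuild with the optimization level pinned explicitly and walk it up until it breaks, something like:

$ CFLAGS="-g -O2" ./configure --with-devel-headers --prefix=/usr/local/ompi
$ make
$ make install

and then repeat with -O3 to confirm it really is the optimizer rather than some other difference in the debug build.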

Suggestions on how to proceed would be appreciated.

Greg

On Dec 1, 2005, at 9:19 AM, Jeff Squyres wrote:

On Dec 1, 2005, at 10:58 AM, Greg Watson wrote:

@#$%^& it! I can't get the problem to manifest for either branch now.

Well, that's good for me.  :-)

FWIW, the problem existed on systems that could/would return different
addresses in different processes from mmap() for memory that was common
to all of them.  E.g., if processes A and B share common memory Z, A
would get virtual address M for Z, and B would get virtual address N
(as opposed to both A and B getting virtual address M).
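
Just to make that concrete, here's a minimal sketch (not Open MPI code) of
the scenario: two processes map the same file, and nothing guarantees they
see it at the same virtual address, which is why anything stored inside the
mapping has to be expressed as offsets rather than raw pointers:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

int main(void)
{
    /* back the shared region with a small temp file */
    int fd = open("/tmp/shmem-demo", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, 4096) != 0) {
        perror("setup");
        return 1;
    }
    if (fork() == 0) {
        /* child maps the same region; its address may differ from the
           parent's (fork'd children often happen to match, but
           separately-launched MPI processes can and do differ) */
        void *z = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        printf("child:  mapped at %p\n", z);
        return 0;
    }
    void *z = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    printf("parent: mapped at %p\n", z);
    wait(NULL);
    return 0;
}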

Here's the history of what happened...

We had code paths for that situation in the sm btl (i.e., when A and B
get different addresses for the same shared memory), but unbeknownst to
us, mmap() on most systems seems to return the same value in A and B
(this could be a side-effect of typical MPI testing memory usage
patterns... I don't know).

But FC3 and FC4 consistently did not seem to follow this pattern --
they would return different values from mmap() in different processes.
Unfortunately, we did not do any testing on FC3 or FC4 systems until a
few weeks before SC, and discovered that some of our
previously-unknowingly-untested sm bootstrap code paths had some bugs.
I fixed all of those and brought [almost all of] them over to the 1.0
release branch. I missed one patch in v1.0, but it will be included in
v1.0.1, to be released shortly.

So I'd be surprised if you were still seeing this bug in either branch
-- as far as I know, I fixed all the issues. More specifically, if you see this behavior, it will probably be in *both* branches.

Let me know if you run across it again.  Thanks!

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
