On Mar 31, 2008, at 12:57 PM, Ralph H Castain wrote:




On 3/31/08 9:28 AM, "Josh Hursey" <jjhur...@open-mpi.org> wrote:

At the moment I only use unity with C/R, mostly because I have not
verified that the other components work properly under C/R conditions.
I can verify the others, but that doesn't solve the problem with the
unity component. :/

For the moment it is not critical that these jobs launch quickly, only
that they launch correctly. When you say 'slow the launch', do you mean
severely, as in seconds or minutes for small np?

I didn't say "severely" - I said "measurably". ;-)

It will require an additional communication to the daemons to let them
know how to talk to the procs. In the current unity component, the
daemons never talk to the procs themselves, and so they don't have the
contact info for rank=0.

Ah, I see.



I guess a follow-up question is: why did this component break in the
first place? Or, worded differently, what changed in ORTE such that the
unity component suddenly deadlocks when it didn't before?

We are trying to improve scalability. The biggest issue is the modex,
which we improved considerably by having the procs pass their modex
info to the daemons, letting each daemon collect the modex info from
the procs on its node, and then having the daemons send that info along
to the rank=0 proc for collection and xcast.

The problem is that in the unity component, the local daemons don't
know how to send the modex to the rank=0 proc. So what I will now have
to do is tell all the daemons how to talk to the procs, and then we
will have every daemon opening a socket to rank=0. That's where the
time will be lost.
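
To make that flow concrete, here is a rough single-process simulation
of the collection pattern described above, in plain C with in-memory
buffers standing in for the OOB sends; every name in it is illustrative
rather than a real ORTE symbol:

------------------------------------------------
/* Illustrative simulation of the daemon-relayed modex; buffers
 * stand in for OOB sends. No real ORTE symbols are used. */
#include <stdio.h>
#include <string.h>

#define NNODES          2
#define PROCS_PER_NODE  2

int main(void)
{
    char daemon_buf[NNODES][256];  /* per-node daemon collection */
    char rank0_buf[1024] = "";     /* rank=0's collected modex   */

    /* Step 1: each proc sends its modex entry to its local daemon,
     * which aggregates everything from its own node. */
    for (int n = 0; n < NNODES; n++) {
        daemon_buf[n][0] = '\0';
        for (int p = 0; p < PROCS_PER_NODE; p++) {
            char entry[64];
            snprintf(entry, sizeof(entry), "[node%d:proc%d]", n, p);
            strcat(daemon_buf[n], entry);
        }
    }

    /* Step 2: each daemon forwards its aggregate to the rank=0 proc.
     * Under unity this is the step that cannot happen, because the
     * daemons have no contact info for rank=0 - hence the hang. */
    for (int n = 0; n < NNODES; n++)
        strcat(rank0_buf, daemon_buf[n]);

    /* Step 3: rank=0 xcasts the collected modex back to all procs. */
    printf("xcast payload: %s\n", rank0_buf);
    return 0;
}
------------------------------------------------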

Our original expectation was to get everyone off of unity as quickly as
possible - in fact, Brian and I had planned to remove that component
entirely, as it (a) scales poorly and (b) gets in the way of things. It
is very hard to keep alive.

So for now, I'll just do the simple thing and hopefully that will be
adequate - let me know if/when you are able to get C/R working on other
routed components.

Sounds good. I'll look into supporting the tree routed component, but
that will probably take a couple of weeks.

Thanks for the clarification.

Cheers,
Josh



Thanks!
Ralph


Thanks for looking into this,
Josh

On Mar 31, 2008, at 11:10 AM, Ralph H Castain wrote:

I figured out the issue - there is a simple way and a hard way to fix
this. Before I do either, let me see what makes sense.

The simple solution involves updating the daemons with contact info for
the procs so that they can send their collected modex info to the
rank=0 proc. This will measurably slow the launch when using unity.

The hard solution is a hybrid routed approach whereby the daemons would
route any daemon-to-proc communication while the procs continue to do
direct proc-to-proc messaging.
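
Conceptually, the hybrid approach boils down to a per-message routing
decision along these lines (an illustrative C sketch with assumed
names, not actual ORTE code):

------------------------------------------------
/* Illustrative sketch of the hybrid routing decision; all names
 * here are assumed for the example, not real ORTE code. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { ROUTE_DIRECT, ROUTE_TREE } route_t;

static route_t select_route(bool i_am_daemon, bool peer_is_proc)
{
    /* Daemons send daemon-to-proc traffic through the routed tree... */
    if (i_am_daemon && peer_is_proc)
        return ROUTE_TREE;
    /* ...while procs keep direct connections, as unity does today. */
    return ROUTE_DIRECT;
}

int main(void)
{
    printf("daemon->proc: %s\n",
           select_route(true, true) == ROUTE_TREE ? "tree" : "direct");
    printf("proc->proc:   %s\n",
           select_route(false, true) == ROUTE_TREE ? "tree" : "direct");
    return 0;
}
------------------------------------------------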

Is there some reason to be using the "unity" component? Do you care if
jobs using unity launch slower?

Thanks
Ralph



On 3/31/08 7:57 AM, "Josh Hursey" <jjhur...@open-mpi.org> wrote:

Ralph,

I've just noticed that the 'unity' routed component seems to be broken
when using more than one machine. I'm using Odin and r18028 of the
trunk, and have confirmed that this problem occurs with both SLURM and
rsh. I think this break came in on Friday, as that is when some of my
MTT tests started to hang and fail, but I cannot point to a specific
revision at this point. The backtraces (enclosed) of the processes
point to the grpcomm allgather routine.

The 'noop' program calls MPI_Init, sleeps, then calls MPI_Finalize.
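
For reference, a minimal sketch of what such a test looks like (the
real noop.c also takes flags such as "-v 1", which this sketch ignores,
and the sleep duration here is an assumption):

------------------------------------------------
/* Minimal sketch of the 'noop' test: MPI_Init, sleep, MPI_Finalize.
 * Illustrative only; the actual noop.c may differ. */
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    sleep(30);    /* idle; duration is an assumption */
    MPI_Finalize();
    return 0;
}
------------------------------------------------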

RSH example from odin023 - so no SLURM variables:
These work:
shell$ mpirun -np 2 -host odin023  noop -v 1
shell$ mpirun -np 2 -host odin023,odin024  noop -v 1
shell$ mpirun -np 2 -mca routed unity -host odin023  noop -v 1

This hangs:
shell$ mpirun -np 2 -mca routed unity -host odin023,odin024  noop -v 1


If I attach to the 'noop' process on odin023 I get the following
backtrace:
------------------------------------------------
(gdb) bt
#0  0x0000002a96226b39 in syscall () from /lib64/tls/libc.so.6
#1  0x0000002a95a1e485 in epoll_wait (epfd=3, events=0x50b330, maxevents=1023, timeout=1000) at epoll_sub.c:61
#2  0x0000002a95a1e7f7 in epoll_dispatch (base=0x506c30, arg=0x506910, tv=0x7fbfffe840) at epoll.c:210
#3  0x0000002a95a1c057 in opal_event_base_loop (base=0x506c30, flags=5) at event.c:779
#4  0x0000002a95a1be8f in opal_event_loop (flags=5) at event.c:702
#5  0x0000002a95a0bef8 in opal_progress () at runtime/opal_progress.c:169
#6  0x0000002a958b9e48 in orte_grpcomm_base_allgather (sbuf=0x7fbfffeae0, rbuf=0x7fbfffea80) at base/grpcomm_base_allgather.c:238
#7  0x0000002a958bd37c in orte_grpcomm_base_modex (procs=0x0) at base/grpcomm_base_modex.c:413
#8  0x0000002a956b8416 in ompi_mpi_init (argc=3, argv=0x7fbfffed58, requested=0, provided=0x7fbfffec38) at runtime/ompi_mpi_init.c:510
#9  0x0000002a956f2109 in PMPI_Init (argc=0x7fbfffec7c, argv=0x7fbfffec70) at pinit.c:88
#10 0x0000000000400bf4 in main (argc=3, argv=0x7fbfffed58) at noop.c:39
------------------------------------------------

The 'noop' process on odin024 has a similar backtrace:
------------------------------------------------
(gdb) bt
#0  0x0000002a96226b39 in syscall () from /lib64/tls/libc.so.6
#1  0x0000002a95a1e485 in epoll_wait (epfd=3, events=0x50b390, maxevents=1023, timeout=1000) at epoll_sub.c:61
#2  0x0000002a95a1e7f7 in epoll_dispatch (base=0x506cc0, arg=0x506c20, tv=0x7fbfffe9d0) at epoll.c:210
#3  0x0000002a95a1c057 in opal_event_base_loop (base=0x506cc0, flags=5) at event.c:779
#4  0x0000002a95a1be8f in opal_event_loop (flags=5) at event.c:702
#5  0x0000002a95a0bef8 in opal_progress () at runtime/opal_progress.c:169
#6  0x0000002a958b97c5 in orte_grpcomm_base_allgather (sbuf=0x7fbfffec70, rbuf=0x7fbfffec10) at base/grpcomm_base_allgather.c:163
#7  0x0000002a958bd37c in orte_grpcomm_base_modex (procs=0x0) at base/grpcomm_base_modex.c:413
#8  0x0000002a956b8416 in ompi_mpi_init (argc=3, argv=0x7fbfffeee8, requested=0, provided=0x7fbfffedc8) at runtime/ompi_mpi_init.c:510
#9  0x0000002a956f2109 in PMPI_Init (argc=0x7fbfffee0c, argv=0x7fbfffee00) at pinit.c:88
#10 0x0000000000400bf4 in main (argc=3, argv=0x7fbfffeee8) at noop.c:39
------------------------------------------------


Cheers,
Josh



