Okay - fixed with r18040

Thanks
Ralph
On 3/31/08 11:01 AM, "Josh Hursey" <jjhur...@open-mpi.org> wrote: > > On Mar 31, 2008, at 12:57 PM, Ralph H Castain wrote: > >> >> >> >> On 3/31/08 9:28 AM, "Josh Hursey" <jjhur...@open-mpi.org> wrote: >> >>> At the moment I only use unity with C/R. Mostly because I have not >>> verified that the other components work properly under the C/R >>> conditions. I can verify others, but that doesn't solve the problem >>> with the unity component. :/ >>> >>> It is not critical that these jobs launch quickly, but that they >>> launch correctly for the moment. When you say 'slow the launch' are >>> you talking severely as in seconds/minutes for small nps? >> >> I didn't say "severely" - I said "measurably". ;-) >> >> It will require an additional communication to the daemons to let >> them know >> how to talk to the procs. In the current unity component, the >> daemons never >> talk to the procs themselves, and so they don't know contact info for >> rank=0. > > ah I see. > >> >> >>> I guess a >>> followup question is why did this component break in the first place? >>> or worded differently, what changed in ORTE such that the unity >>> component will suddenly deadlock when it didn't before? >> >> We are trying to improve scalability. Biggest issue is the modex, >> which we >> improved considerably by having the procs pass the modex info to the >> daemons, letting the daemons collect all modex info from procs on >> their >> node, and then having the daemons send that info along to the rank=0 >> proc >> for collection and xcast. >> >> Problem is that in the unity component, the local daemons don't know >> how to >> send the modex to the rank=0 proc. So what I will now have to do is >> tell all >> the daemons how to talk to the procs, and then we will have every >> daemon >> opening a socket to rank=0. That's where the time will be lost. >> >> Our original expectation was to get everyone off of unity as quickly >> as >> possible - in fact, Brian and I had planned to completely remove that >> component as quickly as possible as it (a) scales ugly and (b) gets >> in the >> way of things. Very hard to keep it alive. >> >> So for now, I'll just do the simple thing and hopefully that will be >> adequate - let me know if/when you are able to get C/R working on >> other >> routed components. > > Sounds good. I'll look into supporting the tree routed component, but > that will probably take a couple weeks. > > Thanks for the clarification. > > Cheers, > Josh > >> >> >> Thanks! >> Ralph >> >>> >>> Thanks for looking into this, >>> Josh >>> >>> On Mar 31, 2008, at 11:10 AM, Ralph H Castain wrote: >>> >>>> I figured out the issue - there is a simple and a hard way to fix >>>> this. So >>>> before I do, let me see what makes sense. >>>> >>>> The simple solution involves updating the daemons with contact info >>>> for the >>>> procs so that they can send their collected modex info to the rank=0 >>>> proc. >>>> This will measurably slow the launch when using unity. >>>> >>>> The hard solution is to do a hybrid routed approach whereby the >>>> daemons >>>> would route any daemon-to-proc communication while the procs >>>> continue to do >>>> direct proc-to-proc messaging. >>>> >>>> Is there some reason to be using the "unity" component? Do you care >>>> if jobs >>>> using unity launch slower? 
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>>
>>>> On 3/31/08 7:57 AM, "Josh Hursey" <jjhur...@open-mpi.org> wrote:
>>>>
>>>>> Ralph,
>>>>>
>>>>> I've just noticed that the 'unity' routed component seems to be
>>>>> broken when using more than one machine. I'm using Odin and r18028
>>>>> of the trunk, and have confirmed that this problem occurs with
>>>>> both SLURM and rsh. I think this break came in on Friday, as that
>>>>> is when some of my MTT tests started to hang and fail, but I
>>>>> cannot point to a specific revision at this point. The backtraces
>>>>> (enclosed) of the processes point to the grpcomm allgather routine.
>>>>>
>>>>> The 'noop' program calls MPI_Init, sleeps, then calls MPI_Finalize
>>>>> (a sketch of such a program is included after the quoted thread
>>>>> below).
>>>>>
>>>>> RSH example from odin023 - so no SLURM variables.
>>>>>
>>>>> These work:
>>>>> shell$ mpirun -np 2 -host odin023 noop -v 1
>>>>> shell$ mpirun -np 2 -host odin023,odin024 noop -v 1
>>>>> shell$ mpirun -np 2 -mca routed unity -host odin023 noop -v 1
>>>>>
>>>>> This hangs:
>>>>> shell$ mpirun -np 2 -mca routed unity -host odin023,odin024 noop -v 1
>>>>>
>>>>> If I attach to the 'noop' process on odin023 I get the following
>>>>> backtrace:
>>>>> ------------------------------------------------
>>>>> (gdb) bt
>>>>> #0  0x0000002a96226b39 in syscall () from /lib64/tls/libc.so.6
>>>>> #1  0x0000002a95a1e485 in epoll_wait (epfd=3, events=0x50b330,
>>>>>     maxevents=1023, timeout=1000) at epoll_sub.c:61
>>>>> #2  0x0000002a95a1e7f7 in epoll_dispatch (base=0x506c30, arg=0x506910,
>>>>>     tv=0x7fbfffe840) at epoll.c:210
>>>>> #3  0x0000002a95a1c057 in opal_event_base_loop (base=0x506c30,
>>>>>     flags=5) at event.c:779
>>>>> #4  0x0000002a95a1be8f in opal_event_loop (flags=5) at event.c:702
>>>>> #5  0x0000002a95a0bef8 in opal_progress () at runtime/opal_progress.c:169
>>>>> #6  0x0000002a958b9e48 in orte_grpcomm_base_allgather (sbuf=0x7fbfffeae0,
>>>>>     rbuf=0x7fbfffea80) at base/grpcomm_base_allgather.c:238
>>>>> #7  0x0000002a958bd37c in orte_grpcomm_base_modex (procs=0x0)
>>>>>     at base/grpcomm_base_modex.c:413
>>>>> #8  0x0000002a956b8416 in ompi_mpi_init (argc=3, argv=0x7fbfffed58,
>>>>>     requested=0, provided=0x7fbfffec38) at runtime/ompi_mpi_init.c:510
>>>>> #9  0x0000002a956f2109 in PMPI_Init (argc=0x7fbfffec7c,
>>>>>     argv=0x7fbfffec70) at pinit.c:88
>>>>> #10 0x0000000000400bf4 in main (argc=3, argv=0x7fbfffed58) at noop.c:39
>>>>> ------------------------------------------------
>>>>>
>>>>> The 'noop' process on odin024 has a similar backtrace:
>>>>> ------------------------------------------------
>>>>> (gdb) bt
>>>>> #0  0x0000002a96226b39 in syscall () from /lib64/tls/libc.so.6
>>>>> #1  0x0000002a95a1e485 in epoll_wait (epfd=3, events=0x50b390,
>>>>>     maxevents=1023, timeout=1000) at epoll_sub.c:61
>>>>> #2  0x0000002a95a1e7f7 in epoll_dispatch (base=0x506cc0, arg=0x506c20,
>>>>>     tv=0x7fbfffe9d0) at epoll.c:210
>>>>> #3  0x0000002a95a1c057 in opal_event_base_loop (base=0x506cc0,
>>>>>     flags=5) at event.c:779
>>>>> #4  0x0000002a95a1be8f in opal_event_loop (flags=5) at event.c:702
>>>>> #5  0x0000002a95a0bef8 in opal_progress () at runtime/opal_progress.c:169
>>>>> #6  0x0000002a958b97c5 in orte_grpcomm_base_allgather (sbuf=0x7fbfffec70,
>>>>>     rbuf=0x7fbfffec10) at base/grpcomm_base_allgather.c:163
>>>>> #7  0x0000002a958bd37c in orte_grpcomm_base_modex (procs=0x0)
>>>>>     at base/grpcomm_base_modex.c:413
>>>>> #8  0x0000002a956b8416 in ompi_mpi_init (argc=3, argv=0x7fbfffeee8,
>>>>>     requested=0, provided=0x7fbfffedc8) at runtime/ompi_mpi_init.c:510
>>>>> #9  0x0000002a956f2109 in PMPI_Init (argc=0x7fbfffee0c,
>>>>>     argv=0x7fbfffee00) at pinit.c:88
>>>>> #10 0x0000000000400bf4 in main (argc=3, argv=0x7fbfffeee8) at noop.c:39
>>>>> ------------------------------------------------
>>>>>
>>>>> Cheers,
>>>>> Josh
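
For reference, below is a minimal sketch of a 'noop' test program like the
one described in the report above: it calls MPI_Init, sleeps, then calls
MPI_Finalize. This is an illustrative reconstruction, not the actual noop.c
from the report; the "-v <seconds>" sleep option is an assumption based on
the mpirun command lines shown.

------------------------------------------------
/* noop.c (sketch): MPI_Init, sleep, MPI_Finalize. */
#include <stdlib.h>     /* atoi */
#include <string.h>     /* strcmp */
#include <unistd.h>     /* sleep */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int sleep_sec = 1;
    int i;

    /* MPI_Init is where the reported hang occurs: the modex/allgather
     * inside ompi_mpi_init never completes with "-mca routed unity"
     * across multiple nodes (see the backtraces above). */
    MPI_Init(&argc, &argv);

    /* Assumed option: "-v N" sets the sleep time in seconds. */
    for (i = 1; i < argc - 1; ++i) {
        if (strcmp(argv[i], "-v") == 0) {
            sleep_sec = atoi(argv[i + 1]);
        }
    }

    sleep(sleep_sec);

    MPI_Finalize();
    return 0;
}
------------------------------------------------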