Okay - fixed with r18040

Thanks
Ralph


On 3/31/08 11:01 AM, "Josh Hursey" <jjhur...@open-mpi.org> wrote:

> 
> On Mar 31, 2008, at 12:57 PM, Ralph H Castain wrote:
> 
>> 
>> 
>> 
>> On 3/31/08 9:28 AM, "Josh Hursey" <jjhur...@open-mpi.org> wrote:
>> 
>>> At the moment I only use unity with C/R, mostly because I have not
>>> verified that the other components work properly under C/R
>>> conditions. I can verify the others, but that doesn't solve the
>>> problem with the unity component. :/
>>> 
>>> For the moment it is not critical that these jobs launch quickly,
>>> only that they launch correctly. When you say 'slow the launch', do
>>> you mean severely - as in seconds or minutes for small np?
>> 
>> I didn't say "severely" - I said "measurably". ;-)
>> 
>> It will require an additional communication to the daemons to let them
>> know how to talk to the procs. In the current unity component, the
>> daemons never talk to the procs themselves, and so they don't know
>> contact info for rank=0.
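
A rough illustration of that point - a toy, self-contained C sketch, not
the actual ORTE code, with the table and helper names invented for the
example: the daemon cannot forward anything to rank=0 until it is told the
proc's contact URI, which is exactly the extra launch-time message being
discussed.
------------------------------------------------
/* Toy sketch only - NOT the ORTE API.  proc_contact_table,
 * push_proc_contacts and send_modex_to_rank0 are invented names. */
#include <stdio.h>
#include <string.h>

#define MAX_PROCS 16

/* One contact URI per rank; "" means the daemon has never heard of it. */
static char proc_contact_table[MAX_PROCS][64];

/* The "additional communication": mpirun tells the daemon how to reach
 * each proc. */
static void push_proc_contacts(const char *uris[], int n)
{
    for (int i = 0; i < n && i < MAX_PROCS; i++)
        strncpy(proc_contact_table[i], uris[i],
                sizeof(proc_contact_table[i]) - 1);
}

/* The daemon tries to forward its collected modex to the rank=0 proc. */
static int send_modex_to_rank0(void)
{
    if (proc_contact_table[0][0] == '\0') {
        printf("daemon: no contact info for rank=0, cannot forward modex\n");
        return -1;   /* today's unity situation */
    }
    printf("daemon: opening socket to rank=0 at %s\n", proc_contact_table[0]);
    return 0;
}

int main(void)
{
    const char *uris[] = { "tcp://odin023:4000", "tcp://odin024:4001" };

    send_modex_to_rank0();        /* fails: unity daemons never talk to procs */
    push_proc_contacts(uris, 2);  /* the extra launch-time update */
    send_modex_to_rank0();        /* now succeeds, at the cost of that step */
    return 0;
}
------------------------------------------------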
> 
> ah I see.
> 
>> 
>> 
>>> I guess a follow-up question is: why did this component break in the
>>> first place? Or, worded differently, what changed in ORTE such that
>>> the unity component now deadlocks when it didn't before?
>> 
>> We are trying to improve scalability. The biggest issue is the modex,
>> which we improved considerably by having the procs pass their modex
>> info to the daemons, letting each daemon collect the modex info from
>> the procs on its node, and then having the daemons send that info
>> along to the rank=0 proc for collection and xcast.
>> 
>> The problem is that in the unity component, the local daemons don't
>> know how to send the modex to the rank=0 proc. So what I will now have
>> to do is tell all the daemons how to talk to the procs, and then we
>> will have every daemon opening a socket to rank=0. That's where the
>> time will be lost.
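
For concreteness, a small stand-alone sketch of that flow - plain C, not
ORTE code, with the node and proc counts assumed purely for illustration.
It just counts the messages in each step and shows how the number of
daemon sockets opened to rank=0 grows with the node count, which is where
the launch time goes under unity.
------------------------------------------------
#include <stdio.h>

int main(void)
{
    /* Assumed job sizes, purely for illustration. */
    int node_counts[] = { 4, 64, 1024 };
    int procs_per_node = 8;

    for (int i = 0; i < 3; i++) {
        int nodes  = node_counts[i];
        int nprocs = nodes * procs_per_node;

        /* Step 1: each proc hands its modex info to the daemon on its
         * node (cheap, stays on-node). */
        int on_node_msgs = nprocs;

        /* Step 2: each daemon forwards its node's collected modex to the
         * rank=0 proc.  Under unity this means one new socket per daemon. */
        int sockets_to_rank0 = nodes;

        /* Step 3: rank=0 assembles everything and xcasts it back out. */
        printf("%5d procs: %5d on-node messages, %5d daemon sockets to rank=0, 1 xcast\n",
               nprocs, on_node_msgs, sockets_to_rank0);
    }
    return 0;
}
------------------------------------------------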
>> 
>> Our original expectation was to get everyone off of unity as quickly
>> as possible - in fact, Brian and I had planned to remove that
>> component entirely, as it (a) scales poorly and (b) gets in the way of
>> things. It is very hard to keep alive.
>> 
>> So for now, I'll just do the simple thing and hopefully that will be
>> adequate - let me know if/when you are able to get C/R working on
>> other routed components.
> 
> Sounds good. I'll look into supporting the tree routed component, but
> that will probably take a couple weeks.
> 
> Thanks for the clarification.
> 
> Cheers,
> Josh
> 
>> 
>> 
>> Thanks!
>> Ralph
>> 
>>> 
>>> Thanks for looking into this,
>>> Josh
>>> 
>>> On Mar 31, 2008, at 11:10 AM, Ralph H Castain wrote:
>>> 
>>>> I figured out the issue - there is a simple and a hard way to fix
>>>> this. So before I do, let me see what makes sense.
>>>> 
>>>> The simple solution involves updating the daemons with contact info
>>>> for the procs so that they can send their collected modex info to
>>>> the rank=0 proc. This will measurably slow the launch when using
>>>> unity.
>>>> 
>>>> The hard solution is to do a hybrid routed approach whereby the
>>>> daemons would route any daemon-to-proc communication while the procs
>>>> continue to do direct proc-to-proc messaging.
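
A sketch of what that hybrid might look like - illustrative C only, not
the real routed framework API; endpoint_t, route_via_tree and route_direct
are invented for the example. Daemon-to-proc traffic gets relayed through
the daemon tree, while proc-to-proc traffic keeps unity's direct wiring.
------------------------------------------------
/* Illustrative only - NOT the real routed framework API. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { bool is_daemon; int id; } endpoint_t;

static void route_via_tree(endpoint_t dst)
{
    printf("relay message to %d through the daemon tree\n", dst.id);
}

static void route_direct(endpoint_t dst)
{
    printf("send message directly to %d\n", dst.id);
}

/* Pick a path the way the hybrid component might. */
static void route(endpoint_t src, endpoint_t dst)
{
    if (src.is_daemon && !dst.is_daemon)
        route_via_tree(dst);  /* daemon-to-proc: no direct contact info needed */
    else
        route_direct(dst);    /* proc-to-proc keeps the direct unity wiring */
}

int main(void)
{
    endpoint_t daemon = { true, 100 }, rank0 = { false, 0 }, rank1 = { false, 1 };

    route(daemon, rank0);   /* modex forwarding goes through the tree */
    route(rank0, rank1);    /* application-level messages stay direct */
    return 0;
}
------------------------------------------------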
>>>> 
>>>> Is there some reason to be using the "unity" component? Do you care
>>>> if jobs using unity launch slower?
>>>> 
>>>> Thanks
>>>> Ralph
>>>> 
>>>> 
>>>> 
>>>> On 3/31/08 7:57 AM, "Josh Hursey" <jjhur...@open-mpi.org> wrote:
>>>> 
>>>>> Ralph,
>>>>> 
>>>>> I've just noticed that the 'unity' routed component seems to be
>>>>> broken when using more than one machine. I'm using Odin and r18028
>>>>> of the trunk, and have confirmed that this problem occurs with both
>>>>> SLURM and rsh. I think this break came in on Friday, as that is
>>>>> when some of my MTT tests started to hang and fail, but I cannot
>>>>> point to a specific revision at this point. The backtraces
>>>>> (enclosed) of the processes point to the grpcomm allgather routine.
>>>>> 
>>>>> The 'noop' program calls MPI_Init, sleeps, then calls MPI_Finalize.
>>>>> 
>>>>> RSH example from odin023 - so no SLURM variables:
>>>>> These work:
>>>>> shell$ mpirun -np 2 -host odin023  noop -v 1
>>>>> shell$ mpirun -np 2 -host odin023,odin024  noop -v 1
>>>>> shell$ mpirun -np 2 -mca routed unity -host odin023  noop -v 1
>>>>> 
>>>>> This hangs:
>>>>> shell$ mpirun -np 2 -mca routed unity -host odin023,odin024  noop -v 1
>>>>> 
>>>>> 
>>>>> If I attach to the 'noop' process on odin023 I get the following
>>>>> backtrace:
>>>>> ------------------------------------------------
>>>>> (gdb) bt
>>>>> #0  0x0000002a96226b39 in syscall () from /lib64/tls/libc.so.6
>>>>> #1  0x0000002a95a1e485 in epoll_wait (epfd=3, events=0x50b330,
>>>>>     maxevents=1023, timeout=1000) at epoll_sub.c:61
>>>>> #2  0x0000002a95a1e7f7 in epoll_dispatch (base=0x506c30, arg=0x506910,
>>>>>     tv=0x7fbfffe840) at epoll.c:210
>>>>> #3  0x0000002a95a1c057 in opal_event_base_loop (base=0x506c30,
>>>>>     flags=5) at event.c:779
>>>>> #4  0x0000002a95a1be8f in opal_event_loop (flags=5) at event.c:702
>>>>> #5  0x0000002a95a0bef8 in opal_progress () at runtime/opal_progress.c:169
>>>>> #6  0x0000002a958b9e48 in orte_grpcomm_base_allgather (sbuf=0x7fbfffeae0,
>>>>>     rbuf=0x7fbfffea80) at base/grpcomm_base_allgather.c:238
>>>>> #7  0x0000002a958bd37c in orte_grpcomm_base_modex (procs=0x0)
>>>>>     at base/grpcomm_base_modex.c:413
>>>>> #8  0x0000002a956b8416 in ompi_mpi_init (argc=3, argv=0x7fbfffed58,
>>>>>     requested=0, provided=0x7fbfffec38) at runtime/ompi_mpi_init.c:510
>>>>> #9  0x0000002a956f2109 in PMPI_Init (argc=0x7fbfffec7c,
>>>>>     argv=0x7fbfffec70) at pinit.c:88
>>>>> #10 0x0000000000400bf4 in main (argc=3, argv=0x7fbfffed58) at noop.c:39
>>>>> ------------------------------------------------
>>>>> 
>>>>> The 'noop' process on odin024 has a similar backtrace:
>>>>> ------------------------------------------------
>>>>> (gdb) bt
>>>>> #0  0x0000002a96226b39 in syscall () from /lib64/tls/libc.so.6
>>>>> #1  0x0000002a95a1e485 in epoll_wait (epfd=3, events=0x50b390,
>>>>>     maxevents=1023, timeout=1000) at epoll_sub.c:61
>>>>> #2  0x0000002a95a1e7f7 in epoll_dispatch (base=0x506cc0, arg=0x506c20,
>>>>>     tv=0x7fbfffe9d0) at epoll.c:210
>>>>> #3  0x0000002a95a1c057 in opal_event_base_loop (base=0x506cc0,
>>>>>     flags=5) at event.c:779
>>>>> #4  0x0000002a95a1be8f in opal_event_loop (flags=5) at event.c:702
>>>>> #5  0x0000002a95a0bef8 in opal_progress () at runtime/opal_progress.c:169
>>>>> #6  0x0000002a958b97c5 in orte_grpcomm_base_allgather (sbuf=0x7fbfffec70,
>>>>>     rbuf=0x7fbfffec10) at base/grpcomm_base_allgather.c:163
>>>>> #7  0x0000002a958bd37c in orte_grpcomm_base_modex (procs=0x0)
>>>>>     at base/grpcomm_base_modex.c:413
>>>>> #8  0x0000002a956b8416 in ompi_mpi_init (argc=3, argv=0x7fbfffeee8,
>>>>>     requested=0, provided=0x7fbfffedc8) at runtime/ompi_mpi_init.c:510
>>>>> #9  0x0000002a956f2109 in PMPI_Init (argc=0x7fbfffee0c,
>>>>>     argv=0x7fbfffee00) at pinit.c:88
>>>>> #10 0x0000000000400bf4 in main (argc=3, argv=0x7fbfffeee8) at noop.c:39
>>>>> ------------------------------------------------
>>>>> 
>>>>> 
>>>>> Cheers,
>>>>> Josh
>>>> 
>>> 
>> 
> 

