Interesting. The only difference I see is FC11 - I haven't seen anyone running on that OS yet. I wonder if that is the source of the trouble. Do we know that our code works on it? I know we had problems in the past with FC9, for example, that required fixes.

Also, what compiler are you using? I wonder if there is some optimization issue here, or some weird interaction between FC11 and the compiler.

On Nov 30, 2009, at 8:48 AM, Sylvain Jeaugey wrote:

> Hi Ralph,
>
> I'm also puzzled :-)
>
> Here is what I did today:
> * download the latest nightly build (openmpi-1.7a1r22241)
> * untar it
> * patch it with my "ORTE_RELAY_DELAY" patch
> * build it directly on the cluster (running FC11) with:
>   ./configure --platform=contrib/platform/lanl/tlcc/debug-nopanasas --prefix=<some path in my home>
>   make && make install
> * deactivate oob_tcp_if_include=ib0 in openmpi-mca-params.conf (IPoIB is broken on my machine) and run with:
>   salloc -N 10 mpirun ./helloworld
>
> And... still the same behaviour: OK by default, deadlock with the typical stack when setting ORTE_RELAY_DELAY to 1.
>
> About my previous e-mail, I was wrong about all components having a 0 priority: it was based on default parameters reported by "ompi_info -a | grep routed". It seems that the truth is not always in ompi_info...
>
> Sylvain
>
> On Fri, 27 Nov 2009, Ralph Castain wrote:
>
>> On Nov 27, 2009, at 8:23 AM, Sylvain Jeaugey wrote:
>>
>>> Hi Ralph,
>>>
>>> I tried with the trunk and it makes no difference for me.
>>
>> Strange
>>
>>> Looking at potential differences, I found out something strange. The bug may have something to do with the "routed" framework. I can reproduce the bug with binomial and direct, but not with cm and linear (you disabled the build of the latter in your configure options -- why?).
>>
>> You won't with cm because there is no relay. Likewise, direct doesn't have a relay - so I'm really puzzled how you can see this behavior when using the direct component???
>>
>> I disable components in my build to save memory. Every component we open costs us memory that may or may not be recoverable during the course of execution.
>>
>>> Btw, all components have a 0 priority and none is defined to be the default component. Which one is the default then? Binomial (as the first in alphabetical order)?
>>
>> I believe you must have a severely corrupted version of the code. The binomial component has priority 70, so it will be selected as the default.
>>
>> Linear has priority 40, though it will only be selected if you say ^binomial.
>>
>> CM and radix have special selection code in them so they will only be selected when specified.
>>
>> Direct and slave have priority 0 to ensure they will only be selected when specified.
>>
>>> Can you check which one you are using and try with binomial explicitly chosen?
>>
>> I am using binomial for all my tests.
>>
>> From what you are describing, I think you either have a corrupted copy of the code, are picking up mis-matched versions, or something strange, as your experiences don't match what anyone else is seeing.
>>
>> Remember, the phase you are discussing here has nothing to do with the native launch environment. This is dealing with the relative timing of the application launch versus relaying the launch message itself - i.e., the daemons are already up and running before any of this starts. Thus, this "problem" has nothing to do with how we launch the daemons. So, if it truly were a problem in the code, we would see it on every environment - torque, slurm, ssh, etc.
>>
>> We routinely launch jobs spanning hundreds to thousands of nodes without problem.
>> If this timing problem were as you have identified, then we would see this constantly. Yet nobody is seeing it, and I cannot reproduce it even with your reproducer.
>>
>> I honestly don't know what to suggest at this point. Any chance you are picking up mis-matched OMPI versions on your backend nodes or something? Tried fresh checkouts of the code? Is this a code base you have modified, or are you seeing this with the "stock" code from the repo?
>>
>> Just fishing at this point - can't find anything wrong! :-/
>> Ralph
>>
>>> Thanks for your time,
>>> Sylvain
>>>
>>> On Thu, 26 Nov 2009, Ralph Castain wrote:
>>>
>>>> Hi Sylvain
>>>>
>>>> Well, I hate to tell you this, but I cannot reproduce the "bug" even with this code in ORTE no matter what value of ORTE_RELAY_DELAY I use. The system runs really slow as I increase the delay, but it completes the job just fine. I ran jobs across 16 nodes on a slurm machine, 1-4 ppn, a "hello world" app that calls MPI_Init immediately upon execution.
>>>>
>>>> So I have to conclude this is a problem in your setup/config. Are you sure you didn't --enable-progress-threads?? That is the only way I can recreate this behavior.
>>>>
>>>> I plan to modify the relay/message processing method anyway to clean it up. But there doesn't appear to be anything wrong with the current code.
>>>> Ralph
>>>>
>>>> On Nov 20, 2009, at 6:55 AM, Sylvain Jeaugey wrote:
>>>>
>>>>> Hi Ralph,
>>>>>
>>>>> Thanks for your efforts. I will look at our configuration and see how it may differ from yours.
>>>>>
>>>>> Here is a patch which helps reproduce the bug even with a small number of nodes.
>>>>>
>>>>> diff -r b622b9e8f1ac orte/orted/orted_comm.c
>>>>> --- a/orte/orted/orted_comm.c Wed Nov 18 09:27:55 2009 +0100
>>>>> +++ b/orte/orted/orted_comm.c Fri Nov 20 14:47:39 2009 +0100
>>>>> @@ -126,6 +126,13 @@
>>>>>  ORTE_ERROR_LOG(ret);
>>>>>  goto CLEANUP;
>>>>>  }
>>>>> + { /* Add delay to reproduce bug */
>>>>> + char * str = getenv("ORTE_RELAY_DELAY");
>>>>> + int sec = str ? atoi(str) : 0;
>>>>> + if (sec) {
>>>>> + sleep(sec);
>>>>> + }
>>>>> + }
>>>>>  }
>>>>>
>>>>>  CLEANUP:
>>>>>
>>>>> Just set ORTE_RELAY_DELAY to 1 (second) and you should reproduce the bug.
>>>>>
>>>>> During our experiments, the bug disappeared when we added a delay before calling MPI_Init. So, configurations where processes are launched slowly or take some time before MPI_Init should be immune to this bug.
>>>>>
>>>>> We usually reproduce the bug with one ppn (faster to spawn).
>>>>>
>>>>> Sylvain
>>>>>
>>>>> On Thu, 19 Nov 2009, Ralph Castain wrote:
>>>>>
>>>>>> Hi Sylvain
>>>>>>
>>>>>> I've spent several hours trying to replicate the behavior you described on clusters up to a couple of hundred nodes (all running slurm), without success. I'm becoming increasingly convinced that this is a configuration issue as opposed to a code issue.
>>>>>>
>>>>>> I have enclosed the platform file I use below. Could you compare it to your configuration? I'm wondering if there is something critical about the config that may be causing the problem (perhaps we have a problem in our default configuration).
>>>>>>
>>>>>> Also, is there anything else you can tell us about your configuration? How many ppn triggers it, or do you always get the behavior every time you launch over a certain number of nodes?
>>>>>>
>>>>>> Meantime, I will look into this further.
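
For anyone trying to reproduce this from the archive: a minimal test program along the lines described above would look like the sketch below - a hello world that calls MPI_Init right away, with an optional delay in front of MPI_Init to check the "slow start hides the bug" observation. The MPIINIT_DELAY variable name is invented for this example; it is not an OMPI parameter.

/* helloworld.c - minimal reproducer sketch: calls MPI_Init right away.
 * MPIINIT_DELAY is a name invented for this example (not an OMPI knob);
 * setting it to a number of seconds delays MPI_Init, which reportedly
 * makes the hang disappear. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *str = getenv("MPIINIT_DELAY");
    int sec = str ? atoi(str) : 0;
    if (sec > 0) {
        sleep(sec);   /* delaying here was observed to hide the deadlock */
    }

    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}

Built with mpicc and launched as above (salloc -N 10 mpirun ./helloworld); with ORTE_RELAY_DELAY=1 set for the daemons and no delay here, the hang reportedly appears, and it goes away once a delay is added before MPI_Init.
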
>>>>>> I am going to introduce a "slow down" param that will force the situation you encountered - i.e., will ensure that the relay is still being sent when the daemon receives the first collective input. We can then use that to try and force replication of the behavior you are encountering.
>>>>>>
>>>>>> Thanks
>>>>>> Ralph
>>>>>>
>>>>>> enable_dlopen=no
>>>>>> enable_pty_support=no
>>>>>> with_blcr=no
>>>>>> with_openib=yes
>>>>>> with_memory_manager=no
>>>>>> enable_mem_debug=yes
>>>>>> enable_mem_profile=no
>>>>>> enable_debug_symbols=yes
>>>>>> enable_binaries=yes
>>>>>> with_devel_headers=yes
>>>>>> enable_heterogeneous=no
>>>>>> enable_picky=yes
>>>>>> enable_debug=yes
>>>>>> enable_shared=yes
>>>>>> enable_static=yes
>>>>>> with_slurm=yes
>>>>>> enable_contrib_no_build=libnbc,vt
>>>>>> enable_visibility=yes
>>>>>> enable_memchecker=no
>>>>>> enable_ipv6=no
>>>>>> enable_mpi_f77=no
>>>>>> enable_mpi_f90=no
>>>>>> enable_mpi_cxx=no
>>>>>> enable_mpi_cxx_seek=no
>>>>>> enable_mca_no_build=pml-dr,pml-crcp2,crcp
>>>>>> enable_io_romio=no
>>>>>>
>>>>>> On Nov 19, 2009, at 8:08 AM, Ralph Castain wrote:
>>>>>>
>>>>>>> On Nov 19, 2009, at 7:52 AM, Sylvain Jeaugey wrote:
>>>>>>>
>>>>>>>> Thank you Ralph for this precious help.
>>>>>>>>
>>>>>>>> I set up a quick-and-dirty patch basically postponing process_msg (hence daemon_collective) until the launch is done. In process_msg, I therefore requeue a process_msg handler and return.
>>>>>>>
>>>>>>> That is basically the idea I proposed, just done in a slightly different place.
>>>>>>>
>>>>>>>> In this "all-must-be-non-blocking-and-done-through-opal_progress" algorithm, I don't think that blocking calls like the one in daemon_collective should be allowed. This also applies to the blocking one in send_relay. [Well, actually, one is okay; two may lead to interlocking.]
>>>>>>>
>>>>>>> Well, that would be problematic - you will find "progressed_wait" used repeatedly in the code. Removing them all would take a -lot- of effort and a major rewrite. I'm not yet convinced it is required. There may be something strange in how you are set up, or your cluster - like I said, this is the first report of a problem we have had, and people with much bigger slurm clusters have been running this code every day for over a year.
>>>>>>>
>>>>>>>> If you have time to do a nicer patch, it would be great and I would be happy to test it. Otherwise, I will try to implement your idea properly next week (with my limited knowledge of orted).
>>>>>>>
>>>>>>> Either way is fine - I'll see if I can get to it.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Ralph
>>>>>>>
>>>>>>>> For the record, here is the patch I'm currently testing at large scale:
>>>>>>>>
>>>>>>>> diff -r ec68298b3169 -r b622b9e8f1ac orte/mca/grpcomm/bad/grpcomm_bad_module.c
>>>>>>>> --- a/orte/mca/grpcomm/bad/grpcomm_bad_module.c Mon Nov 09 13:29:16 2009 +0100
>>>>>>>> +++ b/orte/mca/grpcomm/bad/grpcomm_bad_module.c Wed Nov 18 09:27:55 2009 +0100
>>>>>>>> @@ -687,14 +687,6 @@
>>>>>>>>  opal_list_append(&orte_local_jobdata, &jobdat->super);
>>>>>>>>  }
>>>>>>>>
>>>>>>>> - /* it may be possible to get here prior to having actually finished processing our
>>>>>>>> - * local launch msg due to the race condition between different nodes and when
>>>>>>>> - * they start their individual procs. Hence, we have to first ensure that we
>>>>>>>> - * -have- finished processing the launch msg, or else we won't know whether
>>>>>>>> - * or not to wait before sending this on
>>>>>>>> - */
>>>>>>>> - ORTE_PROGRESSED_WAIT(jobdat->launch_msg_processed, 0, 1);
>>>>>>>> -
>>>>>>>>  /* unpack the collective type */
>>>>>>>>  n = 1;
>>>>>>>>  if (ORTE_SUCCESS != (rc = opal_dss.unpack(data, &jobdat->collective_type, &n, ORTE_GRPCOMM_COLL_T))) {
>>>>>>>> @@ -894,6 +886,28 @@
>>>>>>>>
>>>>>>>>  proc = &mev->sender;
>>>>>>>>  buf = mev->buffer;
>>>>>>>> +
>>>>>>>> + jobdat = NULL;
>>>>>>>> + for (item = opal_list_get_first(&orte_local_jobdata);
>>>>>>>> +      item != opal_list_get_end(&orte_local_jobdata);
>>>>>>>> +      item = opal_list_get_next(item)) {
>>>>>>>> +     jobdat = (orte_odls_job_t*)item;
>>>>>>>> +
>>>>>>>> +     /* is this the specified job? */
>>>>>>>> +     if (jobdat->jobid == proc->jobid) {
>>>>>>>> +         break;
>>>>>>>> +     }
>>>>>>>> + }
>>>>>>>> + if (NULL == jobdat || jobdat->launch_msg_processed != 1) {
>>>>>>>> +     /* it may be possible to get here prior to having actually finished processing our
>>>>>>>> +      * local launch msg due to the race condition between different nodes and when
>>>>>>>> +      * they start their individual procs. Hence, we have to first ensure that we
>>>>>>>> +      * -have- finished processing the launch msg. Requeue this event until it is done.
>>>>>>>> +      */
>>>>>>>> +     int tag = mev->tag;
>>>>>>>> +     ORTE_MESSAGE_EVENT(proc, buf, tag, process_msg);
>>>>>>>> +     return;
>>>>>>>> + }
>>>>>>>>
>>>>>>>>  /* is the sender a local proc, or a daemon relaying the collective? */
>>>>>>>>  if (ORTE_PROC_MY_NAME->jobid == proc->jobid) {
>>>>>>>>
>>>>>>>> Sylvain
>>>>>>>>
>>>>>>>> On Thu, 19 Nov 2009, Ralph Castain wrote:
>>>>>>>>
>>>>>>>>> Very strange. As I said, we routinely launch jobs spanning several hundred nodes without problem. You can see the platform files for that setup in contrib/platform/lanl/tlcc.
>>>>>>>>>
>>>>>>>>> That said, it is always possible you are hitting some kind of race condition we don't hit. In looking at the code, one possibility would be to make all the communications flow through the daemon cmd processor in orte/orted_comm.c. This is the way it used to work until I reorganized the code a year ago for other reasons that never materialized.
>>>>>>>>>
>>>>>>>>> Unfortunately, the daemon collective has to wait until the local launch cmd has been completely processed so it can know whether or not to wait for contributions from local procs before sending along the collective message, so this kinda limits our options.
>>>>>>>>>
>>>>>>>>> About the only other thing you could do would be to not send the relay at all until -after- processing the local launch cmd. You can then remove the "wait" in the daemon collective as you will know how many local procs are involved, if any.
>>>>>>>>>
>>>>>>>>> I used to do it that way and it guarantees it will work. The negative is that we lose some launch speed as the next nodes in the tree don't get the launch message until this node finishes launching all its procs.
>>>>>>>>>
>>>>>>>>> The way around that, of course, would be to:
>>>>>>>>>
>>>>>>>>> 1. process the launch message, thus extracting the number of any local procs and setting up all data structures...but do -not- launch the procs at this time (as this is what takes all the time)
>>>>>>>>>
>>>>>>>>> 2. send the relay - the daemon collective can now proceed without a "wait" in it
>>>>>>>>>
>>>>>>>>> 3. now launch the local procs
>>>>>>>>>
>>>>>>>>> It would be a fairly simple reorganization of the code in the orte/mca/odls area. I can do it this weekend if you like, or you can do it - either way is fine, but if you do it, please contribute it back to the trunk.
>>>>>>>>>
>>>>>>>>> Ralph
>>>>>>>>>
>>>>>>>>> On Nov 19, 2009, at 1:39 AM, Sylvain Jeaugey wrote:
>>>>>>>>>
>>>>>>>>>> I would say I use the default settings, i.e. I don't set anything "special" at configure.
>>>>>>>>>>
>>>>>>>>>> I'm launching my processes with SLURM (salloc + mpirun).
>>>>>>>>>>
>>>>>>>>>> Sylvain
>>>>>>>>>>
>>>>>>>>>> On Wed, 18 Nov 2009, Ralph Castain wrote:
>>>>>>>>>>
>>>>>>>>>>> How did you configure OMPI?
>>>>>>>>>>>
>>>>>>>>>>> What launch mechanism are you using - ssh?
>>>>>>>>>>>
>>>>>>>>>>> On Nov 17, 2009, at 9:01 AM, Sylvain Jeaugey wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I don't think so, and I'm not doing it explicitly, at least. How do I know?
>>>>>>>>>>>>
>>>>>>>>>>>> Sylvain
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, 17 Nov 2009, Ralph Castain wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> We routinely launch across thousands of nodes without a problem...I have never seen it stick in this fashion.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Did you build and/or are you using ORTE threaded by any chance? If so, that definitely won't work.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We are currently experiencing problems at launch on the 1.5 branch on a relatively large number of nodes (at least 80). Some processes are not spawned and orted processes are deadlocked.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When MPI processes are calling MPI_Init before send_relay is complete, the send_relay function and the daemon_collective function are doing a nice interlock:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here is the scenario:
>>>>>>>>>>>>>>> send_relay
>>>>>>>>>>>>>> performs the send tree:
>>>>>>>>>>>>>>> orte_rml_oob_send_buffer
>>>>>>>>>>>>>>> orte_rml_oob_send
>>>>>>>>>>>>>>> opal_condition_wait
>>>>>>>>>>>>>> Waiting on completion from send, thus calling opal_progress():
>>>>>>>>>>>>>>> opal_progress()
>>>>>>>>>>>>>> But since a collective request arrived from the network, we entered:
>>>>>>>>>>>>>>> daemon_collective
>>>>>>>>>>>>>> However, daemon_collective is waiting for the job to be initialized (wait on jobdat->launch_msg_processed) before continuing, thus calling:
>>>>>>>>>>>>>>> opal_progress()
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> At this time, the send may complete, but since we will never go back to orte_rml_oob_send, we will never perform the launch (setting jobdat->launch_msg_processed to 1).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I may try to solve the bug (this is quite a top-priority problem for me), but maybe people who are more familiar with orted than I am may propose a nice and clean solution...
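
To make the recursion in the stacks below easier to follow, here is a stripped-down, single-threaded model of the pattern being described: an event handler that "waits" by re-entering the progress loop, while the only code able to release it sits further down the same stack. All names are invented for illustration - this is not the actual OPAL/ORTE code - and the iteration cap only exists so the model terminates instead of spinning forever.

/* progress_deadlock.c - toy model of the interlock described above.
 * Single thread, hand-rolled "progress" loop; all names are invented. */
#include <stdio.h>

static int launch_msg_processed = 0;  /* only set after send_relay() returns */
static int collective_pending   = 1;  /* a collective arrives mid-relay      */

static void daemon_collective(void);

static void progress(void)
{
    /* deliver whatever "events" are pending; here, just the collective */
    if (collective_pending) {
        collective_pending = 0;
        daemon_collective();
    }
}

static void daemon_collective(void)
{
    int spins = 0;
    printf("  daemon_collective: waiting for launch_msg_processed\n");
    while (!launch_msg_processed) {
        progress();              /* nothing reachable from here sets the flag */
        if (++spins > 3) {       /* cap so this model terminates; the real    */
            printf("  ... stuck: the flag is only set after send_relay "
                   "returns, which is blocked underneath us\n");
            return;              /* code would spin here forever              */
        }
    }
}

static void send_relay(void)
{
    printf("send_relay: blocking until the relay send completes\n");
    progress();   /* the incoming collective gets dispatched from in here */
    /* we only get back here once daemon_collective returns; in the real
     * code it never does, so the launch message is never marked processed */
}

int main(void)
{
    send_relay();
    launch_msg_processed = 1;   /* reachable only after the stack unwinds */
    return 0;
}

The gdb stacks below show this pattern three collectives deep, with opal_condition_wait in orte_rml_oob_send playing the role of the blocking send.
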
>>>>>>>>>>>>>> For those who like real (and complete) gdb stacks, here they are:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> #0  0x0000003b7fed4f38 in poll () from /lib64/libc.so.6
>>>>>>>>>>>>>> #1  0x00007fd0de5d861a in poll_dispatch (base=0x930230, arg=0x91a4b0, tv=0x7fff0d977880) at poll.c:167
>>>>>>>>>>>>>> #2  0x00007fd0de5d586f in opal_event_base_loop (base=0x930230, flags=1) at event.c:823
>>>>>>>>>>>>>> #3  0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>>>>>>>>> #4  0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>>>>>>>>>> #5  0x00007fd0dd340a02 in daemon_collective (sender=0x97af50, data=0x97b010) at grpcomm_bad_module.c:696
>>>>>>>>>>>>>> #6  0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97af20) at grpcomm_bad_module.c:901
>>>>>>>>>>>>>> #7  0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>>>>>>>>>> #8  0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
>>>>>>>>>>>>>> #9  0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>>>>>>>>> #10 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>>>>>>>>>> #11 0x00007fd0dd340a02 in daemon_collective (sender=0x979700, data=0x9676e0) at grpcomm_bad_module.c:696
>>>>>>>>>>>>>> #12 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x9796d0) at grpcomm_bad_module.c:901
>>>>>>>>>>>>>> #13 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>>>>>>>>>> #14 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
>>>>>>>>>>>>>> #15 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>>>>>>>>> #16 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>>>>>>>>>> #17 0x00007fd0dd340a02 in daemon_collective (sender=0x97b420, data=0x97b4e0) at grpcomm_bad_module.c:696
>>>>>>>>>>>>>> #18 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97b3f0) at grpcomm_bad_module.c:901
>>>>>>>>>>>>>> #19 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>>>>>>>>>> #20 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
>>>>>>>>>>>>>> #21 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>>>>>>>>> #22 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>>>>>>>>>> #23 0x00007fd0dd969a8a in opal_condition_wait (c=0x97b210, m=0x97b1a8) at ../../../../opal/threads/condition.h:99
>>>>>>>>>>>>>> #24 0x00007fd0dd96a4bf in orte_rml_oob_send (peer=0x7fff0d9785a0, iov=0x7fff0d978530, count=1, tag=1, flags=16) at rml_oob_send.c:153
>>>>>>>>>>>>>> #25 0x00007fd0dd96ac4d in orte_rml_oob_send_buffer (peer=0x7fff0d9785a0, buffer=0x7fff0d9786b0, tag=1, flags=0) at rml_oob_send.c:270
>>>>>>>>>>>>>> #26 0x00007fd0de86ed2a in send_relay (buf=0x7fff0d9786b0) at orted/orted_comm.c:127
>>>>>>>>>>>>>> #27 0x00007fd0de86f6de in orte_daemon_cmd_processor (fd=-1, opal_event=1, data=0x965fc0) at orted/orted_comm.c:308
>>>>>>>>>>>>>> #28 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>>>>>>>>>> #29 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=0) at event.c:839
>>>>>>>>>>>>>> #30 0x00007fd0de5d556b in opal_event_loop (flags=0) at event.c:746
>>>>>>>>>>>>>> #31 0x00007fd0de5d5418 in opal_event_dispatch () at event.c:682
>>>>>>>>>>>>>> #32 0x00007fd0de86e339 in orte_daemon (argc=19, argv=0x7fff0d979ca8) at orted/orted_main.c:769
>>>>>>>>>>>>>> #33 0x00000000004008e2 in main (argc=19, argv=0x7fff0d979ca8) at orted.c:62
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>>>>> Sylvain
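
For completeness, the reordering Ralph describes further up in the thread (process the launch message first, then send the relay, then launch the local procs) boils down to the sequence sketched below. The function and type names here are placeholders for illustration, not the real odls/orted entry points.

/* launch_order.c - illustration of the proposed daemon-side ordering.
 * Everything here is a stub with invented names; it only demonstrates the
 * sequence, not the real ORTE/odls code. */
#include <stdio.h>

typedef struct {
    int num_local_procs;
    int launch_msg_processed;
} jobdat_t;

/* step 1: decode the launch message and fill in the job data (cheap) */
static void setup_job_data_from_launch_msg(jobdat_t *jobdat)
{
    jobdat->num_local_procs = 4;        /* pretend this was unpacked */
    jobdat->launch_msg_processed = 1;   /* collectives need not wait anymore */
    printf("1. launch msg processed: expecting %d local procs\n",
           jobdat->num_local_procs);
}

/* step 2: forward the launch message to the next daemons in the tree */
static void send_relay(void)
{
    printf("2. relay sent: child daemons can start their own step 1\n");
}

/* step 3: the expensive part - actually fork/exec the local procs */
static void launch_local_procs(const jobdat_t *jobdat)
{
    printf("3. launching %d local procs\n", jobdat->num_local_procs);
}

int main(void)
{
    jobdat_t jobdat;
    setup_job_data_from_launch_msg(&jobdat);  /* before the relay ...         */
    send_relay();                             /* ... so no wait is needed ... */
    launch_local_procs(&jobdat);              /* ... and the slow part last   */
    return 0;
}

Doing step 2 before step 3 means the relay (and hence the children's own launches) is no longer delayed by the local fork/exec, while the daemon collective no longer has anything to wait for, since the local proc count is already known.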