Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm
Hi Ralph,

I tried with the trunk and it makes no difference for me.

Looking at potential differences, I found out something strange. The bug may have something to do with the "routed" framework. I can reproduce the bug with binomial and direct, but not with cm and linear (you disabled the build of the latter in your configure options -- why?). Btw, all components have a 0 priority and none is defined to be the default component. Which one is the default then? binomial (as the first in alphabetical order)? Can you check which one you are using and try with binomial explicitly chosen?

Thanks for your time,
Sylvain

On Thu, 26 Nov 2009, Ralph Castain wrote:

Hi Sylvain

Well, I hate to tell you this, but I cannot reproduce the "bug" even with this code in ORTE, no matter what value of ORTE_RELAY_DELAY I use. The system runs really slow as I increase the delay, but it completes the job just fine. I ran jobs across 16 nodes on a slurm machine, 1-4 ppn, a "hello world" app that calls MPI_Init immediately upon execution.

So I have to conclude this is a problem in your setup/config. Are you sure you didn't --enable-progress-threads?? That is the only way I can recreate this behavior.

I plan to modify the relay/message processing method anyway to clean it up. But there doesn't appear to be anything wrong with the current code.

Ralph

On Nov 20, 2009, at 6:55 AM, Sylvain Jeaugey wrote:

Hi Ralph,

Thanks for your efforts. I will look at our configuration and see how it may differ from yours.

Here is a patch which helps reproduce the bug even with a small number of nodes.

diff -r b622b9e8f1ac orte/orted/orted_comm.c
--- a/orte/orted/orted_comm.c Wed Nov 18 09:27:55 2009 +0100
+++ b/orte/orted/orted_comm.c Fri Nov 20 14:47:39 2009 +0100
@@ -126,6 +126,13 @@
         ORTE_ERROR_LOG(ret);
         goto CLEANUP;
     }
+{ /* Add delay to reproduce bug */
+char * str = getenv("ORTE_RELAY_DELAY");
+int sec = str ? atoi(str) : 0;
+if (sec) {
+sleep(sec);
+}
+}
 }

 CLEANUP:

Just set ORTE_RELAY_DELAY to 1 (second) and you should reproduce the bug.

During our experiments, the bug disappeared when we added a delay before calling MPI_Init. So, configurations where processes are launched slowly or take some time before MPI_Init should be immune to this bug.

We usually reproduce the bug with one ppn (faster to spawn).

Sylvain

On Thu, 19 Nov 2009, Ralph Castain wrote:

Hi Sylvain

I've spent several hours trying to replicate the behavior you described on clusters up to a couple of hundred nodes (all running slurm), without success. I'm becoming increasingly convinced that this is a configuration issue as opposed to a code issue.

I have enclosed the platform file I use below. Could you compare it to your configuration? I'm wondering if there is something critical about the config that may be causing the problem (perhaps we have a problem in our default configuration). Also, is there anything else you can tell us about your configuration? How many ppn triggers it, or do you always get the behavior every time you launch over a certain number of nodes?

Meantime, I will look into this further. I am going to introduce a "slow down" param that will force the situation you encountered - i.e., will ensure that the relay is still being sent when the daemon receives the first collective input. We can then use that to try and force replication of the behavior you are encountering.
Thanks
Ralph

enable_dlopen=no
enable_pty_support=no
with_blcr=no
with_openib=yes
with_memory_manager=no
enable_mem_debug=yes
enable_mem_profile=no
enable_debug_symbols=yes
enable_binaries=yes
with_devel_headers=yes
enable_heterogeneous=no
enable_picky=yes
enable_debug=yes
enable_shared=yes
enable_static=yes
with_slurm=yes
enable_contrib_no_build=libnbc,vt
enable_visibility=yes
enable_memchecker=no
enable_ipv6=no
enable_mpi_f77=no
enable_mpi_f90=no
enable_mpi_cxx=no
enable_mpi_cxx_seek=no
enable_mca_no_build=pml-dr,pml-crcp2,crcp
enable_io_romio=no

On Nov 19, 2009, at 8:08 AM, Ralph Castain wrote:

On Nov 19, 2009, at 7:52 AM, Sylvain Jeaugey wrote:

> Thank you Ralph for this precious help.
>
> I set up a quick-and-dirty patch basically postponing process_msg (hence daemon_collective) until the launch is done. In process_msg, I therefore requeue a process_msg handler and return.

That is basically the idea I proposed, just done in a slightly different place.

> In this "all-must-be-non-blocking-and-done-through-opal_progress" algorithm, I don't think that blocking calls like the one in daemon_collective should be allowed. This also applies to the blocking one in send_relay. [Well, actually, one is okay, two may lead to interlocking.]

Well, that would be problematic - you will find "progressed_wait" used repeatedly in the code. Removing them all would take a -lot- of effort and a major rewrite. I'm not yet convi
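As a minimal illustration of the requeue-and-return idea discussed in this thread: the sketch below was written for this digest and is not the actual orted code. The names pending_msg_t, process_msg and launch_done are invented, and the real implementation defers work through the OPAL event/progress machinery rather than a hand-rolled list; the point is only that the handler requeues and returns instead of blocking in a progressed_wait-style loop.

/* Illustrative sketch only -- not the real ORTE code.  If a daemon
 * collective arrives while the local launch is still in progress, the
 * message is put back on a pending list and the handler returns, so it
 * never blocks inside the progress loop. */
#include <stddef.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct pending_msg {
    struct pending_msg *next;
    int id;                              /* stand-in for the message payload */
} pending_msg_t;

static pending_msg_t *pending_head = NULL;
static bool launch_complete = false;     /* set once local procs are launched */

/* Hypothetical handler invoked from the progress loop for each message. */
static void process_msg(pending_msg_t *msg)
{
    if (!launch_complete) {
        /* Launch still running: requeue and return immediately. */
        msg->next = pending_head;
        pending_head = msg;
        return;
    }
    printf("processing collective message %d\n", msg->id);
}

/* Called once the launch finishes: replay anything that was deferred. */
static void launch_done(void)
{
    launch_complete = true;
    while (pending_head != NULL) {
        pending_msg_t *msg = pending_head;
        pending_head = msg->next;
        process_msg(msg);
    }
}

int main(void)
{
    pending_msg_t early = { NULL, 1 };
    process_msg(&early);   /* arrives before the launch completes: deferred */
    launch_done();         /* launch done: the deferred message is processed */
    return 0;
}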
Re: [OMPI devel] SC09 OMPI-related slides
Jeff,

Thanks for making these papers and presentations available.

Ken

Kenneth A. Lloyd
CEO - Director of Systems Science
Watt Systems Technologies Inc.
Albuquerque, NM USA
kenneth.lloyd[at]wattsys.com
kenneth.lloyd[at]incose.org
kenneth.lloyd[at]nmug.net
www.wattsys.com
http://www.linkedin.com/pub/kenneth-lloyd/7/9a/824
http://kenscomplex.blogspot.com/

This e-mail is covered by the Electronic Communications Privacy Act, 18 U.S.C. 2510-2521 and is intended only for the addressee named above. It may contain privileged or confidential information. If you are not the addressee you must not copy, distribute, disclose or use any of the information in it. If you have received it in error please delete it and immediately notify the sender.

> -----Original Message-----
> From: devel-boun...@open-mpi.org
> [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres
> Sent: Tuesday, November 24, 2009 6:08 AM
> To: Open MPI Developers List
> Subject: [OMPI devel] SC09 OMPI-related slides
>
> If you had any papers or presentations on, about, or relating
> to Open MPI, please send me a copy so that we can post them here:
>
> http://www.open-mpi.org/papers/sc-2009/
>
> Thanks!
>
> --
> Jeff Squyres
> jsquy...@cisco.com
Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm
On Nov 27, 2009, at 8:23 AM, Sylvain Jeaugey wrote:

> Hi Ralph,
>
> I tried with the trunk and it makes no difference for me.

Strange

> Looking at potential differences, I found out something strange. The bug may have something to do with the "routed" framework. I can reproduce the bug with binomial and direct, but not with cm and linear (you disabled the build of the latter in your configure options -- why?).

You won't with cm because there is no relay. Likewise, direct doesn't have a relay - so I'm really puzzled how you can see this behavior when using the direct component???

I disable components in my build to save memory. Every component we open costs us memory that may or may not be recoverable during the course of execution.

> Btw, all components have a 0 priority and none is defined to be the default component. Which one is the default then? binomial (as the first in alphabetical order)?

I believe you must have a severely corrupted version of the code. The binomial component has priority 70, so it will be selected as the default. Linear has priority 40, though it will only be selected if you say ^binomial. CM and radix have special selection code in them so they will only be selected when specified. Direct and slave have priority 0 to ensure they will only be selected when specified.

> Can you check which one you are using and try with binomial explicitly chosen?

I am using binomial for all my tests.

From what you are describing, I think you either have a corrupted copy of the code, are picking up mis-matched versions, or something strange, as your experiences don't match what anyone else is seeing.

Remember, the phase you are discussing here has nothing to do with the native launch environment. This is dealing with the relative timing of the application launch versus relaying the launch message itself - i.e., the daemons are already up and running before any of this starts. Thus, this "problem" has nothing to do with how we launch the daemons. So, if it truly were a problem in the code, we would see it on every environment - torque, slurm, ssh, etc.

We routinely launch jobs spanning hundreds to thousands of nodes without problem. If this timing problem was as you have identified, then we would see this constantly. Yet nobody is seeing it, and I cannot reproduce it even with your reproducer.

I honestly don't know what to suggest at this point. Any chance you are picking up mis-matched OMPI versions on your backend nodes or something? Tried fresh checkouts of the code? Is this a code base you have modified, or are you seeing this with the "stock" code from the repo?

Just fishing at this point - can't find anything wrong! :-/

Ralph

> Thanks for your time,
> Sylvain
>
> On Thu, 26 Nov 2009, Ralph Castain wrote:
>
>> Hi Sylvain
>>
>> Well, I hate to tell you this, but I cannot reproduce the "bug" even with this code in ORTE no matter what value of ORTE_RELAY_DELAY I use. The system runs really slow as I increase the delay, but it completes the job just fine. I ran jobs across 16 nodes on a slurm machine, 1-4 ppn, a "hello world" app that calls MPI_Init immediately upon execution.
>>
>> So I have to conclude this is a problem in your setup/config. Are you sure you didn't --enable-progress-threads?? That is the only way I can recreate this behavior.
>>
>> I plan to modify the relay/message processing method anyway to clean it up. But there doesn't appear to be anything wrong with the current code.
>> Ralph
>>
>> On Nov 20, 2009, at 6:55 AM, Sylvain Jeaugey wrote:
>>
>>> Hi Ralph,
>>>
>>> Thanks for your efforts. I will look at our configuration and see how it may differ from yours.
>>>
>>> Here is a patch which helps reproduce the bug even with a small number of nodes.
>>>
>>> diff -r b622b9e8f1ac orte/orted/orted_comm.c
>>> --- a/orte/orted/orted_comm.c Wed Nov 18 09:27:55 2009 +0100
>>> +++ b/orte/orted/orted_comm.c Fri Nov 20 14:47:39 2009 +0100
>>> @@ -126,6 +126,13 @@
>>>         ORTE_ERROR_LOG(ret);
>>>         goto CLEANUP;
>>>     }
>>> +{ /* Add delay to reproduce bug */
>>> +char * str = getenv("ORTE_RELAY_DELAY");
>>> +int sec = str ? atoi(str) : 0;
>>> +if (sec) {
>>> +sleep(sec);
>>> +}
>>> +}
>>>  }
>>>
>>>  CLEANUP:
>>>
>>> Just set ORTE_RELAY_DELAY to 1 (second) and you should reproduce the bug.
>>>
>>> During our experiments, the bug disappeared when we added a delay before calling MPI_Init. So, configurations where processes are launched slowly or take some time before MPI_Init should be immune to this bug.
>>>
>>> We usually reproduce the bug with one ppn (faster to spawn).
>>>
>>> Sylvain
>>>
>>> On Thu, 19 Nov 2009, Ralph Castain wrote:
>>>
>>>> Hi Sylvain
>>>>
>>>> I've spent several hours trying to replicate the behavior you described on cluste
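For readers unfamiliar with how the default routed component gets chosen, the following sketch mimics the priority scheme Ralph describes above (binomial 70, linear 40, direct and slave 0; cm and radix are omitted because they use special selection logic). It is a simplified stand-in written for this digest: the struct, table, and select_routed function are not the real orte/mca/routed selection code, only an illustration of why binomial wins by default and direct runs only when explicitly requested.

/* Simplified illustration of priority-based component selection; the real
 * MCA framework does this through component query functions, not a static
 * table.  Priorities follow the values quoted above. */
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *name;
    int priority;               /* highest priority wins unless overridden */
} routed_component_t;

static const routed_component_t components[] = {
    { "binomial", 70 },         /* default */
    { "linear",   40 },         /* chosen only if binomial is excluded */
    { "direct",    0 },         /* only when explicitly requested */
    { "slave",     0 },         /* only when explicitly requested */
};

/* If the user named a component (e.g. "-mca routed direct"), take that one;
 * otherwise return the highest-priority entry. */
static const routed_component_t *select_routed(const char *requested)
{
    const routed_component_t *best = NULL;
    for (size_t i = 0; i < sizeof(components) / sizeof(components[0]); i++) {
        if (requested != NULL) {
            if (strcmp(components[i].name, requested) == 0) {
                return &components[i];
            }
        } else if (best == NULL || components[i].priority > best->priority) {
            best = &components[i];
        }
    }
    return best;                /* NULL if an unknown name was requested */
}

int main(void)
{
    printf("default: %s\n", select_routed(NULL)->name);     /* binomial */
    printf("forced:  %s\n", select_routed("direct")->name); /* direct */
    return 0;
}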
Re: [OMPI devel] RFC: Add extra_state field to ompi_request_t
Brian,

This is a pretty big change to be done on such short notice, especially over the Thanksgiving weekend. I do have a lot of concerns about this approach, but I lack the time to expand on this right now. I'll be back at work on Monday and I'll give detailed information. Please delay the deadline until at least Wednesday.

Thanks,
george.

On Nov 25, 2009, at 11:52, Barrett, Brian W wrote:

> WHAT: Add a void* extra_state field to ompi_request_t
>
> WHY: When we added the req_complete_cb field so that internal pieces of OMPI who generated requests (such as the OSC components using the PML) could be async notified when the request completed (ie, the PML request the OSC component had initiated was finished), we neglected to add any type of "extra state" associated with that request/callback. So the completion callback is almost worthless, because the upper layer has a hard time figuring out which thing it was working on can now progress due to the given (lower?) request completing.
>
> WHERE: One line in each of ompi/request/request.[hc].
>
> WHEN: ASAP
>
> TIMEOUT: Sunday, Nov 29.
>
> More Details
>
> This is probably not even worth an RFC, which is why I'm not giving a very long timeout (that, and if I don't get this done during the holiday weekend, it will never get done). The changes are a single line in request.h adding a void* extra_state variable to the ompi_request_t and another single line in request.c to initialize the field to NULL.
>
> While looking for some other code, I stumbled upon the OSC changes I made a long time ago to try to use req_complete_cb instead of registering a progress function. The code is actually a lot cleaner that way, and means no progress functions for the one-sided components.
>
> The down side is that it adds another 8 bytes to ompi_request_t, which is already larger than I'd like. But on the flip side, we have an 8 byte field (the callback) which is totally unusable without the extra_state field.
>
> Brian
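To make the motivation concrete, here is a small sketch of the proposal as described in the RFC. The struct below is a heavily simplified stand-in for ompi_request_t (the real field types and callback signature differ), and osc_op_t is an invented example of the "extra state" an upper layer such as a one-sided component would attach to the request so its completion callback can find its context.

/* Sketch only: simplified stand-ins, not the real OMPI definitions. */
#include <stdio.h>

typedef struct request request_t;
typedef int (*complete_cb_t)(request_t *req);

struct request {
    complete_cb_t req_complete_cb;   /* existing completion callback */
    void         *extra_state;       /* proposed field: caller's context */
};

/* Hypothetical upper-layer operation tracking its own progress. */
typedef struct {
    int outstanding;                 /* lower-level requests still pending */
} osc_op_t;

static int osc_complete_cb(request_t *req)
{
    /* Without extra_state, this callback has no way to know which
     * higher-level operation the completed request belongs to. */
    osc_op_t *op = (osc_op_t *) req->extra_state;
    op->outstanding--;
    return 0;
}

int main(void)
{
    osc_op_t op = { .outstanding = 1 };
    request_t req = { .req_complete_cb = osc_complete_cb, .extra_state = &op };

    req.req_complete_cb(&req);       /* simulate the PML completing it */
    printf("outstanding = %d\n", op.outstanding);   /* prints 0 */
    return 0;
}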