Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-20 Thread Sylvain Jeaugey

Hi Ralph,

Thanks for your efforts. I will look at our configuration and see how it 
may differ from yours.


Here is a patch which helps reproduce the bug even with a small number 
of nodes.


diff -r b622b9e8f1ac orte/orted/orted_comm.c
--- a/orte/orted/orted_comm.c   Wed Nov 18 09:27:55 2009 +0100
+++ b/orte/orted/orted_comm.c   Fri Nov 20 14:47:39 2009 +0100
@@ -126,6 +126,13 @@
 ORTE_ERROR_LOG(ret);
 goto CLEANUP;
 }
+{   /* Add delay to reproduce bug */
+    char *str = getenv("ORTE_RELAY_DELAY");
+    int sec = str ? atoi(str) : 0;
+    if (sec) {
+        sleep(sec);
+    }
+}
 }

 CLEANUP:

Just set ORTE_RELAY_DELAY to 1 (second) and you should reproduce the bug.

During our experiments, the bug disappeared when we added a delay before 
calling MPI_Init. So, configurations where processes are launched slowly 
or take some time before MPI_Init should be immune to this bug.
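
For illustration, a minimal hello world of this kind, with an optional delay 
before MPI_Init, could look like the following (the MPI_INIT_DELAY variable is 
purely illustrative, not an Open MPI variable; it just adds the masking delay):

/* Minimal MPI hello world with an optional delay before MPI_Init.
 * MPI_INIT_DELAY is purely illustrative (not an Open MPI variable);
 * setting it mimics the "slow before MPI_Init" configurations that
 * hide the bug in our runs. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char *str = getenv("MPI_INIT_DELAY");
    int sec = str ? atoi(str) : 0;
    if (sec > 0) {
        sleep(sec);   /* delaying MPI_Init masks the deadlock */
    }

    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("hello from rank %d\n", rank);
    MPI_Finalize();
    return 0;
}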


We usually reproduce the bug with one ppn (faster to spawn).

Sylvain

On Thu, 19 Nov 2009, Ralph Castain wrote:


Hi Sylvain

I've spent several hours trying to replicate the behavior you described on 
clusters up to a couple of hundred nodes (all running slurm), without success. 
I'm becoming increasingly convinced that this is a configuration issue as 
opposed to a code issue.

I have enclosed the platform file I use below. Could you compare it to your 
configuration? I'm wondering if there is something critical about the config 
that may be causing the problem (perhaps we have a problem in our default 
configuration).

Also, is there anything else you can tell us about your configuration? How many 
ppn triggers it, or do you always get the behavior every time you launch over a 
certain number of nodes?

Meantime, I will look into this further. I am going to introduce a "slow down" 
param that will force the situation you encountered - i.e., will ensure that the relay is 
still being sent when the daemon receives the first collective input. We can then use 
that to try and force replication of the behavior you are encountering.

Thanks
Ralph

enable_dlopen=no
enable_pty_support=no
with_blcr=no
with_openib=yes
with_memory_manager=no
enable_mem_debug=yes
enable_mem_profile=no
enable_debug_symbols=yes
enable_binaries=yes
with_devel_headers=yes
enable_heterogeneous=no
enable_picky=yes
enable_debug=yes
enable_shared=yes
enable_static=yes
with_slurm=yes
enable_contrib_no_build=libnbc,vt
enable_visibility=yes
enable_memchecker=no
enable_ipv6=no
enable_mpi_f77=no
enable_mpi_f90=no
enable_mpi_cxx=no
enable_mpi_cxx_seek=no
enable_mca_no_build=pml-dr,pml-crcp2,crcp
enable_io_romio=no

On Nov 19, 2009, at 8:08 AM, Ralph Castain wrote:



On Nov 19, 2009, at 7:52 AM, Sylvain Jeaugey wrote:


Thank you Ralph for this precious help.

I set up a quick-and-dirty patch that basically postpones process_msg (and hence 
daemon_collective) until the launch is done: in process_msg, I therefore 
requeue a process_msg handler and return.
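
Roughly, the idea in toy form (this is a self-contained sketch, not the actual 
ORTE types or event API; the queue, the handlers and the flag below are 
stand-ins, only the control flow mirrors the patch):

/* Toy model of the requeue-and-return idea: a self-contained sketch,
 * not the actual ORTE types or event API. */
#include <stdio.h>
#include <stdbool.h>

#define QMAX 16
typedef void (*handler_t)(int msg);

static handler_t q_fn[QMAX];
static int       q_arg[QMAX];
static int       q_head = 0, q_tail = 0;
static bool      launch_msg_processed = false;

/* Enqueue an event (no bounds checking; enough for this demo). */
static void post(handler_t fn, int arg)
{
    q_fn[q_tail] = fn;
    q_arg[q_tail] = arg;
    q_tail++;
}

/* Stand-in for process_msg(): if the local launch message has not been
 * fully processed, requeue ourselves and return instead of blocking. */
static void process_msg(int msg)
{
    if (!launch_msg_processed) {
        printf("process_msg(%d): launch not done yet, requeueing\n", msg);
        post(process_msg, msg);
        return;
    }
    printf("process_msg(%d): launch done, running daemon collective\n", msg);
}

/* Stand-in for the end of local launch-message processing. */
static void finish_launch(int unused)
{
    (void)unused;
    launch_msg_processed = true;
    printf("finish_launch: launch message fully processed\n");
}

int main(void)
{
    /* A collective message arrives before local launch processing is done. */
    post(process_msg, 42);
    post(finish_launch, 0);

    /* Single-threaded progress loop (stands in for opal_progress). */
    while (q_head < q_tail) {
        q_fn[q_head](q_arg[q_head]);
        q_head++;
    }
    return 0;
}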


That is basically the idea I proposed, just done in a slightly different place



In this "all-must-be-non-blocking-and-done-through-opal_progress" algorithm, I 
don't think that blocking calls like the one in daemon_collective should be allowed. This 
also applies to the blocking one in send_relay. [Well, actually, one is okay, 2 may lead 
to interlocking.]
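
As a self-contained illustration of that last point (a toy model, not ORTE 
code): in a single-threaded, recursive progress engine, one nested blocking 
wait completes, but a second one never can, because the flag it waits on is 
only set after its own caller returns.

/* Toy model, NOT ORTE code: two nested blocking waits on a recursive,
 * single-threaded progress engine.  The inner wait is capped so the
 * demo terminates; without the cap it would spin forever. */
#include <stdio.h>
#include <stdbool.h>

static bool relay_sent = false;
static bool launch_msg_processed = false;
static int  progress_calls = 0;

static void progress(void);

/* Inner blocking wait: stands in for daemon_collective() waiting on
 * jobdat->launch_msg_processed. */
static void daemon_collective(void)
{
    while (!launch_msg_processed && progress_calls < 10) {
        progress();
    }
    if (!launch_msg_processed) {
        printf("daemon_collective: still waiting after %d progress calls "
               "(would spin forever without the cap)\n", progress_calls);
    }
}

/* Progress engine: the first call simulates a collective message arriving
 * while the relay of the launch message is still going out. */
static void progress(void)
{
    progress_calls++;
    if (1 == progress_calls) {
        daemon_collective();          /* nested blocking wait */
    }
    relay_sent = true;                /* the relay send eventually completes */
}

/* Outer blocking wait: stands in for send_relay()'s blocking send, done
 * while the launch message is still being processed. */
static void process_launch_msg(void)
{
    while (!relay_sent) {
        progress();
    }
    launch_msg_processed = true;      /* set only after the relay is done */
}

int main(void)
{
    process_launch_msg();
    printf("relay_sent=%d launch_msg_processed=%d\n",
           relay_sent, launch_msg_processed);
    return 0;
}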


Well, that would be problematic - you will find "progressed_wait" used 
repeatedly in the code. Removing them all would take a -lot- of effort and a major 
rewrite. I'm not yet convinced it is required. There may be something strange in how you 
are set up, or in your cluster - like I said, this is the first report of a problem we have 
had, and people with much bigger slurm clusters have been running this code every day for 
over a year.



If you have time to do a nicer patch, that would be great and I would be happy to 
test it. Otherwise, I will try to implement your idea properly next week (with 
my limited knowledge of orted).


Either way is fine - I'll see if I can get to it.

Thanks
Ralph



For the record, here is the patch I'm currently testing at large scale:

diff -r ec68298b3169 -r b622b9e8f1ac orte/mca/grpcomm/bad/grpcomm_bad_module.c
--- a/orte/mca/grpcomm/bad/grpcomm_bad_module.c Mon Nov 09 13:29:16 2009 +0100
+++ b/orte/mca/grpcomm/bad/grpcomm_bad_module.c Wed Nov 18 09:27:55 2009 +0100
@@ -687,14 +687,6 @@
   opal_list_append(&orte_local_jobdata, &jobdat->super);
   }

-/* it may be possible to get here prior to having actually finished processing our
- * local launch msg due to the race condition between different nodes and when
- * they start their individual procs. Hence, we have to first ensure that we
- * -have- finished processing the launch msg, or else we won't know whether
- * or not to wait before sending this on
- */
-ORTE_PROGRESSED_WAIT(jobdat->launch_msg_processed, 0, 1);
-
   /* unpack the collective type */
   n

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-20 Thread Ralph Castain
Thanks! I'll give it a try.

My tests are all conducted with fast launches (just running slurm on large 
clusters) and using an MPI hello world that calls MPI_Init as its first 
instruction. I'll see if adding the delay causes it to misbehave.


On Nov 20, 2009, at 6:55 AM, Sylvain Jeaugey wrote:

> Hi Ralph,
> 
> Thanks for your efforts. I will look at our configuration and see how it may 
> differ from yours.
> 
> Here is a patch which helps reproduce the bug even with a small number of 
> nodes.
> 
> diff -r b622b9e8f1ac orte/orted/orted_comm.c
> --- a/orte/orted/orted_comm.c   Wed Nov 18 09:27:55 2009 +0100
> +++ b/orte/orted/orted_comm.c   Fri Nov 20 14:47:39 2009 +0100
> @@ -126,6 +126,13 @@
> ORTE_ERROR_LOG(ret);
> goto CLEANUP;
> }
> +{ /* Add delay to reproduce bug */
> +char * str = getenv("ORTE_RELAY_DELAY");
> +int sec = str ? atoi(str) : 0;
> +if (sec) {
> +sleep(sec);
> +}
> +}
> }
> 
> CLEANUP:
> 
> Just set ORTE_RELAY_DELAY to 1 (second) and you should reproduce the bug.
> 
> During our experiments, the bug disappeared when we added a delay before 
> calling MPI_Init. So, configurations where processes are launched slowly or 
> take some time before MPI_Init should be immune to this bug.
> 
> We usually reproduce the bug with one ppn (faster to spawn).
> 
> Sylvain
> 
> On Thu, 19 Nov 2009, Ralph Castain wrote:
> 
>> Hi Sylvain
>> 
>> I've spent several hours trying to replicate the behavior you described on 
>> clusters up to a couple of hundred nodes (all running slurm), without 
>> success. I'm becoming increasingly convinced that this is a configuration 
>> issue as opposed to a code issue.
>> 
>> I have enclosed the platform file I use below. Could you compare it to your 
>> configuration? I'm wondering if there is something critical about the config 
>> that may be causing the problem (perhaps we have a problem in our default 
>> configuration).
>> 
>> Also, is there anything else you can tell us about your configuration? How 
>> many ppn triggers it, or do you always get the behavior every time you 
>> launch over a certain number of nodes?
>> 
>> Meantime, I will look into this further. I am going to introduce a "slow 
>> down" param that will force the situation you encountered - i.e., will 
>> ensure that the relay is still being sent when the daemon receives the first 
>> collective input. We can then use that to try and force replication of the 
>> behavior you are encountering.
>> 
>> Thanks
>> Ralph
>> 
>> enable_dlopen=no
>> enable_pty_support=no
>> with_blcr=no
>> with_openib=yes
>> with_memory_manager=no
>> enable_mem_debug=yes
>> enable_mem_profile=no
>> enable_debug_symbols=yes
>> enable_binaries=yes
>> with_devel_headers=yes
>> enable_heterogeneous=no
>> enable_picky=yes
>> enable_debug=yes
>> enable_shared=yes
>> enable_static=yes
>> with_slurm=yes
>> enable_contrib_no_build=libnbc,vt
>> enable_visibility=yes
>> enable_memchecker=no
>> enable_ipv6=no
>> enable_mpi_f77=no
>> enable_mpi_f90=no
>> enable_mpi_cxx=no
>> enable_mpi_cxx_seek=no
>> enable_mca_no_build=pml-dr,pml-crcp2,crcp
>> enable_io_romio=no
>> 
>> On Nov 19, 2009, at 8:08 AM, Ralph Castain wrote:
>> 
>>> 
>>> On Nov 19, 2009, at 7:52 AM, Sylvain Jeaugey wrote:
>>> 
 Thank you Ralph for this precious help.
 
 I setup a quick-and-dirty patch basically postponing process_msg (hence 
 daemon_collective) until the launch is done. In process_msg, I therefore 
 requeue a process_msg handler and return.
>>> 
>>> That is basically the idea I proposed, just done in a slightly different 
>>> place
>>> 
 
 In this "all-must-be-non-blocking-and-done-through-opal_progress" 
 algorithm, I don't think that blocking calls like the one in 
 daemon_collective should be allowed. This also applies to the blocking one 
 in send_relay. [Well, actually, one is okay, 2 may lead to interlocking.]
>>> 
>>> Well, that would be problematic - you will find "progressed_wait" used 
>>> repeatedly in the code. Removing them all would take a -lot- of effort and 
>>> a major rewrite. I'm not yet convinced it is required. There may be 
>>> something strange in how you are setup, or your cluster - like I said, this 
>>> is the first report of a problem we have had, and people with much bigger 
>>> slurm clusters have been running this code every day for over a year.
>>> 
 
 If you have time doing a nicer patch, it would be great and I would be 
 happy to test it. Otherwise, I will try to implement your idea properly 
 next week (with my limited knowledge of orted).
>>> 
>>> Either way is fine - I'll see if I can get to it.
>>> 
>>> Thanks
>>> Ralph
>>> 
 
 For the record, here is the patch I'm currently testing at large scale :
 
 diff -r ec68298b3169 -r b622b9e8f1ac orte/mca/grpcomm/bad/grpcomm_bad_module.c
 --- a/orte/mca/grpcomm/bad/grpcomm_bad_modu

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-20 Thread Ralph Castain
BTW: does this reproduce on the trunk and/or 1.3.4 as well? I'm wondering 
because we know the 1.5 branch is skewed relative to the trunk. Could well be a 
bug sitting over there.

On Nov 20, 2009, at 7:06 AM, Ralph Castain wrote:

> Thanks! I'll give it a try.
> 
> My tests are all conducted with fast launches (just running slurm on large 
> clusters) and using an mpi hello world that calls mpi_init at first 
> instruction. I'll see if adding the delay causes it to misbehave.
> 
> 
> On Nov 20, 2009, at 6:55 AM, Sylvain Jeaugey wrote:
> 
>> Hi Ralph,
>> 
>> Thanks for your efforts. I will look at our configuration and see how it may 
>> differ from yours.
>> 
>> Here is a patch which helps reproduce the bug even with a small number of 
>> nodes.
>> 
>> diff -r b622b9e8f1ac orte/orted/orted_comm.c
>> --- a/orte/orted/orted_comm.c   Wed Nov 18 09:27:55 2009 +0100
>> +++ b/orte/orted/orted_comm.c   Fri Nov 20 14:47:39 2009 +0100
>> @@ -126,6 +126,13 @@
>>ORTE_ERROR_LOG(ret);
>>goto CLEANUP;
>>}
>> +{ /* Add delay to reproduce bug */
>> +char * str = getenv("ORTE_RELAY_DELAY");
>> +int sec = str ? atoi(str) : 0;
>> +if (sec) {
>> +sleep(sec);
>> +}
>> +}
>>}
>> 
>> CLEANUP:
>> 
>> Just set ORTE_RELAY_DELAY to 1 (second) and you should reproduce the bug.
>> 
>> During our experiments, the bug disappeared when we added a delay before 
>> calling MPI_Init. So, configurations where processes are launched slowly or 
>> take some time before MPI_Init should be immune to this bug.
>> 
>> We usually reproduce the bug with one ppn (faster to spawn).
>> 
>> Sylvain
>> 
>> On Thu, 19 Nov 2009, Ralph Castain wrote:
>> 
>>> Hi Sylvain
>>> 
>>> I've spent several hours trying to replicate the behavior you described on 
>>> clusters up to a couple of hundred nodes (all running slurm), without 
>>> success. I'm becoming increasingly convinced that this is a configuration 
>>> issue as opposed to a code issue.
>>> 
>>> I have enclosed the platform file I use below. Could you compare it to your 
>>> configuration? I'm wondering if there is something critical about the 
>>> config that may be causing the problem (perhaps we have a problem in our 
>>> default configuration).
>>> 
>>> Also, is there anything else you can tell us about your configuration? How 
>>> many ppn triggers it, or do you always get the behavior every time you 
>>> launch over a certain number of nodes?
>>> 
>>> Meantime, I will look into this further. I am going to introduce a "slow 
>>> down" param that will force the situation you encountered - i.e., will 
>>> ensure that the relay is still being sent when the daemon receives the 
>>> first collective input. We can then use that to try and force replication 
>>> of the behavior you are encountering.
>>> 
>>> Thanks
>>> Ralph
>>> 
>>> enable_dlopen=no
>>> enable_pty_support=no
>>> with_blcr=no
>>> with_openib=yes
>>> with_memory_manager=no
>>> enable_mem_debug=yes
>>> enable_mem_profile=no
>>> enable_debug_symbols=yes
>>> enable_binaries=yes
>>> with_devel_headers=yes
>>> enable_heterogeneous=no
>>> enable_picky=yes
>>> enable_debug=yes
>>> enable_shared=yes
>>> enable_static=yes
>>> with_slurm=yes
>>> enable_contrib_no_build=libnbc,vt
>>> enable_visibility=yes
>>> enable_memchecker=no
>>> enable_ipv6=no
>>> enable_mpi_f77=no
>>> enable_mpi_f90=no
>>> enable_mpi_cxx=no
>>> enable_mpi_cxx_seek=no
>>> enable_mca_no_build=pml-dr,pml-crcp2,crcp
>>> enable_io_romio=no
>>> 
>>> On Nov 19, 2009, at 8:08 AM, Ralph Castain wrote:
>>> 
 
 On Nov 19, 2009, at 7:52 AM, Sylvain Jeaugey wrote:
 
> Thank you Ralph for this precious help.
> 
> I setup a quick-and-dirty patch basically postponing process_msg (hence 
> daemon_collective) until the launch is done. In process_msg, I therefore 
> requeue a process_msg handler and return.
 
 That is basically the idea I proposed, just done in a slightly different 
 place
 
> 
> In this "all-must-be-non-blocking-and-done-through-opal_progress" 
> algorithm, I don't think that blocking calls like the one in 
> daemon_collective should be allowed. This also applies to the blocking 
> one in send_relay. [Well, actually, one is okay, 2 may lead to 
> interlocking.]
 
 Well, that would be problematic - you will find "progressed_wait" used 
 repeatedly in the code. Removing them all would take a -lot- of effort and 
 a major rewrite. I'm not yet convinced it is required. There may be 
 something strange in how you are setup, or your cluster - like I said, 
 this is the first report of a problem we have had, and people with much 
 bigger slurm clusters have been running this code every day for over a 
 year.
 
> 
> If you have time doing a nicer patch, it would be great and I would be 
> happy to test it. Otherwise, I will try to implement your idea pr

[OMPI devel] Fwd: Call for participation: MPI Forum User Survey

2009-11-20 Thread Jeff Squyres
The MPI Forum announced at its SC09 BOF that it is soliciting 
community feedback to help guide the MPI-3 standards process. A 
survey is available online at the following URL:


http://mpi-forum.questionpro.com/
Password: mpi3

In this survey, the MPI Forum is asking as many people as possible for 
feedback on the MPI-3 process -- what features to include, what 
features not to include, etc.


We encourage you to forward this survey on to as many interested and  
relevant parties as possible.


It will take approximately 10 minutes to complete the questionnaire.

No question in the survey is mandatory; feel free to answer only the 
questions that are relevant to you and your applications. Your 
answers will help the MPI Forum guide its process to create a 
genuinely useful MPI-3 standard.


This survey closes December 31, 2009.

Your survey responses will be strictly confidential and data from this  
research will be reported only in the aggregate. Your information will  
be coded and will remain confidential. If you have questions at any  
time about the survey or the procedures, you may contact the MPI Forum  
via email to mpi-comme...@mpi-forum.org.


Thank you very much for your time and support.

--
Jeff Squyres
jsquy...@cisco.com