Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r22317

2009-12-16 Thread George Bosilca
I don't think so. I had a very modest goal: it was not to fix the xgrid PLM 
(I'm not that proficient in Objective-C) but to silence the annoying compiler 
warnings on my Mac. In fact, I didn't even test it to see whether it's working or 
not, but based on some more or less recent complaints on the user mailing list, 
I guess not.

  george.

On Dec 15, 2009, at 19:34 , Jeff Squyres (jsquyres) wrote:

> Awesome!  Does this fix the xgrid support?
> 
> -jms
> Sent from my PDA.  No type good.
> 
> - Original Message -
> From: svn-full-boun...@open-mpi.org 
> To: svn-f...@open-mpi.org 
> Sent: Tue Dec 15 19:06:37 2009
> Subject: [OMPI svn-full] svn:open-mpi r22317
> 
> Author: bosilca
> Date: 2009-12-15 19:06:37 EST (Tue, 15 Dec 2009)
> New Revision: 22317
> URL: https://svn.open-mpi.org/trac/ompi/changeset/22317
> 
> Log:
> Santa's back! Fix all warnings about the deprecated usage of
> stringWithCString as well as the casting issue between NSInteger and
> %d. The first is solved by using stringWithUTF8String, which apparently
> will always give the right answer (sic). The second is fixed as suggested
> by Apple by casting the NSInteger (hint: which by definition is large
> enough to hold a pointer) to a long and use %ld in the printf.
> 
> Text files modified:
>    trunk/orte/mca/plm/xgrid/src/plm_xgrid_client.m |  32
>    1 files changed, 16 insertions(+), 16 deletions(-)
> 
> Modified: trunk/orte/mca/plm/xgrid/src/plm_xgrid_client.m
> ============================================================
> --- trunk/orte/mca/plm/xgrid/src/plm_xgrid_client.m (original)
> +++ trunk/orte/mca/plm/xgrid/src/plm_xgrid_client.m 2009-12-15 19:06:37 EST (Tue, 15 Dec 2009)
> @@ -56,14 +56,14 @@
> OBJ_CONSTRUCT(&state_mutex, opal_mutex_t);
> 
> if (NULL != password) {
> -   controller_password = [NSString stringWithCString: password];
> +   controller_password = [NSString stringWithUTF8String: password];
> }
> if (NULL != hostname) {
> -   controller_hostname = [NSString stringWithCString: hostname];
> +   controller_hostname = [NSString stringWithUTF8String: hostname];
> }
> cleanup = val;
> if (NULL != ortedname) {
> -   orted = [NSString stringWithCString: ortedname];
> +   orted = [NSString stringWithUTF8String: ortedname];
> }
> 
> active_xgrid_jobs = [NSMutableDictionary dictionary];
> @@ -118,19 +118,19 @@
> 
>  -(void) setOrtedAsCString: (char*) name
>  {
> -orted = [NSString stringWithCString: name];
> +orted = [NSString stringWithUTF8String: name];
>  }
> 
> 
>  -(void) setControllerPasswordAsCString: (char*) name
>  {
> -controller_password = [NSString stringWithCString: name];
> +controller_password = [NSString stringWithUTF8String: name];
>  }
> 
> 
>  -(void) setControllerHostnameAsCString: (char*) password
>  {
> -controller_hostname = [NSString stringWithCString: password];
> +controller_hostname = [NSString stringWithUTF8String: password];
>  }
> 
> 
> @@ -267,7 +267,7 @@
>  NSMutableDictionary *task = [NSMutableDictionary dictionary];
> 
> /* fill in applicaton to start */
> -[task setObject: [NSString stringWithCString: orted_path]
> +[task setObject: [NSString stringWithUTF8String: orted_path]
>  forKey: XGJobSpecificationCommandKey];
> 
> /* fill in task arguments */
> @@ -281,11 +281,11 @@
> opal_output(0, "orte_plm_rsh: unable to get daemon vpid as string");
> goto cleanup;
> }
> -   [taskArguments addObject: [NSString stringWithCString: vpid_string]];
> +   [taskArguments addObject: [NSString stringWithUTF8String: vpid_string]];
> free(vpid_string);
> 
> [taskArguments addObject: @"--nodename"];
> -   [taskArguments addObject: [NSString stringWithCString: nodes[nnode]->name]];
> +   [taskArguments addObject: [NSString stringWithUTF8String: nodes[nnode]->name]];
> 
>  [task setObject: taskArguments forKey: XGJobSpecificationArgumentsKey];
> 
> @@ -393,8 +393,8 @@
>  -(void) connectionDidNotOpen:(XGConnection*) myConnection withError: (NSError*) error
>  {
>  opal_output(orte_plm_globals.output,
> -   "orte:plm:xgrid: Controller connection did not open: (%d) %s",
> -   [error code],
> +   "orte:plm:xgrid: Controller connection did not open: (%ld) %s",
> +   (long)[error code],
> [[error localizedDescription] UTF8String]);
>  opal_condition_broadcast(&state_cond);
>  }
> @@ -411,13 +411,13 @@
> case 530:
> case 535:
> opal_output(orte_plm_globals.output,
> -   "orte:plm:xgrid: Connection to XGrid controller failed due to authentication error (%d):",
> -   [[myConnection error] code]);
> +   "orte:plm

Re: [OMPI devel] carto vs. hwloc

2009-12-16 Thread George Bosilca
As far as I know, what Josh did is slightly different. In the case of a complete 
restart (where all processes are restarted from a checkpoint), he sets up and 
rewires a new set of BTLs.

However, it happens that we do have some code to rewire the MPI processes in 
case of failure(s) in one of UTK's projects. I'll have to talk with the team 
here to see if, at this point, there is something we can contribute on 
this matter.

  george.

On Dec 15, 2009, at 21:08 , Ralph Castain wrote:

> 
> On Dec 15, 2009, at 6:31 PM, Jeff Squyres wrote:
> 
>> On Dec 15, 2009, at 2:20 PM, Ralph Castain wrote:
>> 
>>> It probably should be done at a lower level, but it begs a different 
>>> question. For example, I've created the capability  in the new cluster 
>>> manager to detect interfaces that are lost, ride through the problem by 
>>> moving affected procs to other nodes (reconnecting ORTE-level comm), and 
>>> move procs back if/when nodes reappear. So someone can remove a node 
>>> "on-the-fly" and replace that hardware with another node without having to 
>>> stop and restart the job, etc. A lot of that infrastructure is now down 
>>> inside ORTE, though a few key pieces remain in the ORCM code base (and most 
>>> likely will stay there).
>>> 
>>> Works great - unless it is an MPI job. If we can figure out a way for the 
>>> MPI procs to (a) be properly restarted on the "new" node, and (b) update 
>>> the BTL connection info on the other MPI procs in the job, then we would be 
>>> good to go...
>>> 
>>> Trivial problem, I am sure :-)
>> 
>> ...actually, the groundwork is there with Josh's work, isn't it?  I think 
>> the real issue is handling un-graceful BTL failures properly.  I'm guessing 
>> that's the biggest piece that isn't done...?
> 
> Think so...not sure how to update the BTL's with the new info, but perhaps 
> Josh has already solved that problem.
> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r22313

2009-12-16 Thread Vasily Philipov

Hello all.
To Jeff:
   I thought that if there were no replies, it meant everything was OK.
   Thank you for your comments. I have fixed it; you can see the patch below.


Jeff Squyres wrote:

On Dec 15, 2009, at 8:56 PM, Jeff Squyres wrote:

  

Hmm.  I'm a little disappointed that this was applied without answering my 
questions first...

   http://www.open-mpi.org/community/lists/devel/2009/12/7187.php



WRONG.  You *did* answer -- somehow my mail client ate it (I see the reply in 
the web archives, but not in my local mail client -- #$@!$@!#$).

My bad...  :-(

Could you add some of your explanations as comments in the code?  The rationale 
here is that if I had those questions while reading your patch, someone else 
(including me, months from now) will likely have the same questions while 
reading the code.

Another minor quibble in a help message:

+[SRQ doesn't found]
+The srq doesn't found.
+Below is some information about the host that raised the error:
+
+Local host:   %s
+Local device: %s

It's not correct grammar and is fairly unhelpful to the user -- please change 
to:

[SRQ not found]
Open MPI tried to access a shared receive queue (SRQ) that was not found.  This 
should not happen, and is a fatal error.  Your MPI job will now abort.

Local host:   %s
Local device: %s

Also:

+  - When the number of not used receive buffers will decreased to 8
+the IBV_EVENT_SRQ_LIMIT_REACHED event will be signaled and the number
+of receive buffers that we can pre-post will be increased.

I don't think users know what IBV_EVENT_... is.  Perhaps it should read:

+  - When the number of unused shared receive buffers reaches 8, more
+buffers will be posted.

(how many more buffers will be posted, BTW?)



  


Index: ompi/mca/btl/openib/help-mpi-btl-openib.txt
===
--- ompi/mca/btl/openib/help-mpi-btl-openib.txt (revision 22318)
+++ ompi/mca/btl/openib/help-mpi-btl-openib.txt (working copy)
@@ -168,9 +168,9 @@
 You may need to consult with your system administrator to get this
 problem fixed.
 #
-[SRQ doesn't found]
-The srq doesn't found.
-Below is some information about the host that raised the error:
+[SRQ not found]
+Open MPI tried to access a shared receive queue (SRQ) that was not found.
+This should not happen, and is a fatal error.  Your MPI job will now abort.

 Local host:   %s
 Local device: %s
@@ -411,9 +411,8 @@
   - A sender will not send to a peer unless it has less than 32
 outstanding sends to that peer.
   - 32 receive buffers will be preposted.
-  - When the number of not used receive buffers will decreased to 8
-the IBV_EVENT_SRQ_LIMIT_REACHED event will be signaled and the number
-of receive buffers that we can pre-post will be increased.
+  - When the number of unused shared receive buffers reaches 8, more
+buffers (32 in this case) will be posted.

   Local host: %s
   Bad queue specification: %s
Index: ompi/mca/btl/openib/btl_openib.h
===
--- ompi/mca/btl/openib/btl_openib.h(revision 22318)
+++ ompi/mca/btl/openib/btl_openib.h(working copy)
@@ -381,6 +381,15 @@
 /** The flag points if we want to get the 
  IBV_EVENT_SRQ_LIMIT_REACHED events for dynamically resizing SRQ */
 bool srq_limit_event_flag;
+/**< Unlike the "--mca enable_srq_resize" parameter, which controls whether we
+ want to start with a small number of pre-posted receive buffers (rd_curr_num)
+ and grow that number on demand (up to rd_num, the whole size of the SRQ),
+ srq_limit_event_flag says whether we want to get a limit event from the
+ device when the defined SRQ limit is reached (a signal to the main thread);
+ we clear this flag once rd_curr_num has been increased up to rd_num.
+ In order to avoid lock/unlock operations on the critical path, we prefer to
+ set srq_limit_event_flag only in the asynchronous thread: this way we post
+ receive buffers in the main thread only, and only after posting do we set
+ (if srq_limit_event_flag is true) the limit for the
+ IBV_EVENT_SRQ_LIMIT_REACHED event. */
 }; typedef struct mca_btl_openib_module_srq_qp_t mca_btl_openib_module_srq_qp_t;

 struct mca_btl_openib_module_qp_t {
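
[Editorial note] For readers not familiar with the verbs API, here is a minimal, hypothetical sketch 
(not part of the patch above; the function names are invented for illustration) of how an SRQ limit 
is typically armed with ibv_modify_srq() and how the resulting IBV_EVENT_SRQ_LIMIT_REACHED event 
shows up in the asynchronous event loop. The openib BTL splits this work between its main and 
asynchronous threads, as described in the comment in the patch.

#include <string.h>
#include <stdio.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Arm the SRQ limit: the HCA will generate IBV_EVENT_SRQ_LIMIT_REACHED
 * once the number of posted (unused) receive WRs drops below 'limit'. */
static int arm_srq_limit(struct ibv_srq *srq, uint32_t limit)
{
    struct ibv_srq_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.srq_limit = limit;                 /* e.g. 8, as in the help text above */
    return ibv_modify_srq(srq, &attr, IBV_SRQ_LIMIT);
}

/* Asynchronous event loop: on the limit event, flag that more receive
 * buffers should be posted; the limit is re-armed only after posting. */
static void drain_async_events(struct ibv_context *ctx)
{
    struct ibv_async_event event;

    while (0 == ibv_get_async_event(ctx, &event)) {
        if (IBV_EVENT_SRQ_LIMIT_REACHED == event.event_type) {
            /* In the BTL this would set srq_limit_event_flag so that the
             * main thread posts another batch of receive buffers. */
            printf("SRQ limit reached -- post more receive buffers\n");
        }
        ibv_ack_async_event(&event);
    }
}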


Re: [OMPI devel] carto vs. hwloc

2009-12-16 Thread Kenneth Lloyd




> -Original Message-
> From: devel-boun...@open-mpi.org 
> [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres
> Sent: Tuesday, December 15, 2009 6:32 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] carto vs. hwloc
> 
> On Dec 15, 2009, at 2:20 PM, Ralph Castain wrote:
> 
> > It probably should be done at a lower level, but it begs a 
> different question. For example, I've created the capability  
> in the new cluster manager to detect interfaces that are 
> lost, ride through the problem by moving affected procs to 
> other nodes (reconnecting ORTE-level comm), and move procs 
> back if/when nodes reappear. So someone can remove a node 
> "on-the-fly" and replace that hardware with another node 
> without having to stop and restart the job, etc. A lot of 
> that infrastructure is now down inside ORTE, though a few key 
> pieces remain in the ORCM code base (and most likely will stay there).
> > 
> > Works great - unless it is an MPI job. If we can figure out 
> a way for the MPI procs to (a) be properly restarted on the 
> "new" node, and (b) update the BTL connection info on the 
> other MPI procs in the job, then we would be good to go...
> > 
> > Trivial problem, I am sure :-)
> 
> ...actually, the groundwork is there with Josh's work, isn't 
> it?  I think the real issue is handling un-graceful BTL 
> failures properly.  I'm guessing that's the biggest piece 
> that isn't done...?

Precisely.  Why the BTL, or why not at the PTL? (Where these issues rightly
belong, IMO).

Ken Lloyd


> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] carto vs. hwloc

2009-12-16 Thread Joshua Hursey
Currently, I am working on process migration and automatic recovery based on 
checkpoint/restart. WRT the PML stack, this works by rewiring the BTLs after 
restart of the migrated/recovered MPI process(es). There is a fair amount of 
work in getting this right with respect to both the runtime and the OMPI layer 
(particularly the modex). For the automatic recovery with C/R we will, at 
first, require the restart of all processes in the job [for consistency]. For 
migration, only those processes moving will need to be restarted, all others 
may be blocked.

I think what you are looking for is the ability to lose a process and replace 
it without restarting all the rest of the processes. This would require a bit 
more work beyond what I am currently working on, since you will need to flush 
the PML/BML/BTL stack of latent messages, etc. The message-logging work by UTK 
should do this anyway (if they use uncoordinated C/R+message logging), but they 
will have to fill in the details on that project.

-- Josh

On Dec 16, 2009, at 1:32 AM, George Bosilca wrote:

> As far as I know what Josh did is slightly different. In the case of a 
> complete restart (where all processes are restarted from a checkpoint), he 
> setup and rewire a new set of BTLs.
> 
> However, it happens that we do have some code to rewire the MPI processes in 
> case of failure(s) in one of UTK projects. I'll have to talk with the team 
> here, to see if at this point there is something we can contribute regarding 
> this matter.
> 
>  george.
> 
> On Dec 15, 2009, at 21:08 , Ralph Castain wrote:
> 
>> 
>> On Dec 15, 2009, at 6:31 PM, Jeff Squyres wrote:
>> 
>>> On Dec 15, 2009, at 2:20 PM, Ralph Castain wrote:
>>> 
 It probably should be done at a lower level, but it begs a different 
 question. For example, I've created the capability  in the new cluster 
 manager to detect interfaces that are lost, ride through the problem by 
 moving affected procs to other nodes (reconnecting ORTE-level comm), and 
 move procs back if/when nodes reappear. So someone can remove a node 
 "on-the-fly" and replace that hardware with another node without having to 
 stop and restart the job, etc. A lot of that infrastructure is now down 
 inside ORTE, though a few key pieces remain in the ORCM code base (and 
 most likely will stay there).
 
 Works great - unless it is an MPI job. If we can figure out a way for the 
 MPI procs to (a) be properly restarted on the "new" node, and (b) update 
 the BTL connection info on the other MPI procs in the job, then we would 
 be good to go...
 
 Trivial problem, I am sure :-)
>>> 
>>> ...actually, the groundwork is there with Josh's work, isn't it?  I think 
>>> the real issue is handling un-graceful BTL failures properly.  I'm guessing 
>>> that's the biggest piece that isn't done...?
>> 
>> Think so...not sure how to update the BTL's with the new info, but perhaps 
>> Josh has already solved that problem.
>> 
>>> 
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> 
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Bug or feature?

2009-12-16 Thread Jeff Squyres
I would tend to agree with Paul.

It's uncommon (e.g., no one has run into this before now), and I would say that 
this is a bad application.  But then again, hanging is bad -- so it would be 
better to abort/terminate the whole job in this scenario.

I don't know how I would rate the priority of this, but it would be nice to 
have someday.


On Dec 15, 2009, at 11:17 PM, Ralph Castain wrote:

> Understandable - and we can count on your patch in the near future, then? :-)
> 
> On Dec 15, 2009, at 9:12 PM, Paul H. Hargrove wrote:
> 
> > My 0.02USD says that for pragmatic reasons one should attempt to terminate 
> > the job in this case, regardless of ones opinion of this unusual 
> > application behavior.
> >
> > -Paul
> >
> > Ralph Castain wrote:
> >> Hi folks
> >>
> >> In case you didn't follow this on the user list, we had a question come up 
> >> about proper OMPI behavior. Basically, the user has an application where 
> >> one process decides it should cleanly terminate prior to calling MPI_Init, 
> >> but all the others go ahead and enter MPI_Init. The application hangs 
> >> since we don't detect the one proc's exit as an abnormal termination (no 
> >> segfault, and it didn't call MPI_Init so it isn't required to call 
> >> MPI_Finalize prior to termination).
> >>
> >> I can probably come up with a way to detect this scenario and abort it. 
> >> But before I spend the effort chasing this down, my question to you MPI 
> >> folks is:
> >>
> >> What -should- OMPI do in this situation? We have never previously detected 
> >> such behavior - was this an oversight, or is this simply a "bad" 
> >> application?
> >>
> >> Thanks
> >> Ralph
> >>
> >>
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> 
> >
> >
> > --
> > Paul H. Hargrove  phhargr...@lbl.gov
> > Future Technologies Group Tel: +1-510-495-2352
> > HPC Research Department   Fax: +1-510-486-6900
> > Lawrence Berkeley National Laboratory
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 


-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI devel] Bug or feature?

2009-12-16 Thread George Bosilca
There are two citations from the MPI standard that I would like to highlight.

> All MPI programs must contain exactly one call to an MPI initialization 
> routine: MPI_INIT or MPI_INIT_THREAD.

> One goal of MPI is to achieve source code portability. By this we mean that a 
> program written using MPI and complying with the relevant language standards 
> is portable as written, and must not require any source code changes when 
> moved from one system to another. This explicitly does not say anything about 
> how an MPI program is started or launched from the command line, nor what the 
> user must do to set up the environment in which an MPI program will run. 
> However, an implementation may require some setup to be performed before 
> other MPI routines may be called. To provide for this, MPI includes an 
> initialization routine MPI_INIT.

While these two statements do not necessarily clarify the original question, 
they highlight an acceptable solution. Before exiting the MPI_Init function 
(which we don't have to assume is collective), any "MPI-like" process can 
be killed without problems (we can even claim that we call the default error 
handler). For those that have successfully exited MPI_Init, I guess the next MPI 
call will have to trigger the error handler, and these processes should be 
allowed to gracefully exit.

So, while it is clear that the best approach is to allow even a bad application 
to terminate, it is better if we follow what MPI describes as a "high quality 
implementation".

  george.


On Dec 15, 2009, at 23:17 , Ralph Castain wrote:

> Understandable - and we can count on your patch in the near future, then? :-)
> 
> On Dec 15, 2009, at 9:12 PM, Paul H. Hargrove wrote:
> 
>> My 0.02USD says that for pragmatic reasons one should attempt to terminate 
>> the job in this case, regardless of ones opinion of this unusual application 
>> behavior.
>> 
>> -Paul
>> 
>> Ralph Castain wrote:
>>> Hi folks
>>> 
>>> In case you didn't follow this on the user list, we had a question come up 
>>> about proper OMPI behavior. Basically, the user has an application where 
>>> one process decides it should cleanly terminate prior to calling MPI_Init, 
>>> but all the others go ahead and enter MPI_Init. The application hangs since 
>>> we don't detect the one proc's exit as an abnormal termination (no 
>>> segfault, and it didn't call MPI_Init so it isn't required to call 
>>> MPI_Finalize prior to termination).
>>> 
>>> I can probably come up with a way to detect this scenario and abort it. But 
>>> before I spend the effort chasing this down, my question to you MPI folks 
>>> is:
>>> 
>>> What -should- OMPI do in this situation? We have never previously detected 
>>> such behavior - was this an oversight, or is this simply a "bad" 
>>> application?
>>> 
>>> Thanks
>>> Ralph
>>> 
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>> 
>> 
>> -- 
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Future Technologies Group Tel: +1-510-495-2352
>> HPC Research Department   Fax: +1-510-486-6900
>> Lawrence Berkeley National Laboratory 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] SEGFAULT in mpi_init from paffinity with intel 11.1.059 compiler

2009-12-16 Thread Daan van Rossum
Hi Terry,

Thanks for your hint. I tried configure --enable-debug and even compiled it 
with all kinds of manual debug flags turned on, but it doesn't help to get rid 
of this problem. So it definitely is not an optimization flaw.
One more interesting test would be to try an older version of the Intel 
compiler. But the next older version that I have is 10.0.015, which is too old 
for the operating system (must be >10.1).


A good thing is that this bug is very easy to test. You only need one line of 
MPI code and one process in the execution.

A few more test cases:
 rank 0=node01 slot=1-7
and
 rank 0=node01 slot=0,2-7
and
 rank 0=node01 slot=0-1,3-7
work WELL.
But
 rank 0=node01 slot=0-2,4-7
FAILS.

As long as either slot 0, 1, OR 2 is excluded from the list, it's all right. 
Excluding a different slot, like slot 3, does not help.
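
[Editorial note] A minimal test along the lines described above might look like the following 
(the file names and the --rankfile invocation are illustrative, not taken from this thread):

/* reproducer.c -- the single line of MPI code mentioned above.
 *
 * Build and run (illustrative commands, assuming a rankfile named
 * "myrankfile" containing, e.g.,  rank 0=node01 slot=0-2,4-7):
 *
 *   mpicc reproducer.c -o reproducer
 *   mpirun -np 1 --rankfile myrankfile ./reproducer
 */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);   /* the reported segfault happens inside here */
    MPI_Finalize();
    return 0;
}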


I'll try to get hold of an Intel v10.1 compiler version.

Best,
Daan

* on Monday, 14.12.09 at 14:57, Terry Dontje  wrote:

> I don't really want to throw fud on this list but we've seen all
> sorts of oddities with OMPI 1.3.4 being built with Intel's 11.1
> compiler versus their 11.0 or other compilers (gcc, Sun Studio, pgi,
> and pathscale).  I have not tested your specific failing case but
> considering your issue doesn't show up with gcc I am wondering if
> there is some sort of optimization issue with the 11.1 compiler.
> 
> It might be interesting to see if using certain optimization levels
> with the Intel 11.1 compiler produces a working OMPI library.
> 
> --td
> 
> Daan van Rossum wrote:
> >Hi Ralph,
> >
> >I took the Dec 10th snapshot, but got exactly the same behavior as with 
> >version 1.3.4.
> >
> >I just noticed that even this rankfile doesn't work, with a single process:
> > rank 0=node01 slot=0-3
> >
> >
> >[node01:31105] mca:base:select:(paffinity) Querying component [linux]
> >[node01:31105] mca:base:select:(paffinity) Query of component [linux] set 
> >priority to 10
> >[node01:31105] mca:base:select:(paffinity) Selected component [linux]
> >[node01:31105] paffinity slot assignment: slot_list == 0-3
> >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> >[node01:31106] mca:base:select:(paffinity) Querying component [linux]
> >[node01:31106] mca:base:select:(paffinity) Query of component [linux] set 
> >priority to 10
> >[node01:31106] mca:base:select:(paffinity) Selected component [linux]
> >[node01:31106] paffinity slot assignment: slot_list == 0-3
> >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> >[node01:31106] *** An error occurred in MPI_Comm_rank
> >[node01:31106] *** on a NULL communicator
> >[node01:31106] *** Unknown error
> >[node01:31106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> >forrtl: severe (174): SIGSEGV, segmentation fault occurred
> >
> >
> >The spawned compute process doesn't sense that it should skip the setting 
> >paffinity...
> >
> >
> >I saw the posting from last July about a similar problem (the problem that I 
> >mentioned on the bottom, with the slot=0:* notation not working). But that 
> >is a different problem (besides, that is still not working as it seems).
> >
> >Best,
> >Daan
> >
> >* on Saturday, 12.12.09 at 18:48, Ralph Castain  wrote:
> >
> >>This looks like an uninitialized variable that gnu c handles one way and 
> >>intel another. Someone recently contributed a patch to the ompi trunk to 
> >>fix just such a  thing in this code area - don't know if it addresses this 
> >>problem or not.
> >>
> >>Can you try the ompi trunk (a nightly tarball from the last day or so 
> >>forward) and see if this still occurs?
> >>
> >>Thanks
> >>Ralph
> >>
> >>On Dec 11, 2009, at 4:06 PM, Daan van Rossum wrote:
> >>
> >>>Hi all,
> >>>
> >>>There's a problem with ompi 1.3.4 when compiled with the intel 11.1.059 c 
> >>>compiler, related with the built in processor binding functionallity. The 
> >>>problem does not occur when ompi is compiled with the gnu c compiler.
> >>>
> >>>A mpi program execution fails (segfault) on mpi_init() when the following 
> >>>rank file is used:
> >>>rank 0=node01 slot=0-3
> >>>rank 1=node01 slot=0-3
> >>>but runs fine with:
> >>>rank 0=node01 slot=0
> >>>rank 1=node01 slot=1-3
> >>>and fine with:
> >>>rank 0=node01 slot=0-1
> >>>rank 1=node01 slot=1-3
> >>>but segfaults with:
> >>>rank 0=node01 slot=0-2
> >>>rank 1=node01 slot=1-3
> >>>
> >>>This is on a two-processor quad-core opteron machine (occurs on all nodes 
> >>>of the cluster) with Ubuntu 8.10, kernel 2.6.27-16.
> >>>This is the siplest case tha

Re: [OMPI devel] SEGFAULT in mpi_init from paffinity with intel 11.1.059 compiler

2009-12-16 Thread Lenny Verkhovsky
Hi,
can you provide $cat /proc/cpuinfo
I am not optimistic that it will help, but still...
thanks
Lenny.

On Wed, Dec 16, 2009 at 6:01 PM, Daan van Rossum wrote:

> Hi Terry,
>
> Thanks for your hint. I tried configure --enable-debug and even compiled it
> with all kind of manual debug flags turned on, but it doesn't help to get
> rid of this problem. So it definitively is not an optimization flaw.
> One more interesting test would be to try an older version of the Intel
> compiler. But the next older version that I have is 10.0.015, which is too
> old for the operating system (must be >10.1).
>
>
> A good thing is that this bug is very easy to test. You only need one line
> of MPI code and one process in the execution.
>
> A few more test cases:
>  rank 0=node01 slot=1-7
> and
>  rank 0=node01 slot=0,2-7
> and
>  rank 0=node01 slot=0-1,3-7
> work WELL.
> But
>  rank 0=node01 slot=0-2,4-7
> FAILS.
>
> As long as either slot 0, 1, OR 2 is excluded from the list it's allright.
> Excluding a different slot, like slot 3, does not help.
>
>
> I'll try to get hold of an Intel v10.1 compiler version.
>
> Best,
> Daan
>
> * on Monday, 14.12.09 at 14:57, Terry Dontje  wrote:
>
> > I don't really want to throw fud on this list but we've seen all
> > sorts of oddities with OMPI 1.3.4 being built with Intel's 11.1
> > compiler versus their 11.0 or other compilers (gcc, Sun Studio, pgi,
> > and pathscale).  I have not tested your specific failing case but
> > considering your issue doesn't show up with gcc I am wondering if
> > there is some sort of optimization issue with the 11.1 compiler.
> >
> > It might be interesting to see if using certain optimization levels
> > with the Intel 11.1 compiler produces a working OMPI library.
> >
> > --td
> >
> > Daan van Rossum wrote:
> > >Hi Ralph,
> > >
> > >I took the Dec 10th snapshot, but got exactly the same behavior as with
> version 1.3.4.
> > >
> > >I just noticed that even this rankfile doesn't work, with a single
> process:
> > > rank 0=node01 slot=0-3
> > >
> > >
> > >[node01:31105] mca:base:select:(paffinity) Querying component [linux]
> > >[node01:31105] mca:base:select:(paffinity) Query of component [linux]
> set priority to 10
> > >[node01:31105] mca:base:select:(paffinity) Selected component [linux]
> > >[node01:31105] paffinity slot assignment: slot_list == 0-3
> > >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > >[node01:31106] mca:base:select:(paffinity) Querying component [linux]
> > >[node01:31106] mca:base:select:(paffinity) Query of component [linux]
> set priority to 10
> > >[node01:31106] mca:base:select:(paffinity) Selected component [linux]
> > >[node01:31106] paffinity slot assignment: slot_list == 0-3
> > >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > >[node01:31106] *** An error occurred in MPI_Comm_rank
> > >[node01:31106] *** on a NULL communicator
> > >[node01:31106] *** Unknown error
> > >[node01:31106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> > >forrtl: severe (174): SIGSEGV, segmentation fault occurred
> > >
> > >
> > >The spawned compute process doesn't sense that it should skip the
> setting paffinity...
> > >
> > >
> > >I saw the posting from last July about a similar problem (the problem
> that I mentioned on the bottom, with the slot=0:* notation not working). But
> that is a different problem (besides, that is still not working as it
> seems).
> > >
> > >Best,
> > >Daan
> > >
> > >* on Saturday, 12.12.09 at 18:48, Ralph Castain 
> wrote:
> > >
> > >>This looks like an uninitialized variable that gnu c handles one way
> and intel another. Someone recently contributed a patch to the ompi trunk to
> fix just such a  thing in this code area - don't know if it addresses this
> problem or not.
> > >>
> > >>Can you try the ompi trunk (a nightly tarball from the last day or so
> forward) and see if this still occurs?
> > >>
> > >>Thanks
> > >>Ralph
> > >>
> > >>On Dec 11, 2009, at 4:06 PM, Daan van Rossum wrote:
> > >>
> > >>>Hi all,
> > >>>
> > >>>There's a problem with ompi 1.3.4 when compiled with the intel
> 11.1.059 c compiler, related with the built in processor binding
> functionallity. The problem does not occur when ompi is compiled with the
> gnu c compiler.
> > >>>
> > >>>A mpi program execution fails (segfault) on mpi_init() when the
> following rank file is used:
> > >>>rank 0=node01 slot=0-3
> > >>>rank 1=node01 slot=0-3
> > >>>but runs fine with:
> > >>>rank 0=node01 slot=0
> > >>>rank 1=node01 slot=1-3
>

Re: [OMPI devel] SEGFAULT in mpi_init from paffinity with intel 11.1.059 compiler

2009-12-16 Thread Daan van Rossum
Sure. The processors were scaled down to 1000MHz while idling.
 (I hope this will show up as an attachment instead of inlined...)

* on Wednesday, 16.12.09 at 18:12, Lenny Verkhovsky 
 wrote:

> Hi,
> can you provide $cat /proc/cpuinfo
> I am not optimistic that it will help, but still...
> thanks
> Lenny.
> 
> On Wed, Dec 16, 2009 at 6:01 PM, Daan van Rossum 
> wrote:
> 
> > Hi Terry,
> >
> > Thanks for your hint. I tried configure --enable-debug and even compiled it
> > with all kind of manual debug flags turned on, but it doesn't help to get
> > rid of this problem. So it definitively is not an optimization flaw.
> > One more interesting test would be to try an older version of the Intel
> > compiler. But the next older version that I have is 10.0.015, which is too
> > old for the operating system (must be >10.1).
> >
> >
> > A good thing is that this bug is very easy to test. You only need one line
> > of MPI code and one process in the execution.
> >
> > A few more test cases:
> >  rank 0=node01 slot=1-7
> > and
> >  rank 0=node01 slot=0,2-7
> > and
> >  rank 0=node01 slot=0-1,3-7
> > work WELL.
> > But
> >  rank 0=node01 slot=0-2,4-7
> > FAILS.
> >
> > As long as either slot 0, 1, OR 2 is excluded from the list it's allright.
> > Excluding a different slot, like slot 3, does not help.
> >
> >
> > I'll try to get hold of an Intel v10.1 compiler version.
> >
> > Best,
> > Daan
> >
> > * on Monday, 14.12.09 at 14:57, Terry Dontje  wrote:
> >
> > > I don't really want to throw fud on this list but we've seen all
> > > sorts of oddities with OMPI 1.3.4 being built with Intel's 11.1
> > > compiler versus their 11.0 or other compilers (gcc, Sun Studio, pgi,
> > > and pathscale).  I have not tested your specific failing case but
> > > considering your issue doesn't show up with gcc I am wondering if
> > > there is some sort of optimization issue with the 11.1 compiler.
> > >
> > > It might be interesting to see if using certain optimization levels
> > > with the Intel 11.1 compiler produces a working OMPI library.
> > >
> > > --td
> > >
> > > Daan van Rossum wrote:
> > > >Hi Ralph,
> > > >
> > > >I took the Dec 10th snapshot, but got exactly the same behavior as with
> > version 1.3.4.
> > > >
> > > >I just noticed that even this rankfile doesn't work, with a single
> > process:
> > > > rank 0=node01 slot=0-3
> > > >
> > > >
> > > >[node01:31105] mca:base:select:(paffinity) Querying component [linux]
> > > >[node01:31105] mca:base:select:(paffinity) Query of component [linux]
> > set priority to 10
> > > >[node01:31105] mca:base:select:(paffinity) Selected component [linux]
> > > >[node01:31105] paffinity slot assignment: slot_list == 0-3
> > > >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > > >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > > >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > > >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > > >[node01:31106] mca:base:select:(paffinity) Querying component [linux]
> > > >[node01:31106] mca:base:select:(paffinity) Query of component [linux]
> > set priority to 10
> > > >[node01:31106] mca:base:select:(paffinity) Selected component [linux]
> > > >[node01:31106] paffinity slot assignment: slot_list == 0-3
> > > >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > > >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > > >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > > >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > > >[node01:31106] *** An error occurred in MPI_Comm_rank
> > > >[node01:31106] *** on a NULL communicator
> > > >[node01:31106] *** Unknown error
> > > >[node01:31106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> > > >forrtl: severe (174): SIGSEGV, segmentation fault occurred
> > > >
> > > >
> > > >The spawned compute process doesn't sense that it should skip the
> > setting paffinity...
> > > >
> > > >
> > > >I saw the posting from last July about a similar problem (the problem
> > that I mentioned on the bottom, with the slot=0:* notation not working). But
> > that is a different problem (besides, that is still not working as it
> > seems).
> > > >
> > > >Best,
> > > >Daan
> > > >
> > > >* on Saturday, 12.12.09 at 18:48, Ralph Castain 
> > wrote:
> > > >
> > > >>This looks like an uninitialized variable that gnu c handles one way
> > and intel another. Someone recently contributed a patch to the ompi trunk to
> > fix just such a  thing in this code area - don't know if it addresses this
> > problem or not.
> > > >>
> > > >>Can you try the ompi trunk (a nightly tarball from the last day or so
> > forward) and see if this still occurs?
> > > >>
> > > >>Thanks
> > > >>Ralph
> > > >>
> > > >>On Dec 11, 2009, at 4:06 PM, Daan van Rossum wrote:
> > > >>
> > > >>>Hi all,
> > > >>>
> > > >>>There's a problem with ompi 1.3.4 when compiled

Re: [OMPI devel] Bug or feature?

2009-12-16 Thread Jeff Squyres
I think I understand what you're saying:

- it's ok to abort during MPI_INIT (we can rationalize it as the default error 
handler)
- we should only abort during MPI functions

Is that right?  If so, I agree with your interpretation.  :-)  ...with one 
addition: it's ok to abort before MPI_INIT, because the MPI spec makes no 
guarantees about what happens before MPI_INIT.

Specifically, I'd argue that if you "mpirun -np N a.out" and at least 1 process 
calls MPI_INIT, then it is reasonable for OMPI to expect there to be N 
MPI_INIT's.  If any process exits without calling MPI_INIT -- regardless of 
that process' exit status -- it should be treated as an error.

Don't forget that we have a barrier in MPI_INIT (in most cases), so aborting 
when ORTE detects that a) at least one process has called MPI_INIT, and b) at 
least one process has exited without calling MPI_INIT, is acceptable to me.  
It's also acceptable to the first point above, because all the other processes 
are either stuck in the MPI_INIT (either at the barrier or getting there) or 
haven't yet entered MPI_INIT -- and the MPI spec makes no guarantees about what 
happens before MPI_INIT.

Does that make sense?
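
[Editorial note] To make the scenario concrete, here is a hypothetical reproducer (not taken from 
the thread; the environment-variable trigger is just a stand-in for whatever logic the real 
application used to decide to exit early):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    /* One process decides, before ever calling MPI_Init, that it has
     * nothing to do and exits cleanly (status 0, no signal).  The trigger
     * here is a made-up environment variable; in the real application the
     * decision came from the application's own logic. */
    if (NULL != getenv("OMPI_EXAMPLE_SKIP_INIT")) {
        return 0;
    }

    /* Every other process enters MPI_Init and waits (in OMPI's case,
     * typically at the barrier inside MPI_Init) for the peer that never
     * arrives -- hence the hang. */
    MPI_Init(&argc, &argv);
    printf("past MPI_Init\n");
    MPI_Finalize();
    return 0;
}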



On Dec 16, 2009, at 10:06 AM, George Bosilca wrote:

> There are two citation from the MPI standard that I would like to highlight.
> 
> > All MPI programs must contain exactly one call to an MPI initialization 
> > routine: MPI_INIT or MPI_INIT_THREAD.
> 
> > One goal of MPI is to achieve source code portability. By this we mean that 
> > a program written using MPI and complying with the relevant language 
> > standards is portable as written, and must not require any source code 
> > changes when moved from one system to another. This explicitly does not say 
> > anything about how an MPI program is started or launched from the command 
> > line, nor what the user must do to set up the environment in which an MPI 
> > program will run. However, an implementation may require some setup to be 
> > performed before other MPI routines may be called. To provide for this, MPI 
> > includes an initialization routine MPI_INIT.
> 
> While these two statement do not necessarily clarify the original question, 
> they highlight an acceptable solution. Before exiting the MPI_Init function 
> (which we don't have to assume as being collective), any "MPI-like" process 
> can be killed without problems (we can even claim that we call the default 
> error handler). For those that successfully exited the MPI_Init, I guess the 
> next MPI call will have to trigger the error handler and these processes 
> should be allowed to gracefully exit.
> 
> So, while it is clear that the best approach is to allow even bad application 
> to terminate, it is better if we follow what MPI describe as a "high quality 
> implementation".
> 
>   george.
> 
> 
> On Dec 15, 2009, at 23:17 , Ralph Castain wrote:
> 
> > Understandable - and we can count on your patch in the near future, then? 
> > :-)
> >
> > On Dec 15, 2009, at 9:12 PM, Paul H. Hargrove wrote:
> >
> >> My 0.02USD says that for pragmatic reasons one should attempt to terminate 
> >> the job in this case, regardless of ones opinion of this unusual 
> >> application behavior.
> >>
> >> -Paul
> >>
> >> Ralph Castain wrote:
> >>> Hi folks
> >>>
> >>> In case you didn't follow this on the user list, we had a question come 
> >>> up about proper OMPI behavior. Basically, the user has an application 
> >>> where one process decides it should cleanly terminate prior to calling 
> >>> MPI_Init, but all the others go ahead and enter MPI_Init. The application 
> >>> hangs since we don't detect the one proc's exit as an abnormal 
> >>> termination (no segfault, and it didn't call MPI_Init so it isn't 
> >>> required to call MPI_Finalize prior to termination).
> >>>
> >>> I can probably come up with a way to detect this scenario and abort it. 
> >>> But before I spend the effort chasing this down, my question to you MPI 
> >>> folks is:
> >>>
> >>> What -should- OMPI do in this situation? We have never previously 
> >>> detected such behavior - was this an oversight, or is this simply a "bad" 
> >>> application?
> >>>
> >>> Thanks
> >>> Ralph
> >>>
> >>>
> >>> ___
> >>> devel mailing list
> >>> de...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>
> >>
> >>
> >> --
> >> Paul H. Hargrove  phhargr...@lbl.gov
> >> Future Technologies Group Tel: +1-510-495-2352
> >> HPC Research Department   Fax: +1-510-486-6900
> >> Lawrence Berkeley National Laboratory
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> 

Re: [OMPI devel] Bug or feature?

2009-12-16 Thread George Bosilca
Makes perfect sense.

  george.

On Dec 16, 2009, at 13:27 , Jeff Squyres wrote:

> I think I understand you're saying:
> 
> - it's ok to abort during MPI_INIT (we can rationalize it as the default 
> error handler)
> - we should only abort during MPI functions
> 
> Is that right?  If so, I agree with your interpretation.  :-)  ...with one 
> addition: it's ok to abort before MPI_INIT, because the MPI spec makes no 
> guarantees about what happens before MPI_INIT.
> 
> Specifically, I'd argue that if you "mpirun -np N a.out" and at least 1 
> process calls MPI_INIT, then it is reasonable for OMPI to expect there to be 
> N MPI_INIT's.  If any process exits without calling MPI_INIT -- regardless of 
> that process' exit status -- it should be treated as an error.
> 
> Don't forget that we have a barrier in MPI_INIT (in most cases), so aborting 
> when ORTE detects that a) at least one process has called MPI_INIT, and b) at 
> least one process has exited without calling MPI_INIT, is acceptable to me.  
> It's also acceptable to the first point above, because all the other 
> processes are either stuck in the MPI_INIT (either at the barrier or getting 
> there) or haven't yet entered MPI_INIT -- and the MPI spec makes no 
> guarantees about what happens before MPI_INIT.
> 
> Does that make sense?
> 
> 
> 
> On Dec 16, 2009, at 10:06 AM, George Bosilca wrote:
> 
>> There are two citation from the MPI standard that I would like to highlight.
>> 
>>> All MPI programs must contain exactly one call to an MPI initialization 
>>> routine: MPI_INIT or MPI_INIT_THREAD.
>> 
>>> One goal of MPI is to achieve source code portability. By this we mean that 
>>> a program written using MPI and complying with the relevant language 
>>> standards is portable as written, and must not require any source code 
>>> changes when moved from one system to another. This explicitly does not say 
>>> anything about how an MPI program is started or launched from the command 
>>> line, nor what the user must do to set up the environment in which an MPI 
>>> program will run. However, an implementation may require some setup to be 
>>> performed before other MPI routines may be called. To provide for this, MPI 
>>> includes an initialization routine MPI_INIT.
>> 
>> While these two statement do not necessarily clarify the original question, 
>> they highlight an acceptable solution. Before exiting the MPI_Init function 
>> (which we don't have to assume as being collective), any "MPI-like" process 
>> can be killed without problems (we can even claim that we call the default 
>> error handler). For those that successfully exited the MPI_Init, I guess the 
>> next MPI call will have to trigger the error handler and these processes 
>> should be allowed to gracefully exit.
>> 
>> So, while it is clear that the best approach is to allow even bad 
>> application to terminate, it is better if we follow what MPI describe as a 
>> "high quality implementation".
>> 
>>  george.
>> 
>> 
>> On Dec 15, 2009, at 23:17 , Ralph Castain wrote:
>> 
>>> Understandable - and we can count on your patch in the near future, then? 
>>> :-)
>>> 
>>> On Dec 15, 2009, at 9:12 PM, Paul H. Hargrove wrote:
>>> 
 My 0.02USD says that for pragmatic reasons one should attempt to terminate 
 the job in this case, regardless of ones opinion of this unusual 
 application behavior.
 
 -Paul
 
 Ralph Castain wrote:
> Hi folks
> 
> In case you didn't follow this on the user list, we had a question come 
> up about proper OMPI behavior. Basically, the user has an application 
> where one process decides it should cleanly terminate prior to calling 
> MPI_Init, but all the others go ahead and enter MPI_Init. The application 
> hangs since we don't detect the one proc's exit as an abnormal 
> termination (no segfault, and it didn't call MPI_Init so it isn't 
> required to call MPI_Finalize prior to termination).
> 
> I can probably come up with a way to detect this scenario and abort it. 
> But before I spend the effort chasing this down, my question to you MPI 
> folks is:
> 
> What -should- OMPI do in this situation? We have never previously 
> detected such behavior - was this an oversight, or is this simply a "bad" 
> application?
> 
> Thanks
> Ralph
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
 
 
 --
 Paul H. Hargrove  phhargr...@lbl.gov
 Future Technologies Group Tel: +1-510-495-2352
 HPC Research Department   Fax: +1-510-486-6900
 Lawrence Berkeley National Laboratory
 ___
 devel mailing list
 de...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> 
>>> __

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-16 Thread Ralph Castain
Argh. I know the problem here - per my note on the user list, I actually found more 
than five months ago that we weren't properly serializing commands in the 
system and created a fix for it. I applied that fix only to the comm_spawn 
scenario at the time as this was the source of the pain - but I noted in my 
commit that this needed to be propagated everywhere we were doing async message 
receives (see r21717).

Unfortunately, I then got sidetracked onto a dozen other things...and never 
completed the fixes to the rest of the code base. Sigh.

Take a look at orte/mca/rml/rml_types.h and you will see the required macro 
called ORTE_PROCESS_MESSAGE. Then look in orte/mca/plm/base/plm_base_receive.c 
and you will see what needs to be done to the places where messages are 
received and we use ORTE_PROGRESSED_WAIT.
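
[Editorial note] For readers unfamiliar with the issue, here is a generic sketch of the 
serialization idea only -- it is not ORTE's code, and ORTE_PROCESS_MESSAGE may well do something 
more involved. The point is that an asynchronous receive callback never acts on a command 
directly; it only queues a copy, and a single progress loop drains the queue so commands are 
handled one at a time, in order:

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

typedef struct msg {
    struct msg *next;
    size_t      len;
    char        data[];        /* payload copied from the receive callback */
} msg_t;

static msg_t          *head, *tail;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Called from the asynchronous receive callback: do NOT process the
 * command here, just enqueue a copy of it. */
void enqueue_message(const void *buf, size_t len)
{
    msg_t *m = malloc(sizeof(*m) + len);
    m->next = NULL;
    m->len  = len;
    memcpy(m->data, buf, len);

    pthread_mutex_lock(&lock);
    if (tail) tail->next = m; else head = m;
    tail = m;
    pthread_mutex_unlock(&lock);
}

/* Called from one place only (the progress loop), so commands are
 * handled strictly in the order they were queued. */
void process_pending_messages(void (*handler)(const char *, size_t))
{
    for (;;) {
        pthread_mutex_lock(&lock);
        msg_t *m = head;
        if (m) {
            head = m->next;
            if (NULL == head) tail = NULL;
        }
        pthread_mutex_unlock(&lock);

        if (NULL == m) return;
        handler(m->data, m->len);
        free(m);
    }
}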

I'll try to get the area you noted fixed over the next few days. Then I will try 
to work my way through the code base as originally noted.

Ralph

On Dec 3, 2009, at 9:31 AM, Sylvain Jeaugey wrote:

> Too bad. But no problem, that's very nice of you to have spent so much time 
> on this.
> 
> I wish I knew why our experiments are so different, maybe we will find out 
> eventually ...
> 
> Sylvain
> 
> On Wed, 2 Dec 2009, Ralph Castain wrote:
> 
>> I'm sorry, Sylvain - I simply cannot replicate this problem (tried yet 
>> another slurm system):
>> 
>> ./configure --prefix=blah --with-platform=contrib/platform/iu/odin/debug
>> 
>> [rhc@odin ~]$ salloc -N 16 tcsh
>> salloc: Granted job allocation 75294
>> [rhc@odin mpi]$ mpirun -pernode ./hello
>> Hello, World, I am 1 of 16
>> Hello, World, I am 7 of 16
>> Hello, World, I am 15 of 16
>> Hello, World, I am 4 of 16
>> Hello, World, I am 13 of 16
>> Hello, World, I am 3 of 16
>> Hello, World, I am 5 of 16
>> Hello, World, I am 8 of 16
>> Hello, World, I am 0 of 16
>> Hello, World, I am 9 of 16
>> Hello, World, I am 12 of 16
>> Hello, World, I am 2 of 16
>> Hello, World, I am 6 of 16
>> Hello, World, I am 10 of 16
>> Hello, World, I am 14 of 16
>> Hello, World, I am 11 of 16
>> [rhc@odin mpi]$ setenv ORTE_RELAY_DELAY 1
>> [rhc@odin mpi]$ mpirun -pernode ./hello
>> [odin.cs.indiana.edu:15280] [[28699,0],0] delaying relay by 1 seconds
>> [odin.cs.indiana.edu:15280] [[28699,0],0] delaying relay by 1 seconds
>> [odin.cs.indiana.edu:15280] [[28699,0],0] delaying relay by 1 seconds
>> [odin.cs.indiana.edu:15280] [[28699,0],0] delaying relay by 1 seconds
>> Hello, World, I am 2 of 16
>> Hello, World, I am 0 of 16
>> Hello, World, I am 3 of 16
>> Hello, World, I am 1 of 16
>> Hello, World, I am 4 of 16
>> Hello, World, I am 10 of 16
>> Hello, World, I am 7 of 16
>> Hello, World, I am 12 of 16
>> Hello, World, I am 6 of 16
>> Hello, World, I am 8 of 16
>> Hello, World, I am 5 of 16
>> Hello, World, I am 13 of 16
>> Hello, World, I am 11 of 16
>> Hello, World, I am 14 of 16
>> Hello, World, I am 9 of 16
>> Hello, World, I am 15 of 16
>> [odin.cs.indiana.edu:15280] [[28699,0],0] delaying relay by 1 seconds
>> [rhc@odin mpi]$ setenv ORTE_RELAY_DELAY 2
>> [rhc@odin mpi]$ mpirun -pernode ./hello
>> [odin.cs.indiana.edu:15302] [[28781,0],0] delaying relay by 2 seconds
>> [odin.cs.indiana.edu:15302] [[28781,0],0] delaying relay by 2 seconds
>> [odin.cs.indiana.edu:15302] [[28781,0],0] delaying relay by 2 seconds
>> [odin.cs.indiana.edu:15302] [[28781,0],0] delaying relay by 2 seconds
>> Hello, World, I am 2 of 16
>> Hello, World, I am 3 of 16
>> Hello, World, I am 4 of 16
>> Hello, World, I am 7 of 16
>> Hello, World, I am 6 of 16
>> Hello, World, I am 0 of 16
>> Hello, World, I am 1 of 16
>> Hello, World, I am 10 of 16
>> Hello, World, I am 5 of 16
>> Hello, World, I am 9 of 16
>> Hello, World, I am 8 of 16
>> Hello, World, I am 14 of 16
>> Hello, World, I am 13 of 16
>> Hello, World, I am 12 of 16
>> Hello, World, I am 11 of 16
>> Hello, World, I am 15 of 16
>> [odin.cs.indiana.edu:15302] [[28781,0],0] delaying relay by 2 seconds
>> [rhc@odin mpi]$
>> 
>> Sorry I don't have more time to continue pursuing this. I have no idea what 
>> is going on with your system(s), but it clearly is something peculiar to 
>> what you are doing or the system(s) you are running on.
>> 
>> Ralph
>> 
>> 
>> On Dec 2, 2009, at 1:56 AM, Sylvain Jeaugey wrote:
>> 
>>> Ok, so I tried with RHEL5 and I get the same (even at 6 nodes) : when 
>>> setting ORTE_RELAY_DELAY to 1, I get the deadlock systematically with the 
>>> typical stack.
>>> 
>>> Without my "reproducer patch", 80 nodes was the lower bound to reproduce 
>>> the bug (and you needed a couple of runs to get it). But since this is a 
>>> race condition, your mileage may vary on a different cluster.
>>> 
>>> With the patch however, I'm in every time. I'll continue to try different 
>>> configurations (e.g. without slurm ...) to see if I can reproduce it on 
>>> much common configurations.
>>> 
>>> Sylvain
>>> 
>>> On Mon, 30 Nov 2009, Sylvain Jeaugey wrote:
>>> 
 Ok. Maybe I should try on a RHEL5 then.
 
 About the compilers, I've t