Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-16 Thread Ralph Castain
Argh. I know the problem here - per note on user list, I actually found more than five months ago that we weren't properly serializing commands in the system and created a fix for it. I applied that fix only to the comm_spawn scenario at the time as this was the source of the pain - but I noted
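
As a rough illustration of what "serializing commands" can mean, here is a minimal sketch in which a command submitted while another is still running is queued and run afterwards instead of being executed nested. It is not the fix mentioned above, whose details are not shown in this excerpt; every name (dispatcher, submit, record, spawn_then_record) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch, not ORTE code: commands run strictly one at a time. */
#define MAX_PENDING 8
#define MAX_LOG 16

typedef struct dispatcher dispatcher;
typedef void (*cmd_fn)(dispatcher *d, int arg);

typedef struct { cmd_fn fn; int arg; } command;

struct dispatcher {
    command pending[MAX_PENDING]; /* FIFO ring buffer of waiting commands */
    size_t head, count;
    bool busy;                    /* true while a command is executing */
    int log[MAX_LOG];             /* order in which commands actually ran */
    size_t log_len;
};

/* Queue a command; drain the queue only if no command is currently running. */
static void submit(dispatcher *d, cmd_fn fn, int arg)
{
    assert(d->count < MAX_PENDING);
    d->pending[(d->head + d->count) % MAX_PENDING] = (command){fn, arg};
    d->count++;
    if (d->busy) return;          /* the running drain loop will pick it up */
    d->busy = true;
    while (d->count > 0) {
        command c = d->pending[d->head];
        d->head = (d->head + 1) % MAX_PENDING;
        d->count--;
        c.fn(d, c.arg);
    }
    d->busy = false;
}

/* Record that a command ran. */
static void record(dispatcher *d, int arg)
{
    if (d->log_len < MAX_LOG) d->log[d->log_len++] = arg;
}

/* A command that submits a follow-up command while it is still running. */
static void spawn_then_record(dispatcher *d, int arg)
{
    submit(d, record, arg + 1);   /* queued, not executed nested */
    record(d, arg);
}
```

With this shape, the follow-up command always runs after the command that submitted it has finished, which is the ordering guarantee serialization is meant to give.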

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-03 Thread Sylvain Jeaugey
Too bad. But no problem, that's very nice of you to have spent so much time on this. I wish I knew why our experiments are so different, maybe we will find out eventually ... Sylvain On Wed, 2 Dec 2009, Ralph Castain wrote: I'm sorry, Sylvain - I simply cannot replicate this problem (tried

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-02 Thread Ralph Castain
I'm sorry, Sylvain - I simply cannot replicate this problem (tried yet another slurm system): ./configure --prefix=blah --with-platform=contrib/platform/iu/odin/debug [rhc@odin ~]$ salloc -N 16 tcsh salloc: Granted job allocation 75294 [rhc@odin mpi]$ mpirun -pernode ./hello Hello, World, I am 1

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-02 Thread Sylvain Jeaugey
Ok, so I tried with RHEL5 and I get the same (even at 6 nodes) : when setting ORTE_RELAY_DELAY to 1, I get the deadlock systematically with the typical stack. Without my "reproducer patch", 80 nodes was the lower bound to reproduce the bug (and you needed a couple of runs to get it). But since

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-01 Thread Ralph Castain
On Dec 1, 2009, at 5:48 PM, Jeff Squyres wrote: > On Dec 1, 2009, at 5:52 PM, Ralph Castain wrote: > >> > So perhaps it can become a param in the downcall to the MCA base as to >> > whether the priority params should be automatically registered...? >> >> I can live with that, though I again qu

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-01 Thread Jeff Squyres
On Dec 1, 2009, at 5:52 PM, Ralph Castain wrote: > So perhaps it can become a param in the downcall to the MCA base as to whether the priority params should be automatically registered...? I can live with that, though I again question why anything needs to be automatically registered. It

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-01 Thread Ralph Castain
On Dec 1, 2009, at 3:40 PM, Jeff Squyres wrote: > On Dec 1, 2009, at 5:23 PM, Ralph Castain wrote: > >> The only issue with that is it implies there is a param that can be adjusted >> - and there isn't. So it can confuse a user - or even a developer, as it did >> here. >> >> I should think we

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-01 Thread Jeff Squyres
On Dec 1, 2009, at 5:23 PM, Ralph Castain wrote: The only issue with that is it implies there is a param that can be adjusted - and there isn't. So it can confuse a user - or even a developer, as it did here. I should think we wouldn't want MCA to automatically add any parameter. If the c

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-01 Thread Ralph Castain
The only issue with that is it implies there is a param that can be adjusted - and there isn't. So it can confuse a user - or even a developer, as it did here. I should think we wouldn't want MCA to automatically add any parameter. If the component doesn't register it, then it shouldn't exist. T

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-01 Thread Jeff Squyres
This is not a bug, it's a feature. :-) The MCA base automatically adds a priority MCA parameter for every component. On Dec 1, 2009, at 7:40 AM, Ralph Castain wrote: I'm afraid Sylvain is right, and we have a bug in ompi_info: MCA routed: parameter "routed_binomial_priori
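
The parameter names quoted in this thread ("routed_binomial_priority", "routed_cm_priority") follow a framework_component_priority pattern. A minimal sketch of composing such a name is below; priority_param_name is a hypothetical helper written for illustration, not an MCA base function, and it says nothing about how the MCA base actually registers the parameter.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of the MCA base.
 * Writes "<framework>_<component>_priority" into buf.
 * Returns 0 on success, -1 on empty input or if buf is too small. */
static int priority_param_name(char *buf, size_t len,
                               const char *framework, const char *component)
{
    assert(buf != NULL && framework != NULL && component != NULL);
    if (strlen(framework) == 0 || strlen(component) == 0) return -1;
    int n = snprintf(buf, len, "%s_%s_priority", framework, component);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```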

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-01 Thread Ralph Castain
I'm afraid Sylvain is right, and we have a bug in ompi_info: MCA routed: parameter "routed_binomial_priority" (current value: <0>, data source: default value) MCA routed: parameter "routed_cm_priority" (current value: <0>, data source: default value) MCA

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-30 Thread Jeff Squyres
On Nov 30, 2009, at 10:48 AM, Sylvain Jeaugey wrote: About my previous e-mail, I was wrong about all components having a 0 priority : it was based on default parameters reported by "ompi_info -a | grep routed". It seems that the truth is not always in ompi_info ... ompi_info *does* always

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-30 Thread Sylvain Jeaugey
Ok. Maybe I should try on a RHEL5 then. About the compilers, I've tried with both gcc and intel and it doesn't seem to make a difference. On Mon, 30 Nov 2009, Ralph Castain wrote: Interesting. The only difference I see is the FC11 - I haven't seen anyone running on that OS yet. I wonder if t

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-30 Thread Ralph Castain
Interesting. The only difference I see is the FC11 - I haven't seen anyone running on that OS yet. I wonder if that is the source of the trouble? Do we know that our code works on that one? I know we had problems in the past with FC9, for example, that required fixes. Also, what compiler are yo

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-30 Thread Sylvain Jeaugey
Hi Ralph, I'm also puzzled :-) Here is what I did today : * download the latest nightly build (openmpi-1.7a1r22241) * untar it * patch it with my "ORTE_RELAY_DELAY" patch * build it directly on the cluster (running FC11) with : ./configure --platform=contrib/platform/lanl/tlcc/debug-nopana

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-27 Thread Ralph Castain
On Nov 27, 2009, at 8:23 AM, Sylvain Jeaugey wrote: > Hi Ralph, > > I tried with the trunk and it makes no difference for me. Strange > > Looking at potential differences, I found out something strange. The bug may > have something to do with the "routed" framework. I can reproduce the bug

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-27 Thread Sylvain Jeaugey
Hi Ralph, I tried with the trunk and it makes no difference for me. Looking at potential differences, I found out something strange. The bug may have something to do with the "routed" framework. I can reproduce the bug with binomial and direct, but not with cm and linear (you disabled the bui

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-26 Thread Ralph Castain
Just to clarify something: I have been testing with the trunk, NOT the 1.5 branch. I haven't even bothered to look at that code since it was branched. From what little I have heard plus what I (and others) have done since the branch, I strongly suspect a complete ORTE refresh will be required

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-26 Thread Ralph Castain
Hi Sylvain Well, I hate to tell you this, but I cannot reproduce the "bug" even with this code in ORTE no matter what value of ORTE_RELAY_DELAY I use. The system runs really slow as I increase the delay, but it completes the job just fine. I ran jobs across 16 nodes on a slurm machine, 1-4 ppn,

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-20 Thread Ralph Castain
BTW: does this reproduce on the trunk and/or 1.3.4 as well? I'm wondering because we know the 1.5 branch is skewed relative to the trunk. Could well be a bug sitting over there. On Nov 20, 2009, at 7:06 AM, Ralph Castain wrote: > Thanks! I'll give it a try. > > My tests are all conducted with

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-20 Thread Ralph Castain
Thanks! I'll give it a try. My tests are all conducted with fast launches (just running slurm on large clusters) and using an mpi hello world that calls mpi_init at first instruction. I'll see if adding the delay causes it to misbehave. On Nov 20, 2009, at 6:55 AM, Sylvain Jeaugey wrote: > Hi

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-20 Thread Sylvain Jeaugey
Hi Ralph, Thanks for your efforts. I will look at our configuration and see how it may differ from yours. Here is a patch which helps reproduce the bug even with a small number of nodes. diff -r b622b9e8f1ac orte/orted/orted_comm.c --- a/orte/orted/orted_comm.c Wed Nov 18 09:27:55 2009 +
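
The patch is cut off in this excerpt, so the following is not it. As a loose sketch of how a debug delay knob such as ORTE_RELAY_DELAY could be read, assuming it comes from the environment as a non-negative integer (both are assumptions, as are the function names), one might write:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical: parse a delay value; anything malformed or negative means 0. */
static int parse_delay(const char *val)
{
    if (val == NULL || *val == '\0') return 0;
    char *end = NULL;
    errno = 0;
    long n = strtol(val, &end, 10);
    if (errno != 0 || *end != '\0' || n < 0 || n > INT_MAX) return 0;
    return (int)n;
}

/* Hypothetical: read the delay from the named environment variable.
 * An unset variable yields 0, i.e. no artificial delay. */
static int relay_delay_from_env(const char *name)
{
    assert(name != NULL);
    return parse_delay(getenv(name));
}
```

Defaulting to 0 keeps normal runs unaffected; only a run that sets the variable explicitly would be slowed down.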

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-19 Thread Ralph Castain
Hi Sylvain I've spent several hours trying to replicate the behavior you described on clusters up to a couple of hundred nodes (all running slurm), without success. I'm becoming increasingly convinced that this is a configuration issue as opposed to a code issue. I have enclosed the platform f

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-19 Thread Ralph Castain
On Nov 19, 2009, at 7:52 AM, Sylvain Jeaugey wrote: > Thank you Ralph for this precious help. > > I set up a quick-and-dirty patch basically postponing process_msg (hence > daemon_collective) until the launch is done. In process_msg, I therefore > requeue a process_msg handler and return. That

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-19 Thread Sylvain Jeaugey
Thank you Ralph for this precious help. I set up a quick-and-dirty patch basically postponing process_msg (hence daemon_collective) until the launch is done. In process_msg, I therefore requeue a process_msg handler and return. In this "all-must-be-non-blocking-and-done-through-opal_progress"
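
The requeue-and-return idea described here can be sketched as follows. This is a toy model under stated assumptions, not the actual patch: the queue type, process_msg_sketch, progress_once and the launch_done flag are all hypothetical stand-ins for whatever ORTE uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch, not ORTE code. */
#define QUEUE_CAP 16

typedef struct {
    int ids[QUEUE_CAP];   /* pending message ids, used as a ring buffer */
    size_t head, count;
} msg_queue;

static bool queue_push(msg_queue *q, int id)
{
    if (q->count == QUEUE_CAP) return false;
    q->ids[(q->head + q->count) % QUEUE_CAP] = id;
    q->count++;
    return true;
}

static bool queue_pop(msg_queue *q, int *id)
{
    if (q->count == 0) return false;
    *id = q->ids[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return true;
}

/* Handler: while the launch is not done, put the message back and return
 * immediately instead of blocking. Returns true if the message was handled. */
static bool process_msg_sketch(msg_queue *q, int id, bool launch_done,
                               int *processed)
{
    if (!launch_done) {
        bool ok = queue_push(q, id);   /* requeue for a later pass */
        assert(ok);
        (void)ok;
        return false;
    }
    (*processed)++;
    return true;
}

/* One pass of a progress loop: visit each message queued at entry once,
 * so requeued messages cannot make the pass spin forever. */
static int progress_once(msg_queue *q, bool launch_done)
{
    int processed = 0;
    size_t pending = q->count;
    for (size_t i = 0; i < pending; i++) {
        int id;
        if (!queue_pop(q, &id)) break;
        process_msg_sketch(q, id, launch_done, &processed);
    }
    return processed;
}
```

Bounding each pass by the count at entry matters: a handler that requeues unconditionally inside an unbounded drain loop would never return control to the code that completes the launch.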

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-19 Thread Ralph Castain
Very strange. As I said, we routinely launch jobs spanning several hundred nodes without problem. You can see the platform files for that setup in contrib/platform/lanl/tlcc That said, it is always possible you are hitting some kind of race condition we don't hit. In looking at the code, one po

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-19 Thread Sylvain Jeaugey
I would say I use the default settings, i.e. I don't set anything "special" at configure. I'm launching my processes with SLURM (salloc + mpirun). Sylvain On Wed, 18 Nov 2009, Ralph Castain wrote: How did you configure OMPI? What launch mechanism are you using - ssh? On Nov 17, 2009, at 9:

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-18 Thread Ralph Castain
How did you configure OMPI? What launch mechanism are you using - ssh? On Nov 17, 2009, at 9:01 AM, Sylvain Jeaugey wrote: > I don't think so, and I'm not doing it explicitly at least. How do I know? > > Sylvain > > On Tue, 17 Nov 2009, Ralph Castain wrote: > >> We routinely launch across t

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-17 Thread Sylvain Jeaugey
I don't think so, and I'm not doing it explicitly at least. How do I know? Sylvain On Tue, 17 Nov 2009, Ralph Castain wrote: We routinely launch across thousands of nodes without a problem...I have never seen it stick in this fashion. Did you build and/or are using ORTE threaded by any ch

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-17 Thread Ralph Castain
We routinely launch across thousands of nodes without a problem...I have never seen it stick in this fashion. Did you build and/or are using ORTE threaded by any chance? If so, that definitely won't work. On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote: > Hi all, > > We are currently exper