Argh. I know the problem here - per my note on the user list, I actually found
more than five months ago that we weren't properly serializing commands in the
system, and created a fix for it. I applied that fix only to the comm_spawn
scenario at the time, as that was the source of the pain - but I noted
Too bad. But no problem, that's very nice of you to have spent so much
time on this.
I wish I knew why our experiments are so different, maybe we will find out
eventually ...
Sylvain
On Wed, 2 Dec 2009, Ralph Castain wrote:
I'm sorry, Sylvain - I simply cannot replicate this problem (tried yet another
slurm system):
./configure --prefix=blah --with-platform=contrib/platform/iu/odin/debug
[rhc@odin ~]$ salloc -N 16 tcsh
salloc: Granted job allocation 75294
[rhc@odin mpi]$ mpirun -pernode ./hello
Hello, World, I am 1
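For reference, the ./hello used here is just an MPI hello world; the actual test
source is not shown in the thread, but a minimal sketch of that kind of program
(calling MPI_Init as its first instruction, as described later in the thread)
would be:

/* Minimal MPI hello world of the kind used in these tests (a sketch; the
 * actual ./hello source is not part of the thread). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);                /* first instruction: enter MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank within MPI_COMM_WORLD */
    printf("Hello, World, I am %d\n", rank);
    MPI_Finalize();
    return 0;
}

Built with mpicc and launched with mpirun -pernode under the salloc allocation,
each node prints one such line.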
OK, so I tried with RHEL5 and I get the same result (even at 6 nodes): when
setting ORTE_RELAY_DELAY to 1, I get the deadlock systematically, with the
typical stack.
Without my "reproducer patch", 80 nodes was the lower bound to reproduce
the bug (and you needed a couple of runs to get it). But since
On Dec 1, 2009, at 5:48 PM, Jeff Squyres wrote:
> On Dec 1, 2009, at 5:52 PM, Ralph Castain wrote:
>
>> > So perhaps it can become a param in the downcall to the MCA base as to
>> > whether the priority params should be automatically registered...?
>>
>> I can live with that, though I again qu
On Dec 1, 2009, at 5:52 PM, Ralph Castain wrote:
> So perhaps it can become a param in the downcall to the MCA base
as to whether the priority params should be automatically
registered...?
I can live with that, though I again question why anything needs to
be automatically registered. It
On Dec 1, 2009, at 3:40 PM, Jeff Squyres wrote:
> On Dec 1, 2009, at 5:23 PM, Ralph Castain wrote:
>
>> The only issue with that is it implies there is a param that can be adjusted
>> - and there isn't. So it can confuse a user - or even a developer, as it did
>> here.
>>
>> I should think we
On Dec 1, 2009, at 5:23 PM, Ralph Castain wrote:
The only issue with that is it implies there is a param that can be
adjusted - and there isn't. So it can confuse a user - or even a
developer, as it did here.
I should think we wouldn't want MCA to automatically add any
parameter. If the c
The only issue with that is it implies there is a param that can be adjusted -
and there isn't. So it can confuse a user - or even a developer, as it did here.
I should think we wouldn't want MCA to automatically add any parameter. If the
component doesn't register it, then it shouldn't exist. T
This is not a bug, it's a feature. :-)
The MCA base automatically adds a priority MCA parameter for every
component.
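For context, the mechanism being debated looks roughly like this on the
component side: a component that wants to control its own priority registers
the parameter itself in its open callback, using the MCA base call of that era.
This is a sketch only - the component name, variable, and default value below
are illustrative, not taken from the thread:

#include "opal_config.h"
#include "opal/mca/mca.h"
#include "opal/mca/base/mca_base_param.h"

/* "my_component" stands in for a real component's mca_base_component_t. */
extern mca_base_component_t my_component;

static int my_priority;

/* Typical pattern inside a component's open callback: register
 * "<framework>_<component>_priority" explicitly instead of relying on
 * the automatically added parameter. */
static int my_component_open(void)
{
    mca_base_param_reg_int(&my_component, "priority",
                           "Priority of this component",
                           false,   /* not internal  */
                           false,   /* not read-only */
                           50,      /* illustrative default */
                           &my_priority);
    return 0;
}

If the component never makes such a call, the question in this exchange is
whether the base should still surface a priority parameter (defaulting to 0)
on its behalf.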
On Dec 1, 2009, at 7:40 AM, Ralph Castain wrote:
I'm afraid Sylvain is right, and we have a bug in ompi_info:
MCA routed: parameter
"routed_binomial_priori
I'm afraid Sylvain is right, and we have a bug in ompi_info:
MCA routed: parameter "routed_binomial_priority" (current value: <0>, data source: default value)
MCA routed: parameter "routed_cm_priority" (current value: <0>, data source: default value)
MCA
On Nov 30, 2009, at 10:48 AM, Sylvain Jeaugey wrote:
About my previous e-mail, I was wrong about all components having a 0
priority: it was based on default parameters reported by "ompi_info -a | grep routed".
It seems that the truth is not always in ompi_info ...
ompi_info *does* always
OK. Maybe I should try on RHEL5 then.
About the compilers, I've tried with both gcc and intel and it doesn't
seem to make a difference.
On Mon, 30 Nov 2009, Ralph Castain wrote:
Interesting. The only difference I see is the FC11 - I haven't seen
anyone running on that OS yet. I wonder if t
Interesting. The only difference I see is the FC11 - I haven't seen anyone
running on that OS yet. I wonder if that is the source of the trouble? Do we
know that our code works on that one? I know we had problems in the past with
FC9, for example, that required fixes.
Also, what compiler are yo
Hi Ralph,
I'm also puzzled :-)
Here is what I did today:
* download the latest nightly build (openmpi-1.7a1r22241)
* untar it
* patch it with my "ORTE_RELAY_DELAY" patch
* build it directly on the cluster (running FC11) with:
./configure --platform=contrib/platform/lanl/tlcc/debug-nopana
On Nov 27, 2009, at 8:23 AM, Sylvain Jeaugey wrote:
> Hi Ralph,
>
> I tried with the trunk and it makes no difference for me.
Strange
>
> Looking at potential differences, I found out something strange. The bug may
> have something to do with the "routed" framework. I can reproduce the bug
Hi Ralph,
I tried with the trunk and it makes no difference for me.
Looking at potential differences, I found out something strange. The bug
may have something to do with the "routed" framework. I can reproduce the
bug with binomial and direct, but not with cm and linear (you disabled the
bui
Just to clarify something: I have been testing with the trunk, NOT the 1.5
branch. I haven't even bothered to look at that code since it was branched.
From what little I have heard, plus what I (and others) have done since the
branch, I strongly suspect a complete ORTE refresh will be required
Hi Sylvain
Well, I hate to tell you this, but I cannot reproduce the "bug" even with this
code in ORTE no matter what value of ORTE_RELAY_DELAY I use. The system runs
really slow as I increase the delay, but it completes the job just fine. I ran
jobs across 16 nodes on a slurm machine, 1-4 ppn,
BTW: does this reproduce on the trunk and/or 1.3.4 as well? I'm wondering
because we know the 1.5 branch is skewed relative to the trunk. Could well be a
bug sitting over there.
On Nov 20, 2009, at 7:06 AM, Ralph Castain wrote:
> Thanks! I'll give it a try.
>
> My tests are all conducted with
Thanks! I'll give it a try.
My tests are all conducted with fast launches (just running slurm on large
clusters) and using an mpi hello world that calls mpi_init at first
instruction. I'll see if adding the delay causes it to misbehave.
On Nov 20, 2009, at 6:55 AM, Sylvain Jeaugey wrote:
> Hi
Hi Ralph,
Thanks for your efforts. I will look at our configuration and see how it
may differ from yours.
Here is a patch which helps reproduce the bug even with a small number
of nodes.
diff -r b622b9e8f1ac orte/orted/orted_comm.c
--- a/orte/orted/orted_comm.c Wed Nov 18 09:27:55 2009 +
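The diff itself is truncated here. Judging from the description in the thread
(an ORTE_RELAY_DELAY environment variable that slows things down enough to make
the race easy to hit on a handful of nodes), its effect is roughly the
following; this is a sketch of the idea, not the actual hunk, and where exactly
the delay goes in orted_comm.c is not visible here:

/* Sketch of the reproducer's idea (not the actual patch): stall the orted
 * relay path for ORTE_RELAY_DELAY seconds so the race with the local
 * launch/collective becomes easy to hit even on a few nodes. */
#include <stdlib.h>
#include <unistd.h>

static void maybe_delay_relay(void)
{
    const char *delay = getenv("ORTE_RELAY_DELAY");

    if (NULL != delay) {
        sleep((unsigned int) atoi(delay));  /* widen the race window */
    }
}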
Hi Sylvain
I've spent several hours trying to replicate the behavior you described on
clusters up to a couple of hundred nodes (all running slurm), without success.
I'm becoming increasingly convinced that this is a configuration issue as
opposed to a code issue.
I have enclosed the platform f
On Nov 19, 2009, at 7:52 AM, Sylvain Jeaugey wrote:
> Thank you Ralph for this precious help.
>
> I setup a quick-and-dirty patch basically postponing process_msg (hence
> daemon_collective) until the launch is done. In process_msg, I therefore
> requeue a process_msg handler and return.
That
Thank you, Ralph, for this valuable help.
I set up a quick-and-dirty patch basically postponing process_msg (and hence
daemon_collective) until the launch is done. In process_msg, I therefore
requeue a process_msg handler and return.
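The requeue pattern described here looks roughly like this; the helper names
are illustrative stand-ins, not the actual ORTE types or functions:

/* Sketch of the "requeue and return" idea: if the local launch has not
 * finished, the message handler re-posts itself and returns instead of
 * blocking, so the progress loop keeps turning.  All names below are
 * illustrative stand-ins, not the actual ORTE API. */
#include <stdbool.h>

typedef struct msg msg_t;

/* Assumed to be provided by the surrounding event framework. */
extern bool launch_is_complete(void);
extern void requeue_handler(void (*handler)(msg_t *), msg_t *m);
extern void do_daemon_collective(msg_t *m);

static void process_msg_sketch(msg_t *m)
{
    if (!launch_is_complete()) {
        /* Too early for the collective: defer by re-posting this same
         * handler and returning control to the event loop. */
        requeue_handler(process_msg_sketch, m);
        return;
    }
    do_daemon_collective(m);
}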
In this "all-must-be-non-blocking-and-done-through-opal_progress"
Very strange. As I said, we routinely launch jobs spanning several hundred
nodes without problem. You can see the platform files for that setup in
contrib/platform/lanl/tlcc
That said, it is always possible you are hitting some kind of race condition we
don't hit. In looking at the code, one po
I would say I use the default settings, i.e. I don't set anything
"special" at configure.
I'm launching my processes with SLURM (salloc + mpirun).
Sylvain
On Wed, 18 Nov 2009, Ralph Castain wrote:
How did you configure OMPI?
What launch mechanism are you using - ssh?
On Nov 17, 2009, at 9:
How did you configure OMPI?
What launch mechanism are you using - ssh?
On Nov 17, 2009, at 9:01 AM, Sylvain Jeaugey wrote:
> I don't think so, and I'm not doing it explicitly, at least. How do I know?
>
> Sylvain
>
> On Tue, 17 Nov 2009, Ralph Castain wrote:
>
>> We routinely launch across t
I don't think so, and I'm not doing it explicitly, at least. How do I know?
Sylvain
On Tue, 17 Nov 2009, Ralph Castain wrote:
We routinely launch across thousands of nodes without a problem...I have never
seen it stick in this fashion.
Did you build and/or are using ORTE threaded by any ch
We routinely launch across thousands of nodes without a problem... I have never
seen it stick in this fashion.
Did you build and/or are you using ORTE threaded by any chance? If so, that
definitely won't work.
On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote:
> Hi all,
>
> We are currently exper